[jira] [Updated] (SPARK-17441) Issue Exceptions when ALTER TABLE RENAME PARTITION tries to alter a data source table

2016-09-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17441:

Assignee: Xiao Li

> Issue Exceptions when ALTER TABLE RENAME PARTITION tries to alter a data 
> source table
> -
>
> Key: SPARK-17441
> URL: https://issues.apache.org/jira/browse/SPARK-17441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.0
>
>
> `ALTER TABLE RENAME PARTITION` is unable to handle data source tables, just 
> like the other `ALTER PARTITION` commands. We should issue an exception 
> instead. 






[jira] [Resolved] (SPARK-17440) Issue Exception when ALTER TABLE commands try to alter a VIEW

2016-09-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17440.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15004
[https://github.com/apache/spark/pull/15004]

> Issue Exception when ALTER TABLE commands try to alter a VIEW
> -
>
> Key: SPARK-17440
> URL: https://issues.apache.org/jira/browse/SPARK-17440
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.0
>
>
> For the following `ALTER TABLE` DDL, we should issue an exception when the 
> target table is a `VIEW`:
> {code}
>  ALTER TABLE viewName SET LOCATION '/path/to/your/lovely/heart'
>  ALTER TABLE viewName SET SERDE 'whatever'
>  ALTER TABLE viewName SET SERDEPROPERTIES ('x' = 'y')
>  ALTER TABLE viewName PARTITION (a=1, b=2) SET SERDEPROPERTIES ('x' = 'y')
>  ALTER TABLE viewName ADD IF NOT EXISTS PARTITION (a='4', b='8')
>  ALTER TABLE viewName DROP IF EXISTS PARTITION (a='2')
>  ALTER TABLE viewName RECOVER PARTITIONS
>  ALTER TABLE viewName PARTITION (a='1', b='q') RENAME TO PARTITION (a='100', 
> b='p')
> {code}
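
For context, a minimal sketch of the kind of guard such commands can apply, modeled on Spark's catalog types (the helper name is illustrative; this is not the exact code from the patch, and Spark's own commands would raise an AnalysisException rather than the plain exception used here so the sketch compiles outside the org.apache.spark.sql package):

{code}
// Sketch only: reject ALTER TABLE variants when the target is a view.
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.{CatalogTableType, SessionCatalog}

def assertNotView(catalog: SessionCatalog, ident: TableIdentifier, op: String): Unit = {
  // Look up the catalog entry and refuse to proceed if it is a VIEW.
  if (catalog.getTableMetadata(ident).tableType == CatalogTableType.VIEW) {
    throw new UnsupportedOperationException(s"$op is not allowed on a view: $ident")
  }
}
{code}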






[jira] [Resolved] (SPARK-17441) Issue Exceptions when ALTER TABLE RENAME PARTITION tries to alter a data source table

2016-09-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-17441.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15004
[https://github.com/apache/spark/pull/15004]

> Issue Exceptions when ALTER TABLE RENAME PARTITION tries to alter a data 
> source table
> -
>
> Key: SPARK-17441
> URL: https://issues.apache.org/jira/browse/SPARK-17441
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.1.0
>
>
> `ALTER TABLE RENAME PARTITION` is unable to handle data source tables, just 
> like the other `ALTER PARTITION` commands. We should issue an exception 
> instead. 






[jira] [Updated] (SPARK-17440) Issue Exception when ALTER TABLE commands try to alter a VIEW

2016-09-14 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-17440:

Assignee: Xiao Li

> Issue Exception when ALTER TABLE commands try to alter a VIEW
> -
>
> Key: SPARK-17440
> URL: https://issues.apache.org/jira/browse/SPARK-17440
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>
> For the following `ALTER TABLE` DDL, we should issue an exception when the 
> target table is a `VIEW`:
> {code}
>  ALTER TABLE viewName SET LOCATION '/path/to/your/lovely/heart'
>  ALTER TABLE viewName SET SERDE 'whatever'
>  ALTER TABLE viewName SET SERDEPROPERTIES ('x' = 'y')
>  ALTER TABLE viewName PARTITION (a=1, b=2) SET SERDEPROPERTIES ('x' = 'y')
>  ALTER TABLE viewName ADD IF NOT EXISTS PARTITION (a='4', b='8')
>  ALTER TABLE viewName DROP IF EXISTS PARTITION (a='2')
>  ALTER TABLE viewName RECOVER PARTITIONS
>  ALTER TABLE viewName PARTITION (a='1', b='q') RENAME TO PARTITION (a='100', 
> b='p')
> {code}






[jira] [Commented] (SPARK-17543) Missing log4j config file for tests in common/network-shuffle

2016-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15492494#comment-15492494
 ] 

Apache Spark commented on SPARK-17543:
--

User 'jagadeesanas2' has created a pull request for this issue:
https://github.com/apache/spark/pull/15108

> Missing log4j config file for tests in common/network-shuffle
> -
>
> Key: SPARK-17543
> URL: https://issues.apache.org/jira/browse/SPARK-17543
> Project: Spark
>  Issue Type: Bug
>Reporter: Frederick Reiss
>Priority: Trivial
>  Labels: starter
>
> *This is a small starter task to help new contributors practice the pull 
> request and code review process.*
> The Maven module {{common/network-shuffle}} does not have a log4j 
> configuration file for its test cases. Usually these configuration files are 
> located inside each module, in the directory {{src/test/resources}}. The 
> missing configuration file leads to a scary-looking but harmless series of 
> errors and stack traces in Spark build logs:
> {noformat}
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> log4j:ERROR Could not read configuration file from URL 
> [file:src/test/resources/log4j.properties].
> java.io.FileNotFoundException: src/test/resources/log4j.properties (No such 
> file or directory)
> at java.io.FileInputStream.open(Native Method)
> at java.io.FileInputStream.(FileInputStream.java:146)
> at java.io.FileInputStream.(FileInputStream.java:101)
> at 
> sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
> at 
> sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
> at 
> org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
> at 
> org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
> at org.apache.log4j.LogManager.(LogManager.java:127)
> at org.apache.log4j.Logger.getLogger(Logger.java:104)
> at 
> io.netty.util.internal.logging.Log4JLoggerFactory.newInstance(Log4JLoggerFactory.java:29)
> at 
> io.netty.util.internal.logging.InternalLoggerFactory.newDefaultFactory(InternalLoggerFactory.java:46)
> at 
> io.netty.util.internal.logging.InternalLoggerFactory.(InternalLoggerFactory.java:34)
> ...
> {noformat}
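
For reference, other Spark modules keep a test-only configuration at {{src/test/resources/log4j.properties}}; a minimal sketch of such a file (illustrative, not necessarily the exact contents the PR adds) looks like:

{code}
# Sketch of a test-only log4j configuration (illustrative): send test logging
# to a file under target/ so console output from the build stays readable.
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.append=true
log4j.appender.file.file=target/unit-tests.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n
{code}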






[jira] [Assigned] (SPARK-17543) Missing log4j config file for tests in common/network-shuffle

2016-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17543:


Assignee: (was: Apache Spark)

> Missing log4j config file for tests in common/network-shuffle
> -
>
> Key: SPARK-17543
> URL: https://issues.apache.org/jira/browse/SPARK-17543
> Project: Spark
>  Issue Type: Bug
>Reporter: Frederick Reiss
>Priority: Trivial
>  Labels: starter
>
> *This is a small starter task to help new contributors practice the pull 
> request and code review process.*
> The Maven module {{common/network-shuffle}} does not have a log4j 
> configuration file for its test cases. Usually these configuration files are 
> located inside each module, in the directory {{src/test/resources}}. The 
> missing configuration file leads to a scary-looking but harmless series of 
> errors and stack traces in Spark build logs:
> {noformat}
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> log4j:ERROR Could not read configuration file from URL 
> [file:src/test/resources/log4j.properties].
> java.io.FileNotFoundException: src/test/resources/log4j.properties (No such 
> file or directory)
> at java.io.FileInputStream.open(Native Method)
> at java.io.FileInputStream.(FileInputStream.java:146)
> at java.io.FileInputStream.(FileInputStream.java:101)
> at 
> sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
> at 
> sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
> at 
> org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
> at 
> org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
> at org.apache.log4j.LogManager.(LogManager.java:127)
> at org.apache.log4j.Logger.getLogger(Logger.java:104)
> at 
> io.netty.util.internal.logging.Log4JLoggerFactory.newInstance(Log4JLoggerFactory.java:29)
> at 
> io.netty.util.internal.logging.InternalLoggerFactory.newDefaultFactory(InternalLoggerFactory.java:46)
> at 
> io.netty.util.internal.logging.InternalLoggerFactory.(InternalLoggerFactory.java:34)
> ...
> {noformat}






[jira] [Assigned] (SPARK-17543) Missing log4j config file for tests in common/network-shuffle

2016-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17543:


Assignee: Apache Spark

> Missing log4j config file for tests in common/network-shuffle
> -
>
> Key: SPARK-17543
> URL: https://issues.apache.org/jira/browse/SPARK-17543
> Project: Spark
>  Issue Type: Bug
>Reporter: Frederick Reiss
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: starter
>
> *This is a small starter task to help new contributors practice the pull 
> request and code review process.*
> The Maven module {{common/network-shuffle}} does not have a log4j 
> configuration file for its test cases. Usually these configuration files are 
> located inside each module, in the directory {{src/test/resources}}. The 
> missing configuration file leads to a scary-looking but harmless series of 
> errors and stack traces in Spark build logs:
> {noformat}
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> log4j:ERROR Could not read configuration file from URL 
> [file:src/test/resources/log4j.properties].
> java.io.FileNotFoundException: src/test/resources/log4j.properties (No such 
> file or directory)
> at java.io.FileInputStream.open(Native Method)
> at java.io.FileInputStream.(FileInputStream.java:146)
> at java.io.FileInputStream.(FileInputStream.java:101)
> at 
> sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
> at 
> sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
> at 
> org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
> at 
> org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
> at org.apache.log4j.LogManager.(LogManager.java:127)
> at org.apache.log4j.Logger.getLogger(Logger.java:104)
> at 
> io.netty.util.internal.logging.Log4JLoggerFactory.newInstance(Log4JLoggerFactory.java:29)
> at 
> io.netty.util.internal.logging.InternalLoggerFactory.newDefaultFactory(InternalLoggerFactory.java:46)
> at 
> io.netty.util.internal.logging.InternalLoggerFactory.(InternalLoggerFactory.java:34)
> ...
> {noformat}






[jira] [Commented] (SPARK-17537) Improve performance for reading parquet schema

2016-09-14 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15492475#comment-15492475
 ] 

Hyukjin Kwon commented on SPARK-17537:
--

This should be a duplicate of SPARK-17071.


> Improve performance for reading parquet schema
> --
>
> Key: SPARK-17537
> URL: https://issues.apache.org/jira/browse/SPARK-17537
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Yang Wang
>Priority: Minor
>
> spark.read.parquet issues a Spark job to read the Parquet schema. When 
> spark.sql.parquet.mergeSchema is set to false (the default value), there is 
> often only one file to read, so there is no need to issue a Spark job at all; 
> in this case we can read the schema directly on the driver. This could reduce 
> the schema-inference latency from several hundred milliseconds to around ten 
> milliseconds in my environment.
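
As a usage illustration of the setting discussed above (the path is hypothetical and {{spark}} is the usual shell SparkSession):

{code}
// Illustration only: with schema merging disabled (the default), schema
// inference only needs to read a single Parquet footer.
spark.conf.set("spark.sql.parquet.mergeSchema", "false")
val df = spark.read
  .option("mergeSchema", "false")    // per-read override of the same setting
  .parquet("/path/to/parquet/table") // schema inference happens here
{code}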






[jira] [Commented] (SPARK-17543) Missing log4j config file for tests in common/network-shuffle

2016-09-14 Thread Jagadeesan A S (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15492406#comment-15492406
 ] 

Jagadeesan A S commented on SPARK-17543:


Started working on this.

> Missing log4j config file for tests in common/network-shuffle
> -
>
> Key: SPARK-17543
> URL: https://issues.apache.org/jira/browse/SPARK-17543
> Project: Spark
>  Issue Type: Bug
>Reporter: Frederick Reiss
>Priority: Trivial
>  Labels: starter
>
> *This is a small starter task to help new contributors practice the pull 
> request and code review process.*
> The Maven module {{common/network-shuffle}} does not have a log4j 
> configuration file for its test cases. Usually these configuration files are 
> located inside each module, in the directory {{src/test/resources}}. The 
> missing configuration file leads to a scary-looking but harmless series of 
> errors and stack traces in Spark build logs:
> {noformat}
> SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
> SLF4J: Defaulting to no-operation (NOP) logger implementation
> SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
> details.
> log4j:ERROR Could not read configuration file from URL 
> [file:src/test/resources/log4j.properties].
> java.io.FileNotFoundException: src/test/resources/log4j.properties (No such 
> file or directory)
> at java.io.FileInputStream.open(Native Method)
> at java.io.FileInputStream.(FileInputStream.java:146)
> at java.io.FileInputStream.(FileInputStream.java:101)
> at 
> sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
> at 
> sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
> at 
> org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
> at 
> org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
> at org.apache.log4j.LogManager.(LogManager.java:127)
> at org.apache.log4j.Logger.getLogger(Logger.java:104)
> at 
> io.netty.util.internal.logging.Log4JLoggerFactory.newInstance(Log4JLoggerFactory.java:29)
> at 
> io.netty.util.internal.logging.InternalLoggerFactory.newDefaultFactory(InternalLoggerFactory.java:46)
> at 
> io.netty.util.internal.logging.InternalLoggerFactory.(InternalLoggerFactory.java:34)
> ...
> {noformat}






[jira] [Assigned] (SPARK-17551) support null ordering for DataFrame API

2016-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17551:


Assignee: Apache Spark

> support null ordering for DataFrame API
> ---
>
> Key: SPARK-17551
> URL: https://issues.apache.org/jira/browse/SPARK-17551
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xin Wu
>Assignee: Apache Spark
>
> SPARK-10747 added support for NULLS FIRST | LAST in the ORDER BY clause of 
> the SQL interface. This JIRA is to complete the feature by adding the same 
> support to the DataFrame/Dataset APIs.






[jira] [Commented] (SPARK-17551) support null ordering for DataFrame API

2016-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15492342#comment-15492342
 ] 

Apache Spark commented on SPARK-17551:
--

User 'xwu0226' has created a pull request for this issue:
https://github.com/apache/spark/pull/15107

> support null ordering for DataFrame API
> ---
>
> Key: SPARK-17551
> URL: https://issues.apache.org/jira/browse/SPARK-17551
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xin Wu
>
> SPARK-10747 added support for NULLS FIRST | LAST in the ORDER BY clause of 
> the SQL interface. This JIRA is to complete the feature by adding the same 
> support to the DataFrame/Dataset APIs.






[jira] [Assigned] (SPARK-17551) support null ordering for DataFrame API

2016-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17551:


Assignee: (was: Apache Spark)

> support null ordering for DataFrame API
> ---
>
> Key: SPARK-17551
> URL: https://issues.apache.org/jira/browse/SPARK-17551
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Xin Wu
>
> SPARK-10747 added support for NULLS FIRST | LAST in the ORDER BY clause of 
> the SQL interface. This JIRA is to complete the feature by adding the same 
> support to the DataFrame/Dataset APIs.






[jira] [Created] (SPARK-17551) support null ordering for DataFrame API

2016-09-14 Thread Xin Wu (JIRA)
Xin Wu created SPARK-17551:
--

 Summary: support null ordering for DataFrame API
 Key: SPARK-17551
 URL: https://issues.apache.org/jira/browse/SPARK-17551
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.1.0
Reporter: Xin Wu


SPARK-10747 added support for NULLS FIRST | LAST in the ORDER BY clause of the SQL 
interface. This JIRA is to complete the feature by adding the same support to the 
DataFrame/Dataset APIs.
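
For context, a brief sketch of both forms (the SQL syntax from SPARK-10747 is existing behavior; the DataFrame-side method name below only illustrates the kind of API this JIRA proposes, and the {{people}} table is hypothetical):

{code}
// SQL side, already supported via SPARK-10747: explicit null ordering.
spark.sql("SELECT name, age FROM people ORDER BY age ASC NULLS LAST")

// DataFrame/Dataset side, the gap this JIRA wants to close
// (method name illustrative of the proposal, not a settled API).
import org.apache.spark.sql.functions.col
val sorted = spark.table("people").orderBy(col("age").asc_nulls_last)
{code}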






[jira] [Issue Comment Deleted] (SPARK-17535) Performance Improvement of Singleton pattern in SparkContext

2016-09-14 Thread WangJianfei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

WangJianfei updated SPARK-17535:

Comment: was deleted

(was: First: this logic and code are simple too.
Second: since JDK 1.5, volatile can avoid the situation you describe. We can't get 
into the strange situation where the reference is visible but the 
synchronized init hasn't completed, because volatile tells the JVM not to 
reorder instructions around the init of sparkContext.)

> Performance Improvement of Singleton pattern in SparkContext
> 
>
> Key: SPARK-17535
> URL: https://issues.apache.org/jira/browse/SPARK-17535
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: WangJianfei
>  Labels: easyfix, performance
>
> I think the singleton pattern of SparkContext is inefficient if there are 
> many requests to get the SparkContext, so we can write the singleton pattern 
> as below. The second way is more efficient when there are many requests to 
> get the SparkContext.
> {code}
>  // the current version
>   def getOrCreate(): SparkContext = {
> SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized {
>   if (activeContext.get() == null) {
> setActiveContext(new SparkContext(), allowMultipleContexts = false)
>   }
>   activeContext.get()
> }
>   }
>   // by myself
>   def getOrCreate(): SparkContext = {
> if (activeContext.get() == null) {
>   SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized {
> if(activeContext.get == null) {
>   @volatile val sparkContext = new SparkContext()
>   setActiveContext(sparkContext, allowMultipleContexts = false)
> }
>   }
>   activeContext.get()
> }
>   }
> {code}






[jira] [Commented] (SPARK-17535) Performance Improvement of Singleton pattern in SparkContext

2016-09-14 Thread WangJianfei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15492084#comment-15492084
 ] 

WangJianfei commented on SPARK-17535:
-

First: this logic and code are simple too.
Second: since JDK 1.5, volatile can avoid the situation you describe. We can't get 
into the strange situation where the reference is visible but the 
synchronized init hasn't completed, because volatile tells the JVM not to 
reorder instructions around the init of sparkContext.

> Performance Improvement of Singleton pattern in SparkContext
> 
>
> Key: SPARK-17535
> URL: https://issues.apache.org/jira/browse/SPARK-17535
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: WangJianfei
>  Labels: easyfix, performance
>
> I think the singleton pattern of SparkContext is inefficient if there are 
> many requests to get the SparkContext, so we can write the singleton pattern 
> as below. The second way is more efficient when there are many requests to 
> get the SparkContext.
> {code}
>  // the current version
>   def getOrCreate(): SparkContext = {
> SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized {
>   if (activeContext.get() == null) {
> setActiveContext(new SparkContext(), allowMultipleContexts = false)
>   }
>   activeContext.get()
> }
>   }
>   // by myself
>   def getOrCreate(): SparkContext = {
> if (activeContext.get() == null) {
>   SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized {
> if(activeContext.get == null) {
>   @volatile val sparkContext = new SparkContext()
>   setActiveContext(sparkContext, allowMultipleContexts = false)
> }
>   }
>   activeContext.get()
> }
>   }
> {code}
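
For readers following the thread, a minimal sketch of the double-checked locking shape being debated (illustrative only, not SparkContext's actual code): the visibility guarantee has to come from the shared field (in Spark, {{activeContext}} is an {{AtomicReference}}, whose reads and writes already have volatile semantics), not from annotating a local val.

{code}
// Sketch of double-checked locking over an AtomicReference (illustrative).
// The fast path skips the lock once the instance is published; publication
// safety comes from the AtomicReference field itself.
import java.util.concurrent.atomic.AtomicReference

class Expensive                                  // stand-in for SparkContext
object Holder {
  private val lock = new Object
  private val active = new AtomicReference[Expensive](null)

  def getOrCreate(): Expensive = {
    val existing = active.get()                  // volatile-read semantics
    if (existing != null) existing               // fast path: no locking
    else lock.synchronized {
      if (active.get() == null) {                // re-check under the lock
        active.set(new Expensive)                // construct, then publish
      }
      active.get()
    }
  }
}
{code}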






[jira] [Comment Edited] (SPARK-17535) Performance Improvement of Singleton pattern in SparkContext

2016-09-14 Thread WangJianfei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15492046#comment-15492046
 ] 

WangJianfei edited comment on SPARK-17535 at 9/15/16 2:10 AM:
--

First: this logic and code are simple too.
Second: since JDK 1.5, volatile can avoid the situation you describe. We can't get 
into the strange situation where the reference is visible but the 
synchronized init hasn't completed, because volatile tells the JVM not to 
reorder instructions around the init of sparkContext.


was (Author: codlife):
Since JDK 1.5, volatile can avoid the situation you describe. We can't get into 
the strange situation where the reference is visible but the synchronized 
init hasn't completed, because volatile tells the JVM not to reorder 
instructions around the init of sparkContext.

> Performance Improvement of Singleton pattern in SparkContext
> 
>
> Key: SPARK-17535
> URL: https://issues.apache.org/jira/browse/SPARK-17535
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: WangJianfei
>  Labels: easyfix, performance
>
> I think the singleton pattern of SparkContext is inefficient if there are 
> many requests to get the SparkContext, so we can write the singleton pattern 
> as below. The second way is more efficient when there are many requests to 
> get the SparkContext.
> {code}
>  // the current version
>   def getOrCreate(): SparkContext = {
> SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized {
>   if (activeContext.get() == null) {
> setActiveContext(new SparkContext(), allowMultipleContexts = false)
>   }
>   activeContext.get()
> }
>   }
>   // by myself
>   def getOrCreate(): SparkContext = {
> if (activeContext.get() == null) {
>   SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized {
> if(activeContext.get == null) {
>   @volatile val sparkContext = new SparkContext()
>   setActiveContext(sparkContext, allowMultipleContexts = false)
> }
>   }
>   activeContext.get()
> }
>   }
> {code}






[jira] [Issue Comment Deleted] (SPARK-17535) Performance Improvement of Singleton pattern in SparkContext

2016-09-14 Thread WangJianfei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

WangJianfei updated SPARK-17535:

Comment: was deleted

(was: Since JDK 1.5, volatile can avoid the situation you describe. We can't get 
into the strange situation where the reference is visible but the 
synchronized init hasn't completed, because volatile tells the JVM not to 
reorder instructions around the init of sparkContext.)

> Performance Improvement of Singleton pattern in SparkContext
> 
>
> Key: SPARK-17535
> URL: https://issues.apache.org/jira/browse/SPARK-17535
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: WangJianfei
>  Labels: easyfix, performance
>
> I think the singleton pattern of SparkContext is inefficient if there are 
> many requests to get the SparkContext, so we can write the singleton pattern 
> as below. The second way is more efficient when there are many requests to 
> get the SparkContext.
> {code}
>  // the current version
>   def getOrCreate(): SparkContext = {
> SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized {
>   if (activeContext.get() == null) {
> setActiveContext(new SparkContext(), allowMultipleContexts = false)
>   }
>   activeContext.get()
> }
>   }
>   // by myself
>   def getOrCreate(): SparkContext = {
> if (activeContext.get() == null) {
>   SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized {
> if(activeContext.get == null) {
>   @volatile val sparkContext = new SparkContext()
>   setActiveContext(sparkContext, allowMultipleContexts = false)
> }
>   }
>   activeContext.get()
> }
>   }
> {code}






[jira] [Commented] (SPARK-17535) Performance Improvement of Singleton pattern in SparkContext

2016-09-14 Thread WangJianfei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15492046#comment-15492046
 ] 

WangJianfei commented on SPARK-17535:
-

Since JDK 1.5, volatile can avoid the situation you describe. We can't get into 
the strange situation where the reference is visible but the synchronized 
init hasn't completed, because volatile tells the JVM not to reorder 
instructions around the init of sparkContext.

> Performance Improvement of Singleton pattern in SparkContext
> 
>
> Key: SPARK-17535
> URL: https://issues.apache.org/jira/browse/SPARK-17535
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: WangJianfei
>  Labels: easyfix, performance
>
> I think the singleton pattern of SparkContext is inefficient if there are 
> many requests to get the SparkContext, so we can write the singleton pattern 
> as below. The second way is more efficient when there are many requests to 
> get the SparkContext.
> {code}
>  // the current version
>   def getOrCreate(): SparkContext = {
> SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized {
>   if (activeContext.get() == null) {
> setActiveContext(new SparkContext(), allowMultipleContexts = false)
>   }
>   activeContext.get()
> }
>   }
>   // by myself
>   def getOrCreate(): SparkContext = {
> if (activeContext.get() == null) {
>   SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized {
> if(activeContext.get == null) {
>   @volatile val sparkContext = new SparkContext()
>   setActiveContext(sparkContext, allowMultipleContexts = false)
> }
>   }
>   activeContext.get()
> }
>   }
> {code}






[jira] [Commented] (SPARK-17535) Performance Improvement of Singleton pattern in SparkContext

2016-09-14 Thread WangJianfei (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15492047#comment-15492047
 ] 

WangJianfei commented on SPARK-17535:
-

Since JDK 1.5, volatile can avoid the situation you describe. We can't get into 
the strange situation where the reference is visible but the synchronized 
init hasn't completed, because volatile tells the JVM not to reorder 
instructions around the init of sparkContext.

> Performance Improvement of Singleton pattern in SparkContext
> 
>
> Key: SPARK-17535
> URL: https://issues.apache.org/jira/browse/SPARK-17535
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: WangJianfei
>  Labels: easyfix, performance
>
> I think the singleton pattern of SparkContext is inefficient if there are 
> many requests to get the SparkContext, so we can write the singleton pattern 
> as below. The second way is more efficient when there are many requests to 
> get the SparkContext.
> {code}
>  // the current version
>   def getOrCreate(): SparkContext = {
> SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized {
>   if (activeContext.get() == null) {
> setActiveContext(new SparkContext(), allowMultipleContexts = false)
>   }
>   activeContext.get()
> }
>   }
>   // by myself
>   def getOrCreate(): SparkContext = {
> if (activeContext.get() == null) {
>   SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized {
> if(activeContext.get == null) {
>   @volatile val sparkContext = new SparkContext()
>   setActiveContext(sparkContext, allowMultipleContexts = false)
> }
>   }
>   activeContext.get()
> }
>   }
> {code}






[jira] [Commented] (SPARK-15573) Backwards-compatible persistence for spark.ml

2016-09-14 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491974#comment-15491974
 ] 

yuhao yang commented on SPARK-15573:


This sounds feasible. Two primary work items as I see it:
1. Find a standardized way to save models for different versions.
2. Ensure model-loading correctness for all versions (parameter values might 
change across versions).

> Backwards-compatible persistence for spark.ml
> -
>
> Key: SPARK-15573
> URL: https://issues.apache.org/jira/browse/SPARK-15573
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This JIRA is for imposing backwards-compatible persistence for the 
> DataFrames-based API for MLlib.  I.e., we want to be able to load models 
> saved in previous versions of Spark.  We will not require loading models 
> saved in later versions of Spark.
> This requires:
> * Putting unit tests in place to check loading models from previous versions
> * Notifying all committers active on MLlib to be aware of this requirement in 
> the future
> The unit tests could be written as in spark.mllib, where we essentially 
> copied and pasted the save() code every time it changed.  This happens 
> rarely, so it should be acceptable, though other designs are fine.
> Subtasks of this JIRA should cover checking and adding tests for existing 
> cases, such as KMeansModel (whose format changed between 1.6 and 2.0).
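
A sketch of what such a version-compatibility test might look like (the fixture path is hypothetical, and in practice each persisted model type would get a similar check):

{code}
// Hypothetical backwards-compatibility check: load a model persisted by an
// older Spark release from a checked-in fixture and sanity-check it.
import org.apache.spark.ml.clustering.KMeansModel

val legacyPath = "src/test/resources/ml-models/1.6.0/kmeans"  // hypothetical fixture
val model = KMeansModel.load(legacyPath)
assert(model.clusterCenters.nonEmpty)
{code}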






[jira] [Commented] (SPARK-17544) Timeout waiting for connection from pool, DataFrame Reader's not closing S3 connections?

2016-09-14 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491938#comment-15491938
 ] 

Josh Rosen commented on SPARK-17544:


I found a similar issue from the {{spark-avro}} repository: 
https://github.com/databricks/spark-avro/issues/156



> Timeout waiting for connection from pool, DataFrame Reader's not closing S3 
> connections?
> 
>
> Key: SPARK-17544
> URL: https://issues.apache.org/jira/browse/SPARK-17544
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Amazon EMR, S3, Scala
>Reporter: Brady Auen
>  Labels: newbie
>
> I have an application that loops through a text file to find files in S3 and 
> then reads them in, performs some ETL processes, and then writes them out. 
> This works for around 80 loops until I get this:  
>   
> 16/09/14 18:58:23 INFO S3NativeFileSystem: Opening 
> 's3://webpt-emr-us-west-2-hadoop/edw/master/fdbk_ideavote/20160907/part-m-0.avro'
>  for reading
> 16/09/14 18:59:13 INFO AmazonHttpClient: Unable to execute HTTP request: 
> Timeout waiting for connection from pool
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException:
>  Timeout waiting for connection from pool
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195)
>   at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.$Proxy37.getConnection(Unknown
>  Source)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:837)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991)
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:212)
>   at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy36.retrieveMetadata(Unknown Source)
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:780)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1428)
>   at 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:313)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:289)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolve

[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error

2016-09-14 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491841#comment-15491841
 ] 

Bryan Cutler commented on SPARK-17508:
--

To respond to [~srowen]'s question, I think it's reasonable to have {{None}} act 
the same as not setting the param. I couldn't find any current params where 
this would cause a problem.

> Setting weightCol to None in ML library causes an error
> ---
>
> Key: SPARK-17508
> URL: https://issues.apache.org/jira/browse/SPARK-17508
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Evan Zamir
>Priority: Minor
>
> The following code runs without error:
> {code}
> spark = SparkSession.builder.appName('WeightBug').getOrCreate()
> df = spark.createDataFrame(
> [
> (1.0, 1.0, Vectors.dense(1.0)),
> (0.0, 1.0, Vectors.dense(-1.0))
> ],
> ["label", "weight", "features"])
> lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight")
> model = lr.fit(df)
> {code}
> My expectation from reading the documentation is that setting weightCol=None 
> should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors:
> Traceback (most recent call last):
>   File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in 
> 
> model = lr.fit(df)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line 
> 64, in fit
> return self._fit(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 213, in _fit
> java_model = self._fit_java(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
> return self._java_obj.fit(dataset._jdf)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>  line 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit.
> : java.lang.NullPointerException
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:211)
>   at java.lang.Thread.run(Thread.java:745)
> Process finished with exit code 1






[jira] [Updated] (SPARK-17549) InMemoryRelation doesn't scale to large tables

2016-09-14 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-17549:
---
Attachment: spark-2.0.patch

Attaching a Spark 2 patch that silences the error (looks like a Janino bug and 
is in an area that should not affect functionality from what I understand).

It'd be great if someone more familiar with this area could take a look at 
whether my changes are sane, before I go out and file a PR.
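
For context, a minimal sketch of the idea behind the attached patches (illustrative only, not the actual InMemoryRelation code): keep one running total on the driver instead of one stats row per partition.

{code}
// Illustrative only: track cached batch sizes with a single long accumulator
// so driver memory stays constant regardless of how many partitions are cached.
import org.apache.spark.SparkContext
import org.apache.spark.util.LongAccumulator

def batchSizeAccumulator(sc: SparkContext): LongAccumulator =
  sc.longAccumulator("cachedBatchBytes")

// Executors would call acc.add(batchSizeInBytes) per cached batch; the driver
// then reads acc.value, a single Long, instead of holding per-partition rows.
{code}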

> InMemoryRelation doesn't scale to large tables
> --
>
> Key: SPARK-17549
> URL: https://issues.apache.org/jira/browse/SPARK-17549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: create_parquet.scala, example_1.6_post_patch.png, 
> example_1.6_pre_patch.png, spark-1.6-2.patch, spark-1.6.patch, spark-2.0.patch
>
>
> An {{InMemoryRelation}} is created when you cache a table; but if the table 
> is large, defined by either having a really large amount of columns, or a 
> really large amount of partitions (in the file split sense, not the "table 
> partition" sense), or both, it causes an immense amount of memory to be used 
> in the driver.
> The reason is that it uses an accumulator to collect statistics about each 
> partition, and instead of summarizing the data in the driver, it keeps *all* 
> entries in memory.
> I'm attaching a script I used to create a parquet file with 20,000 columns 
> and a single row, which I then copied 500 times so I'd have 500 partitions.
> When doing the following:
> {code}
> sqlContext.read.parquet(...).count()
> {code}
> Everything works fine, both in Spark 1.6 and 2.0. (It's super slow with the 
> settings I used, but it works.)
> I ran spark-shell like this:
> {code}
> ./bin/spark-shell --master 'local-cluster[4,1,4096]' --driver-memory 2g 
> --conf spark.executor.memory=2g
> {code}
> And ran:
> {code}
> sqlContext.read.parquet(...).cache().count()
> {code}
> You'll see the results in screenshot {{example_1.6_pre_patch.png}}. After 40 
> partitions were processed, there were 40 GenericInternalRow objects with
> 100,000 items each (5 stat info fields * 20,000 columns). So, memory usage 
> was:
> {code}
>   40 * 100,000 * (4 * 20 + 24) = 416,000,000 =~ 400MB
> {code}
> (Note: Integer = 20 bytes, Long = 24 bytes.)
> If I waited until the end, there would be 500 partitions, so ~ 5GB of memory 
> to hold the stats.
> I'm also attaching a patch I made on top of 1.6 that uses just a long 
> accumulator to capture the table size; with that patch memory usage on the 
> driver doesn't keep growing. Also note in the patch that I'm multiplying the 
> column size by the row count, which I think is a different bug in the 
> existing code (those stats should be for the whole batch, not just a single 
> row, right?). I also added {{example_1.6_post_patch.png}} to show the 
> {{InMemoryRelation}} with the patch.
> I also applied a very similar patch on top of Spark 2.0. But there things 
> blow up even more spectacularly when I try to run the count on the cached 
> table. It starts with this error:
> {noformat}
> 14:19:43 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, 
> vanzin-st1-3.gce.cloudera.com): java.util.concurrent.ExecutionException: 
> java.lang.Exception: failed to compile: java.lang.IndexOutOfBoundsException: 
> Index: 63235, Size: 1
> (lots of generated code here...)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 63235, Size: 1
>   at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>   at java.util.ArrayList.get(ArrayList.java:411)
>   at 
> org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556)
>   at 
> org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572)
>   at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513)
>   at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644)
>   at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623)
>   at org.codehaus.janino.util.ClassFile.(ClassFile.java:280)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:913)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:911)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:911)
>   at 
> org.apac

[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders

2016-09-14 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491819#comment-15491819
 ] 

Michael Armbrust commented on SPARK-16407:
--

I think it is likely that we will want to make it easy for users to instantiate 
Sources / Sinks programmatically.  I have two concerns with merging this 
particular PR at this time:
 - The source/sink APIs are in a lot of flux.  The fact that they are not 
labeled experimental in 2.0.0 was an unfortunate oversight that we'll correct 
in 2.0.1.  For example, [~freiss] pointed out that it was impossible to know 
when to flush state from the source.  Another example is that we'll probably 
have to make changes to allow for rate limiting.  Until we have a better idea of 
what the final API will look like, I would not want to expose it in end-user 
APIs like DataStreamWriter.
 - When we added the {{foreach}} API, it was a conscious choice to avoid 
exposing dataframes / batches to the user.  The idea is that we could adapt code 
to a non-microbatch model eventually if we wanted.  This API would break that.  
It's not out of the question, but that's a larger shift.

> Allow users to supply custom StreamSinkProviders
> 
>
> Key: SPARK-16407
> URL: https://issues.apache.org/jira/browse/SPARK-16407
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: holdenk
>
> The current DataStreamWriter allows users to specify a class name as format, 
> however it could be easier for people to directly pass in a specific provider 
> instance - e.g. for user equivalent of ForeachSink or other sink with 
> non-string parameters.
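
For comparison, the string-based route that exists today is shown below (the provider class name and checkpoint path are hypothetical, and {{df}} is assumed to be a streaming DataFrame); the proposal under discussion would accept a provider instance instead of a class name.

{code}
// Today's string-based hook: name a StreamSinkProvider implementation by class.
// "com.example.MySinkProvider" and the checkpoint path are hypothetical.
val query = df.writeStream
  .format("com.example.MySinkProvider")
  .option("checkpointLocation", "/tmp/checkpoints")
  .start()
{code}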






[jira] [Updated] (SPARK-17549) InMemoryRelation doesn't scale to large tables

2016-09-14 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-17549:
---
Attachment: spark-1.6-2.patch

Just noticed there's already a more accurate count of the batch size in the 
code, so uploading an updated patch.

I'll try to figure out what's wrong in Spark 2, but I'm not really familiar with 
the area of the code the exceptions are coming from.

> InMemoryRelation doesn't scale to large tables
> --
>
> Key: SPARK-17549
> URL: https://issues.apache.org/jira/browse/SPARK-17549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: create_parquet.scala, example_1.6_post_patch.png, 
> example_1.6_pre_patch.png, spark-1.6-2.patch, spark-1.6.patch
>
>
> An {{InMemoryRelation}} is created when you cache a table; but if the table 
> is large, defined by either having a really large amount of columns, or a 
> really large amount of partitions (in the file split sense, not the "table 
> partition" sense), or both, it causes an immense amount of memory to be used 
> in the driver.
> The reason is that it uses an accumulator to collect statistics about each 
> partition, and instead of summarizing the data in the driver, it keeps *all* 
> entries in memory.
> I'm attaching a script I used to create a parquet file with 20,000 columns 
> and a single row, which I then copied 500 times so I'd have 500 partitions.
> When doing the following:
> {code}
> sqlContext.read.parquet(...).count()
> {code}
> Everything works fine, both in Spark 1.6 and 2.0. (It's super slow with the 
> settings I used, but it works.)
> I ran spark-shell like this:
> {code}
> ./bin/spark-shell --master 'local-cluster[4,1,4096]' --driver-memory 2g 
> --conf spark.executor.memory=2g
> {code}
> And ran:
> {code}
> sqlContext.read.parquet(...).cache().count()
> {code}
> You'll see the results in screenshot {{example_1.6_pre_patch.png}}. After 40 
> partitions were processed, there were 40 GenericInternalRow objects with
> 100,000 items each (5 stat info fields * 20,000 columns). So, memory usage 
> was:
> {code}
>   40 * 100,000 * (4 * 20 + 24) = 416,000,000 =~ 400MB
> {code}
> (Note: Integer = 20 bytes, Long = 24 bytes.)
> If I waited until the end, there would be 500 partitions, so ~ 5GB of memory 
> to hold the stats.
> I'm also attaching a patch I made on top of 1.6 that uses just a long 
> accumulator to capture the table size; with that patch memory usage on the 
> driver doesn't keep growing. Also note in the patch that I'm multiplying the 
> column size by the row count, which I think is a different bug in the 
> existing code (those stats should be for the whole batch, not just a single 
> row, right?). I also added {{example_1.6_post_patch.png}} to show the 
> {{InMemoryRelation}} with the patch.
> I also applied a very similar patch on top of Spark 2.0. But there things 
> blow up even more spectacularly when I try to run the count on the cached 
> table. It starts with this error:
> {noformat}
> 14:19:43 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, 
> vanzin-st1-3.gce.cloudera.com): java.util.concurrent.ExecutionException: 
> java.lang.Exception: failed to compile: java.lang.IndexOutOfBoundsException: 
> Index: 63235, Size: 1
> (lots of generated code here...)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 63235, Size: 1
>   at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>   at java.util.ArrayList.get(ArrayList.java:411)
>   at 
> org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556)
>   at 
> org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572)
>   at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513)
>   at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644)
>   at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623)
>   at org.codehaus.janino.util.ClassFile.(ClassFile.java:280)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:913)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:911)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:911)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$

[jira] [Created] (SPARK-17550) DataFrameWriter.partitionBy() should throw exception if column is not present in Dataframe

2016-09-14 Thread Aniket Kulkarni (JIRA)
Aniket Kulkarni created SPARK-17550:
---

 Summary: DataFrameWriter.partitionBy() should throw exception if 
column is not present in Dataframe
 Key: SPARK-17550
 URL: https://issues.apache.org/jira/browse/SPARK-17550
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Aniket Kulkarni
Priority: Minor


I have a Spark job which performs certain computations on event data and 
eventually persists it to Hive.

I was trying to write to Hive using the code snippet shown below:
dataframe.write.format("orc").partitionBy(col1,col2).options(options).mode(SaveMode.Append).saveAsTable(hiveTable)

The write to Hive was not working because col2 in the above example was not present 
in the DataFrame. It was a little tedious to debug this, as no exception or 
message showed up in the logs; I was constantly seeing executor-lost failures 
in the logs and nothing more.

I think an exception should be thrown when one tries to write to Hive on 
a partitioning column that does not exist.

If this is indeed something that needs to be fixed, I would like to volunteer 
to fix it in the spark-core code base.
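
A minimal sketch of the kind of up-front check being requested (the helper name is illustrative and not an existing Spark API; the column names come from the snippet above):

{code}
// Illustrative helper: fail fast if a partition column is not present in the
// DataFrame being written, instead of failing later inside the write job.
import org.apache.spark.sql.DataFrame

def validatePartitionColumns(df: DataFrame, partitionCols: Seq[String]): Unit = {
  val missing = partitionCols.filterNot(df.columns.contains)
  require(missing.isEmpty,
    s"Partition column(s) not found in DataFrame: ${missing.mkString(", ")}")
}

// e.g. validatePartitionColumns(dataframe, Seq("col1", "col2")) before write.partitionBy(...)
{code}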






[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders

2016-09-14 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491787#comment-15491787
 ] 

holdenk commented on SPARK-16407:
-

That's part of why I decided to just use the ForeachRDD sink as a testing 
artifact and not expose it.

> Allow users to supply custom StreamSinkProviders
> 
>
> Key: SPARK-16407
> URL: https://issues.apache.org/jira/browse/SPARK-16407
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: holdenk
>
> The current DataStreamWriter allows users to specify a class name as format, 
> however it could be easier for people to directly pass in a specific provider 
> instance - e.g. for user equivalent of ForeachSink or other sink with 
> non-string parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders

2016-09-14 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491784#comment-15491784
 ] 

holdenk commented on SPARK-16407:
-

It's true that it doesn't work in SQL - but I don't think the current stream writer 
interface works so well in SQL anyway.

While it's true this doesn't yet expose a Python API - that's generally true for 
Structured Streaming. Once we do add a Python API, it's possible one could wrap 
Scala sinks that do custom callbacks to Python (similar to how the current 
Python streaming API works) or otherwise provide wrappers for JVM sinks, as is 
the general process for a lot of PySpark.

I'm not advocating for this to replace the string-based API but to complement 
it, to allow more flexibility for users (as demonstrated in the provided test).
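
For readers following along, a small sketch of the two shapes being discussed. The first route exists today (format resolved by class name); the second line is only the proposed instance-passing API and does not exist, so it is left as a comment, and the provider class name is made up:

{code}
// Works today: the sink provider is resolved by fully qualified class name.
val query = df.writeStream
  .format("com.example.MyCustomSinkProvider")        // hypothetical StreamSinkProvider implementation
  .option("checkpointLocation", "/tmp/chk")
  .start()

// Proposed (not an existing API): hand the writer a provider instance directly, so sinks
// with non-string constructor parameters become expressible.
// val query2 = df.writeStream.sinkProvider(new MyCustomSinkProvider(typedCallback)).start()
{code}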

> Allow users to supply custom StreamSinkProviders
> 
>
> Key: SPARK-16407
> URL: https://issues.apache.org/jira/browse/SPARK-16407
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: holdenk
>
> The current DataStreamWriter allows users to specify a class name as format, 
> however it could be easier for people to directly pass in a specific provider 
> instance - e.g. for user equivalent of ForeachSink or other sink with 
> non-string parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders

2016-09-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491779#comment-15491779
 ] 

Reynold Xin commented on SPARK-16407:
-

Actually I spoke too soon. I only read the code change and didn't read the 
description on the PR. I think the high-level idea of passing in a sink object 
makes sense, but I don't think the goal should be to support foreachRDD, 
because that goes directly against the spirit of one of the high-level goals of 
Structured Streaming: decoupling the logical plan from the physical execution, 
which includes not exposing batches to the end user.


> Allow users to supply custom StreamSinkProviders
> 
>
> Key: SPARK-16407
> URL: https://issues.apache.org/jira/browse/SPARK-16407
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: holdenk
>
> The current DataStreamWriter allows users to specify a class name as format, 
> however it could be easier for people to directly pass in a specific provider 
> instance - e.g. for user equivalent of ForeachSink or other sink with 
> non-string parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders

2016-09-14 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491772#comment-15491772
 ] 

holdenk commented on SPARK-16407:
-

Right, the simplest example of where you need to use the typed API is the 
provided ForeachSink that is used for testing.

> Allow users to supply custom StreamSinkProviders
> 
>
> Key: SPARK-16407
> URL: https://issues.apache.org/jira/browse/SPARK-16407
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: holdenk
>
> The current DataStreamWriter allows users to specify a class name as format, 
> however it could be easier for people to directly pass in a specific provider 
> instance - e.g. for user equivalent of ForeachSink or other sink with 
> non-string parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16439) Incorrect information in SQL Query details

2016-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16439:


Assignee: Apache Spark  (was: Maciej Bryński)

> Incorrect information in SQL Query details
> --
>
> Key: SPARK-16439
> URL: https://issues.apache.org/jira/browse/SPARK-16439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: sample.png, spark.jpg
>
>
> One picture is worth a thousand words.
> Please see attachment
> Incorrect values are in fields:
> * data size
> * number of output rows
> * time to collect



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16439) Incorrect information in SQL Query details

2016-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16439:


Assignee: Maciej Bryński  (was: Apache Spark)

> Incorrect information in SQL Query details
> --
>
> Key: SPARK-16439
> URL: https://issues.apache.org/jira/browse/SPARK-16439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Assignee: Maciej Bryński
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: sample.png, spark.jpg
>
>
> One picture is worth a thousand words.
> Please see attachment
> Incorrect values are in fields:
> * data size
> * number of output rows
> * time to collect



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16439) Incorrect information in SQL Query details

2016-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491766#comment-15491766
 ] 

Apache Spark commented on SPARK-16439:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/15106

> Incorrect information in SQL Query details
> --
>
> Key: SPARK-16439
> URL: https://issues.apache.org/jira/browse/SPARK-16439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Assignee: Maciej Bryński
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: sample.png, spark.jpg
>
>
> One picture is worth a thousand words.
> Please see attachment
> Incorrect values are in fields:
> * data size
> * number of output rows
> * time to collect



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17549) InMemoryRelation doesn't scale to large tables

2016-09-14 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin updated SPARK-17549:
---
Attachment: spark-1.6.patch
example_1.6_pre_patch.png
example_1.6_post_patch.png
create_parquet.scala

> InMemoryRelation doesn't scale to large tables
> --
>
> Key: SPARK-17549
> URL: https://issues.apache.org/jira/browse/SPARK-17549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Marcelo Vanzin
> Attachments: create_parquet.scala, example_1.6_post_patch.png, 
> example_1.6_pre_patch.png, spark-1.6.patch
>
>
> An {{InMemoryRelation}} is created when you cache a table; but if the table 
> is large, defined by either having a very large number of columns, or a very 
> large number of partitions (in the file split sense, not the "table 
> partition" sense), or both, it causes an immense amount of memory to be used 
> in the driver.
> The reason is that it uses an accumulator to collect statistics about each 
> partition, and instead of summarizing the data in the driver, it keeps *all* 
> entries in memory.
> I'm attaching a script I used to create a parquet file with 20,000 columns 
> and a single row, which I then copied 500 times so I'd have 500 partitions.
> When doing the following:
> {code}
> sqlContext.read.parquet(...).count()
> {code}
> Everything works fine, both in Spark 1.6 and 2.0. (It's super slow with the 
> settings I used, but it works.)
> I ran spark-shell like this:
> {code}
> ./bin/spark-shell --master 'local-cluster[4,1,4096]' --driver-memory 2g 
> --conf spark.executor.memory=2g
> {code}
> And ran:
> {code}
> sqlContext.read.parquet(...).cache().count()
> {code}
> You'll see the results in screenshot {{example_1.6_pre_patch.png}}. After 40 
> partitions were processed, there were 40 GenericInternalRow objects with
> 100,000 items each (5 stat info fields * 20,000 columns). So, memory usage 
> was:
> {code}
>   40 * 100,000 * (4 * 20 + 24) = 416,000,000 =~ 400MB
> {code}
> (Note: Integer = 20 bytes, Long = 24 bytes.)
> If I waited until the end, there would be 500 partitions, so ~ 5GB of memory 
> to hold the stats.
> I'm also attaching a patch I made on top of 1.6 that uses just a long 
> accumulator to capture the table size; with that patch memory usage on the 
> driver doesn't keep growing. Also note in the patch that I'm multiplying the 
> column size by the row count, which I think is a different bug in the 
> existing code (those stats should be for the whole batch, not just a single 
> row, right?). I also added {{example_1.6_post_patch.png}} to show the 
> {{InMemoryRelation}} with the patch.
> I also applied a very similar patch on top of Spark 2.0. But there things 
> blow up even more spectacularly when I try to run the count on the cached 
> table. It starts with this error:
> {noformat}
> 14:19:43 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, 
> vanzin-st1-3.gce.cloudera.com): java.util.concurrent.ExecutionException: 
> java.lang.Exception: failed to compile: java.lang.IndexOutOfBoundsException: 
> Index: 63235, Size: 1
> (lots of generated code here...)
> Caused by: java.lang.IndexOutOfBoundsException: Index: 63235, Size: 1
>   at java.util.ArrayList.rangeCheck(ArrayList.java:635)
>   at java.util.ArrayList.get(ArrayList.java:411)
>   at 
> org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556)
>   at 
> org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572)
>   at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513)
>   at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644)
>   at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623)
>   at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:280)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:913)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:911)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>   at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
>   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:911)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:883)
>   ... 54 more
> {noformat}
> And basically a 

[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders

2016-09-14 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491752#comment-15491752
 ] 

Reynold Xin commented on SPARK-16407:
-

This doesn't work in SQL, Python, etc.

I like the general idea of providing a way to make the complex things possible, 
but we shouldn't make this the default way and allow developers to use this as 
an excuse to build APIs that are not compatible across languages.

> Allow users to supply custom StreamSinkProviders
> 
>
> Key: SPARK-16407
> URL: https://issues.apache.org/jira/browse/SPARK-16407
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: holdenk
>
> The current DataStreamWriter allows users to specify a class name as format, 
> however it could be easier for people to directly pass in a specific provider 
> instance - e.g. for user equivalent of ForeachSink or other sink with 
> non-string parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-16439) Incorrect information in SQL Query details

2016-09-14 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reopened SPARK-16439:


We could bring the separator back for better readability.

> Incorrect information in SQL Query details
> --
>
> Key: SPARK-16439
> URL: https://issues.apache.org/jira/browse/SPARK-16439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Assignee: Maciej Bryński
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: sample.png, spark.jpg
>
>
> One picture is worth a thousand words.
> Please see attachment
> Incorrect values are in fields:
> * data size
> * number of output rows
> * time to collect



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17549) InMemoryRelation doesn't scale to large tables

2016-09-14 Thread Marcelo Vanzin (JIRA)
Marcelo Vanzin created SPARK-17549:
--

 Summary: InMemoryRelation doesn't scale to large tables
 Key: SPARK-17549
 URL: https://issues.apache.org/jira/browse/SPARK-17549
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0, 1.6.0
Reporter: Marcelo Vanzin


An {{InMemoryRelation}} is created when you cache a table; but if the table is 
large, defined by either having a very large number of columns, or a very large 
number of partitions (in the file split sense, not the "table partition" 
sense), or both, it causes an immense amount of memory to be used in the driver.

The reason is that it uses an accumulator to collect statistics about each 
partition, and instead of summarizing the data in the driver, it keeps *all* 
entries in memory.

I'm attaching a script I used to create a parquet file with 20,000 columns and 
a single row, which I then copied 500 times so I'd have 500 partitions.

When doing the following:

{code}
sqlContext.read.parquet(...).count()
{code}

Everything works fine, both in Spark 1.6 and 2.0. (It's super slow with the 
settings I used, but it works.)

I ran spark-shell like this:

{code}
./bin/spark-shell --master 'local-cluster[4,1,4096]' --driver-memory 2g --conf 
spark.executor.memory=2g
{code}

And ran:

{code}
sqlContext.read.parquet(...).cache().count()
{code}

You'll see the results in screenshot {{example_1.6_pre_patch.png}}. After 40 
partitions were processed, there were 40 GenericInternalRow objects with
100,000 items each (5 stat info fields * 20,000 columns). So, memory usage was:

{code}
  40 * 100,000 * (4 * 20 + 24) = 416,000,000 =~ 400MB
{code}

(Note: Integer = 20 bytes, Long = 24 bytes.)

If I waited until the end, there would be 500 partitions, so ~ 5GB of memory to 
hold the stats.


I'm also attaching a patch I made on top of 1.6 that uses just a long 
accumulator to capture the table size; with that patch memory usage on the 
driver doesn't keep growing. Also note in the patch that I'm multiplying the 
column size by the row count, which I think is a different bug in the existing 
code (those stats should be for the whole batch, not just a single row, 
right?). I also added {{example_1.6_post_patch.png}} to show the 
{{InMemoryRelation}} with the patch.
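
A minimal sketch of the idea in the attached patch, not the patch itself (it assumes a SparkSession named {{spark}}; the accumulator name and the commented-out executor-side line are illustrative only):

{code}
// The driver keeps one running total instead of one stats row per partition.
val batchBytes = spark.sparkContext.longAccumulator("InMemoryRelation batch size")

// Executor side, when a cached batch is built (illustrative location only):
// batchBytes.add(sizeOfThisBatchInBytes)

// Driver side: a single Long, independent of the number of partitions or columns.
val estimatedTableSize = batchBytes.value
{code}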


I also applied a very similar patch on top of Spark 2.0. But there things blow 
up even more spectacularly when I try to run the count on the cached table. It 
starts with this error:

{noformat}
14:19:43 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, 
vanzin-st1-3.gce.cloudera.com): java.util.concurrent.ExecutionException: 
java.lang.Exception: failed to compile: java.lang.IndexOutOfBoundsException: 
Index: 63235, Size: 1
(lots of generated code here...)
Caused by: java.lang.IndexOutOfBoundsException: Index: 63235, Size: 1
at java.util.ArrayList.rangeCheck(ArrayList.java:635)
at java.util.ArrayList.get(ArrayList.java:411)
at 
org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556)
at 
org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572)
at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513)
at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644)
at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623)
at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:280)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:913)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:911)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:911)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:883)
... 54 more
{noformat}

And basically a lot of that going on making the output unreadable, so I just 
killed the shell. Anyway, I believe the same fix should work there, but I can't 
be sure because the test doesn't work for different reasons, it seems.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16439) Incorrect information in SQL Query details

2016-09-14 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491744#comment-15491744
 ] 

Davies Liu commented on SPARK-16439:


The separator was added on purpose; otherwise it's very difficult to read the 
numbers (especially the number of rows), since we need to count the digits to 
realize how large a number is. I think we should still keep it and fix the 
locale issue (using the English locale should be enough). I will send a PR to 
add it back.
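
For context, a tiny illustration of why the grouping separator is locale-sensitive (standard JDK API, not the actual UI code):

{code}
import java.text.NumberFormat
import java.util.Locale

NumberFormat.getIntegerInstance(Locale.ENGLISH).format(1234567L)   // "1,234,567"
NumberFormat.getIntegerInstance(Locale.GERMANY).format(1234567L)   // "1.234.567" - separator depends on locale
{code}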

> Incorrect information in SQL Query details
> --
>
> Key: SPARK-16439
> URL: https://issues.apache.org/jira/browse/SPARK-16439
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Web UI
>Affects Versions: 2.0.0
>Reporter: Maciej Bryński
>Assignee: Maciej Bryński
>Priority: Minor
> Fix For: 2.0.0
>
> Attachments: sample.png, spark.jpg
>
>
> One picture is worth a thousand words.
> Please see attachment
> Incorrect values are in fields:
> * data size
> * number of output rows
> * time to collect



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders

2016-09-14 Thread Shixiong Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491739#comment-15491739
 ] 

Shixiong Zhu commented on SPARK-16407:
--

Right now we don't want to add such a typed API to DataStreamWriter, as it's not 
easy to port to other languages. Could you give us an example where you must use 
a typed API?

Also cc [~rxin] [~marmbrus]

> Allow users to supply custom StreamSinkProviders
> 
>
> Key: SPARK-16407
> URL: https://issues.apache.org/jira/browse/SPARK-16407
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: holdenk
>
> The current DataStreamWriter allows users to specify a class name as format, 
> however it could be easier for people to directly pass in a specific provider 
> instance - e.g. for user equivalent of ForeachSink or other sink with 
> non-string parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17548) Word2VecModel.findSynonyms can spuriously reject the best match when invoked with a vector

2016-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17548:


Assignee: (was: Apache Spark)

> Word2VecModel.findSynonyms can spuriously reject the best match when invoked 
> with a vector
> --
>
> Key: SPARK-17548
> URL: https://issues.apache.org/jira/browse/SPARK-17548
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.4.1, 1.5.2, 1.6.2, 2.0.0
> Environment: any
>Reporter: William Benton
>Priority: Minor
>
> The `findSynonyms` method in `Word2VecModel` currently rejects the best match 
> a priori. When `findSynonyms` is invoked with a word, the best match is 
> almost certain to be that word, but `findSynonyms` can also be invoked with a 
> vector, which might not correspond to any of the words in the model's 
> vocabulary.  In the latter case, rejecting the best match is spurious.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17548) Word2VecModel.findSynonyms can spuriously reject the best match when invoked with a vector

2016-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491626#comment-15491626
 ] 

Apache Spark commented on SPARK-17548:
--

User 'willb' has created a pull request for this issue:
https://github.com/apache/spark/pull/15105

> Word2VecModel.findSynonyms can spuriously reject the best match when invoked 
> with a vector
> --
>
> Key: SPARK-17548
> URL: https://issues.apache.org/jira/browse/SPARK-17548
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.4.1, 1.5.2, 1.6.2, 2.0.0
> Environment: any
>Reporter: William Benton
>Priority: Minor
>
> The `findSynonyms` method in `Word2VecModel` currently rejects the best match 
> a priori. When `findSynonyms` is invoked with a word, the best match is 
> almost certain to be that word, but `findSynonyms` can also be invoked with a 
> vector, which might not correspond to any of the words in the model's 
> vocabulary.  In the latter case, rejecting the best match is spurious.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17548) Word2VecModel.findSynonyms can spuriously reject the best match when invoked with a vector

2016-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17548:


Assignee: Apache Spark

> Word2VecModel.findSynonyms can spuriously reject the best match when invoked 
> with a vector
> --
>
> Key: SPARK-17548
> URL: https://issues.apache.org/jira/browse/SPARK-17548
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.4.1, 1.5.2, 1.6.2, 2.0.0
> Environment: any
>Reporter: William Benton
>Assignee: Apache Spark
>Priority: Minor
>
> The `findSynonyms` method in `Word2VecModel` currently rejects the best match 
> a priori. When `findSynonyms` is invoked with a word, the best match is 
> almost certain to be that word, but `findSynonyms` can also be invoked with a 
> vector, which might not correspond to any of the words in the model's 
> vocabulary.  In the latter case, rejecting the best match is spurious.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17548) Word2VecModel.findSynonyms can spuriously reject the best match when invoked with a vector

2016-09-14 Thread William Benton (JIRA)
William Benton created SPARK-17548:
--

 Summary: Word2VecModel.findSynonyms can spuriously reject the best 
match when invoked with a vector
 Key: SPARK-17548
 URL: https://issues.apache.org/jira/browse/SPARK-17548
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 2.0.0, 1.6.2, 1.5.2, 1.4.1
 Environment: any
Reporter: William Benton
Priority: Minor


The `findSynonyms` method in `Word2VecModel` currently rejects the best match a 
priori. When `findSynonyms` is invoked with a word, the best match is almost 
certain to be that word, but `findSynonyms` can also be invoked with a vector, 
which might not correspond to any of the words in the model's vocabulary.  In 
the latter case, rejecting the best match is spurious.
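
Both overloads exist on org.apache.spark.mllib.feature.Word2VecModel; a short sketch of the asymmetry described above, assuming a fitted {{model}} and a query vector {{vec}} that is not itself a vocabulary word:

{code}
// Querying by word: the top cosine match is almost always the query word itself,
// so dropping the best match is the desired behaviour.
val byWord = model.findSynonyms("spark", 5)

// Querying by vector: the vector may not correspond to any vocabulary word, so the
// best match can be a genuine synonym and should not be discarded a priori.
val byVector = model.findSynonyms(vec, 5)
{code}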



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17547) Temporary shuffle data files may be leaked following exception in write

2016-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17547:


Assignee: Josh Rosen  (was: Apache Spark)

> Temporary shuffle data files may be leaked following exception in write
> ---
>
> Key: SPARK-17547
> URL: https://issues.apache.org/jira/browse/SPARK-17547
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.5.3, 1.6.0, 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> SPARK-8029 modified shuffle writers to first stage their data to a temporary 
> file in the same directory as the final destination file and then to 
> atomically rename the file at the end of the write job. However, this change 
> introduced the potential for the temporary output file to be leaked if an 
> exception occurs during the write because the shuffle writers' existing error 
> cleanup code doesn't handle this new temp file.
> This is easy to fix: we just need to add a {{finally}} block to ensure that 
> the temporary file is guaranteed to be either moved or deleted before 
> exiting the shuffle write method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17547) Temporary shuffle data files may be leaked following exception in write

2016-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491548#comment-15491548
 ] 

Apache Spark commented on SPARK-17547:
--

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/15104

> Temporary shuffle data files may be leaked following exception in write
> ---
>
> Key: SPARK-17547
> URL: https://issues.apache.org/jira/browse/SPARK-17547
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.5.3, 1.6.0, 2.0.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>
> SPARK-8029 modified shuffle writers to first stage their data to a temporary 
> file in the same directory as the final destination file and then to 
> atomically rename the file at the end of the write job. However, this change 
> introduced the potential for the temporary output file to be leaked if an 
> exception occurs during the write because the shuffle writers' existing error 
> cleanup code doesn't handle this new temp file.
> This is easy to fix: we just need to add a {{finally}} block to ensure that 
> the temporary file is guaranteed to be either moved or deleted before 
> exiting the shuffle write method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17547) Temporary shuffle data files may be leaked following exception in write

2016-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17547:


Assignee: Apache Spark  (was: Josh Rosen)

> Temporary shuffle data files may be leaked following exception in write
> ---
>
> Key: SPARK-17547
> URL: https://issues.apache.org/jira/browse/SPARK-17547
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.5.3, 1.6.0, 2.0.0
>Reporter: Josh Rosen
>Assignee: Apache Spark
>
> SPARK-8029 modified shuffle writers to first stage their data to a temporary 
> file in the same directory as the final destination file and then to 
> atomically rename the file at the end of the write job. However, this change 
> introduced the potential for the temporary output file to be leaked if an 
> exception occurs during the write because the shuffle writers' existing error 
> cleanup code doesn't handle this new temp file.
> This is easy to fix: we just need to add a {{finally}} block to ensure that 
> the temporary file is guaranteed to be either moved or deleted before 
> exiting the shuffle write method.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17547) Temporary shuffle data files may be leaked following exception in write

2016-09-14 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-17547:
--

 Summary: Temporary shuffle data files may be leaked following 
exception in write
 Key: SPARK-17547
 URL: https://issues.apache.org/jira/browse/SPARK-17547
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 2.0.0, 1.6.0, 1.5.3
Reporter: Josh Rosen
Assignee: Josh Rosen


SPARK-8029 modified shuffle writers to first stage their data to a temporary 
file in the same directory as the final destination file and then to atomically 
rename the file at the end of the write job. However, this change introduced 
the potential for the temporary output file to be leaked if an exception occurs 
during the write because the shuffle writers' existing error cleanup code 
doesn't handle this new temp file.

This is easy to fix: we just need to add a {{finally}} block to ensure that the 
temporary file is guaranteed to be either moved or deleted before exiting the 
shuffle write method.
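
A minimal sketch of the guard being proposed, not the actual shuffle writer code (the method and variable names here are made up):

{code}
import java.io.{File, IOException}

// Stage output into `tmp`, then promote it to `dest`; never leak `tmp` on failure.
def writeAndCommit(tmp: File, dest: File)(writeBody: File => Unit): Unit = {
  var committed = false
  try {
    writeBody(tmp)
    if (!tmp.renameTo(dest)) throw new IOException(s"Failed to rename $tmp to $dest")
    committed = true
  } finally {
    if (!committed && tmp.exists()) tmp.delete()   // the cleanup the finally block guarantees
  }
}
{code}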



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17100) pyspark filter on a udf column after join gives java.lang.UnsupportedOperationException

2016-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491518#comment-15491518
 ] 

Apache Spark commented on SPARK-17100:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/15103

> pyspark filter on a udf column after join gives 
> java.lang.UnsupportedOperationException
> ---
>
> Key: SPARK-17100
> URL: https://issues.apache.org/jira/browse/SPARK-17100
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: spark-2.0.0-bin-hadoop2.7. Python2 and Python3.
>Reporter: Tim Sell
>Assignee: Davies Liu
> Attachments: bug.py, test_bug.py
>
>
> In pyspark, when filtering on a udf derived column after some join types,
> the optimized logical plan results in a 
> java.lang.UnsupportedOperationException.
> I could not replicate this in scala code from the shell, just python. It is a 
> pyspark regression from spark 1.6.2.
> This can be replicated with: bin/spark-submit bug.py
> {code:python:title=bug.py}
> import pyspark.sql.functions as F
> from pyspark.sql import Row, SparkSession
> if __name__ == '__main__':
> spark = SparkSession.builder.appName("test").getOrCreate()
> left = spark.createDataFrame([Row(a=1)])
> right = spark.createDataFrame([Row(a=1)])
> df = left.join(right, on='a', how='left_outer')
> df = df.withColumn('b', F.udf(lambda x: 'x')(df.a))
> df = df.filter('b = "x"')
> df.explain(extended=True)
> {code}
> The output is:
> {code}
> == Parsed Logical Plan ==
> 'Filter ('b = x)
> +- Project [a#0L, <lambda>(a#0L) AS b#8]
>+- Project [a#0L]
>   +- Join LeftOuter, (a#0L = a#3L)
>  :- LogicalRDD [a#0L]
>  +- LogicalRDD [a#3L]
> == Analyzed Logical Plan ==
> a: bigint, b: string
> Filter (b#8 = x)
> +- Project [a#0L, <lambda>(a#0L) AS b#8]
>+- Project [a#0L]
>   +- Join LeftOuter, (a#0L = a#3L)
>  :- LogicalRDD [a#0L]
>  +- LogicalRDD [a#3L]
> == Optimized Logical Plan ==
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> <lambda>(input[0, bigint, true])
> == Physical Plan ==
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> <lambda>(input[0, bigint, true])
> {code}
> It fails when the join is:
> * how='outer', on=column expression
> * how='left_outer', on=string or column expression
> * how='right_outer', on=string or column expression
> It passes when the join is:
> * how='inner', on=string or column expression
> * how='outer', on=string
> I made some tests to demonstrate each of these.
> Run with bin/spark-submit test_bug.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17100) pyspark filter on a udf column after join gives java.lang.UnsupportedOperationException

2016-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17100:


Assignee: Davies Liu  (was: Apache Spark)

> pyspark filter on a udf column after join gives 
> java.lang.UnsupportedOperationException
> ---
>
> Key: SPARK-17100
> URL: https://issues.apache.org/jira/browse/SPARK-17100
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: spark-2.0.0-bin-hadoop2.7. Python2 and Python3.
>Reporter: Tim Sell
>Assignee: Davies Liu
> Attachments: bug.py, test_bug.py
>
>
> In pyspark, when filtering on a udf derived column after some join types,
> the optimized logical plan results in a 
> java.lang.UnsupportedOperationException.
> I could not replicate this in scala code from the shell, just python. It is a 
> pyspark regression from spark 1.6.2.
> This can be replicated with: bin/spark-submit bug.py
> {code:python:title=bug.py}
> import pyspark.sql.functions as F
> from pyspark.sql import Row, SparkSession
> if __name__ == '__main__':
> spark = SparkSession.builder.appName("test").getOrCreate()
> left = spark.createDataFrame([Row(a=1)])
> right = spark.createDataFrame([Row(a=1)])
> df = left.join(right, on='a', how='left_outer')
> df = df.withColumn('b', F.udf(lambda x: 'x')(df.a))
> df = df.filter('b = "x"')
> df.explain(extended=True)
> {code}
> The output is:
> {code}
> == Parsed Logical Plan ==
> 'Filter ('b = x)
> +- Project [a#0L, <lambda>(a#0L) AS b#8]
>+- Project [a#0L]
>   +- Join LeftOuter, (a#0L = a#3L)
>  :- LogicalRDD [a#0L]
>  +- LogicalRDD [a#3L]
> == Analyzed Logical Plan ==
> a: bigint, b: string
> Filter (b#8 = x)
> +- Project [a#0L, <lambda>(a#0L) AS b#8]
>+- Project [a#0L]
>   +- Join LeftOuter, (a#0L = a#3L)
>  :- LogicalRDD [a#0L]
>  +- LogicalRDD [a#3L]
> == Optimized Logical Plan ==
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> <lambda>(input[0, bigint, true])
> == Physical Plan ==
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> <lambda>(input[0, bigint, true])
> {code}
> It fails when the join is:
> * how='outer', on=column expression
> * how='left_outer', on=string or column expression
> * how='right_outer', on=string or column expression
> It passes when the join is:
> * how='inner', on=string or column expression
> * how='outer', on=string
> I made some tests to demonstrate each of these.
> Run with bin/spark-submit test_bug.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17100) pyspark filter on a udf column after join gives java.lang.UnsupportedOperationException

2016-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17100:


Assignee: Apache Spark  (was: Davies Liu)

> pyspark filter on a udf column after join gives 
> java.lang.UnsupportedOperationException
> ---
>
> Key: SPARK-17100
> URL: https://issues.apache.org/jira/browse/SPARK-17100
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: spark-2.0.0-bin-hadoop2.7. Python2 and Python3.
>Reporter: Tim Sell
>Assignee: Apache Spark
> Attachments: bug.py, test_bug.py
>
>
> In pyspark, when filtering on a udf derived column after some join types,
> the optimized logical plan results in a 
> java.lang.UnsupportedOperationException.
> I could not replicate this in scala code from the shell, just python. It is a 
> pyspark regression from spark 1.6.2.
> This can be replicated with: bin/spark-submit bug.py
> {code:python:title=bug.py}
> import pyspark.sql.functions as F
> from pyspark.sql import Row, SparkSession
> if __name__ == '__main__':
> spark = SparkSession.builder.appName("test").getOrCreate()
> left = spark.createDataFrame([Row(a=1)])
> right = spark.createDataFrame([Row(a=1)])
> df = left.join(right, on='a', how='left_outer')
> df = df.withColumn('b', F.udf(lambda x: 'x')(df.a))
> df = df.filter('b = "x"')
> df.explain(extended=True)
> {code}
> The output is:
> {code}
> == Parsed Logical Plan ==
> 'Filter ('b = x)
> +- Project [a#0L, <lambda>(a#0L) AS b#8]
>+- Project [a#0L]
>   +- Join LeftOuter, (a#0L = a#3L)
>  :- LogicalRDD [a#0L]
>  +- LogicalRDD [a#3L]
> == Analyzed Logical Plan ==
> a: bigint, b: string
> Filter (b#8 = x)
> +- Project [a#0L, <lambda>(a#0L) AS b#8]
>+- Project [a#0L]
>   +- Join LeftOuter, (a#0L = a#3L)
>  :- LogicalRDD [a#0L]
>  +- LogicalRDD [a#3L]
> == Optimized Logical Plan ==
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> <lambda>(input[0, bigint, true])
> == Physical Plan ==
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> <lambda>(input[0, bigint, true])
> {code}
> It fails when the join is:
> * how='outer', on=column expression
> * how='left_outer', on=string or column expression
> * how='right_outer', on=string or column expression
> It passes when the join is:
> * how='inner', on=string or column expression
> * how='outer', on=string
> I made some tests to demonstrate each of these.
> Run with bin/spark-submit test_bug.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17346) Kafka 0.10 support in Structured Streaming

2016-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491487#comment-15491487
 ] 

Apache Spark commented on SPARK-17346:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/15102

> Kafka 0.10 support in Structured Streaming
> --
>
> Key: SPARK-17346
> URL: https://issues.apache.org/jira/browse/SPARK-17346
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Frederick Reiss
>
> Implement Kafka 0.10-based sources and sinks for Structured Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17346) Kafka 0.10 support in Structured Streaming

2016-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17346:


Assignee: Apache Spark

> Kafka 0.10 support in Structured Streaming
> --
>
> Key: SPARK-17346
> URL: https://issues.apache.org/jira/browse/SPARK-17346
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Frederick Reiss
>Assignee: Apache Spark
>
> Implement Kafka 0.10-based sources and sinks for Structured Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17346) Kafka 0.10 support in Structured Streaming

2016-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17346:


Assignee: (was: Apache Spark)

> Kafka 0.10 support in Structured Streaming
> --
>
> Key: SPARK-17346
> URL: https://issues.apache.org/jira/browse/SPARK-17346
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Frederick Reiss
>
> Implement Kafka 0.10-based sources and sinks for Structured Streaming.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams

2016-09-14 Thread Jeff Nadler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491460#comment-15491460
 ] 

Jeff Nadler commented on SPARK-17510:
-

That would be incredible, thank you very much for looking into this.

If we were able to set maxRate independently I think we'd just use Direct 
Streams for both topics, with no maxRate for the session stream, and our normal 
maxRate for the events.
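
For concreteness, the knobs involved today are global SparkConf settings (these property names are real; the per-stream override in the trailing comment is only the kind of API being requested and does not exist):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")   // applies to every direct stream
  .set("spark.streaming.receiver.maxRate", "5000")            // applies to every receiver-based stream
  .set("spark.streaming.backpressure.enabled", "true")        // likewise global

// Requested here (hypothetical, not an existing API): per-stream overrides, e.g. passed
// alongside kafkaParams when calling KafkaUtils.createDirectStream for each topic.
{code}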



> Set Streaming MaxRate Independently For Multiple Streams
> 
>
> Key: SPARK-17510
> URL: https://issues.apache.org/jira/browse/SPARK-17510
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Jeff Nadler
>
> We use multiple DStreams coming from different Kafka topics in a Streaming 
> application.
> Some settings like maxrate and backpressure enabled/disabled would be better 
> passed as config to KafkaUtils.createStream and 
> KafkaUtils.createDirectStream, instead of setting them in SparkConf.
> Being able to set a different maxrate for different streams is an important 
> requirement for us; we currently work-around the problem by using one 
> receiver-based stream and one direct stream.   
> We would like to be able to turn on backpressure for only one of the streams 
> as well.   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams

2016-09-14 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491455#comment-15491455
 ] 

Cody Koeninger commented on SPARK-17510:


OK, next time I get some free hacking time I can make a branch for you to try 
out, to see if it addresses the issue.

> Set Streaming MaxRate Independently For Multiple Streams
> 
>
> Key: SPARK-17510
> URL: https://issues.apache.org/jira/browse/SPARK-17510
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Jeff Nadler
>
> We use multiple DStreams coming from different Kafka topics in a Streaming 
> application.
> Some settings like maxrate and backpressure enabled/disabled would be better 
> passed as config to KafkaUtils.createStream and 
> KafkaUtils.createDirectStream, instead of setting them in SparkConf.
> Being able to set a different maxrate for different streams is an important 
> requirement for us; we currently work-around the problem by using one 
> receiver-based stream and one direct stream.   
> We would like to be able to turn on backpressure for only one of the streams 
> as well.   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17100) pyspark filter on a udf column after join gives java.lang.UnsupportedOperationException

2016-09-14 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-17100:
--

Assignee: Davies Liu

> pyspark filter on a udf column after join gives 
> java.lang.UnsupportedOperationException
> ---
>
> Key: SPARK-17100
> URL: https://issues.apache.org/jira/browse/SPARK-17100
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: spark-2.0.0-bin-hadoop2.7. Python2 and Python3.
>Reporter: Tim Sell
>Assignee: Davies Liu
> Attachments: bug.py, test_bug.py
>
>
> In pyspark, when filtering on a udf derived column after some join types,
> the optimized logical plan results in a 
> java.lang.UnsupportedOperationException.
> I could not replicate this in scala code from the shell, just python. It is a 
> pyspark regression from spark 1.6.2.
> This can be replicated with: bin/spark-submit bug.py
> {code:python:title=bug.py}
> import pyspark.sql.functions as F
> from pyspark.sql import Row, SparkSession
> if __name__ == '__main__':
> spark = SparkSession.builder.appName("test").getOrCreate()
> left = spark.createDataFrame([Row(a=1)])
> right = spark.createDataFrame([Row(a=1)])
> df = left.join(right, on='a', how='left_outer')
> df = df.withColumn('b', F.udf(lambda x: 'x')(df.a))
> df = df.filter('b = "x"')
> df.explain(extended=True)
> {code}
> The output is:
> {code}
> == Parsed Logical Plan ==
> 'Filter ('b = x)
> +- Project [a#0L, <lambda>(a#0L) AS b#8]
>+- Project [a#0L]
>   +- Join LeftOuter, (a#0L = a#3L)
>  :- LogicalRDD [a#0L]
>  +- LogicalRDD [a#3L]
> == Analyzed Logical Plan ==
> a: bigint, b: string
> Filter (b#8 = x)
> +- Project [a#0L, <lambda>(a#0L) AS b#8]
>+- Project [a#0L]
>   +- Join LeftOuter, (a#0L = a#3L)
>  :- LogicalRDD [a#0L]
>  +- LogicalRDD [a#3L]
> == Optimized Logical Plan ==
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> <lambda>(input[0, bigint, true])
> == Physical Plan ==
> java.lang.UnsupportedOperationException: Cannot evaluate expression: 
> <lambda>(input[0, bigint, true])
> {code}
> It fails when the join is:
> * how='outer', on=column expression
> * how='left_outer', on=string or column expression
> * how='right_outer', on=string or column expression
> It passes when the join is:
> * how='inner', on=string or column expression
> * how='outer', on=string
> I made some tests to demonstrate each of these.
> Run with bin/spark-submit test_bug.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17465) Inappropriate memory management in `org.apache.spark.storage.MemoryStore` may lead to memory leak

2016-09-14 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17465:
---
Fix Version/s: 2.1.0
   2.0.1

> Inappropriate memory management in `org.apache.spark.storage.MemoryStore` may 
> lead to memory leak
> -
>
> Key: SPARK-17465
> URL: https://issues.apache.org/jira/browse/SPARK-17465
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Xing Shi
>Assignee: Xing Shi
> Fix For: 1.6.3, 2.0.1, 2.1.0
>
>
> After updating Spark from 1.5.0 to 1.6.0, I found what seems to be a memory 
> leak in my Spark Streaming application.
> Here is the head of the heap histogram of my application, which has been 
> running about 160 hours:
> {code:borderStyle=solid}
>  num #instances #bytes  class name
> --
>1: 28094   71753976  [B
>2:   1188086   28514064  java.lang.Long
>3:   1183844   28412256  scala.collection.mutable.DefaultEntry
>4:102242   13098768  
>5:102242   12421000  
>6:  81849199032  
>7:388391584  [Lscala.collection.mutable.HashEntry;
>8:  81847514288  
>9:  66514874080  
>   10: 371973438040  [C
>   11:  64232445640  
>   12:  87731044808  java.lang.Class
>   13: 36869 884856  java.lang.String
>   14: 15715 848368  [[I
>   15: 13690 782808  [S
>   16: 18903 604896  
> java.util.concurrent.ConcurrentHashMap$HashEntry
>   17:13 426192  [Lscala.concurrent.forkjoin.ForkJoinTask;
> {code}
> It shows that *scala.collection.mutable.DefaultEntry* and *java.lang.Long* 
> have unexpectedly large numbers of instances. In fact, the numbers started 
> growing when the streaming process began, and they keep growing in proportion 
> to the total number of tasks.
> After some further investigation, I found that the problem is caused by some 
> inappropriate memory management in _releaseUnrollMemoryForThisTask_ and 
> _unrollSafely_ method of class 
> [org.apache.spark.storage.MemoryStore|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala].
> In Spark 1.6.x, a _releaseUnrollMemoryForThisTask_ operation will be 
> processed only with the parameter _memoryToRelease_ > 0:
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L530-L537
> But in fact, if a task successfully unrolled all its blocks in memory by 
> _unrollSafely_ method, the memory saved in _unrollMemoryMap_ would be set to 
> zero:
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L322
> So the result is that the memory recorded in _unrollMemoryMap_ is released, but 
> the key for that memory is never removed from the hash map. The hash table 
> keeps growing while new tasks keep coming in. Although the rate of increase is 
> comparatively slow (about dozens of bytes per task), it can eventually result 
> in an OOM after weeks or months.
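
A small standalone illustration of the bookkeeping pattern described above (this is not the MemoryStore code; names and logic are simplified):

{code}
import scala.collection.mutable

val unrollMemoryMap = mutable.HashMap[Long, Long]()   // taskAttemptId -> unrolled bytes

def releaseUnrollMemoryForThisTask(taskId: Long, memoryToRelease: Long): Unit = {
  if (memoryToRelease > 0) {   // the 1.6.x guard: nothing happens when asked to release 0 bytes
    unrollMemoryMap(taskId) = unrollMemoryMap(taskId) - memoryToRelease
    if (unrollMemoryMap(taskId) == 0) unrollMemoryMap.remove(taskId)
  }
  // When unrollSafely has already zeroed the entry, the release is requested with
  // memoryToRelease == 0, the guard short-circuits, and the stale (taskId -> 0)
  // entry is never removed - one leaked map entry per finished task.
}
{code}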



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17544) Timeout waiting for connection from pool, DataFrame Reader's not closing S3 connections?

2016-09-14 Thread Brady Auen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491445#comment-15491445
 ] 

Brady Auen commented on SPARK-17544:


http://stackoverflow.com/questions/39498492/spark-and-amazon-emr-s3-connections-not-being-closed

> Timeout waiting for connection from pool, DataFrame Reader's not closing S3 
> connections?
> 
>
> Key: SPARK-17544
> URL: https://issues.apache.org/jira/browse/SPARK-17544
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Amazon EMR, S3, Scala
>Reporter: Brady Auen
>  Labels: newbie
>
> I have an application that loops through a text file to find files in S3 and 
> then reads them in, performs some ETL processes, and then writes them out. 
> This works for around 80 loops until I get this:  
>   
> 16/09/14 18:58:23 INFO S3NativeFileSystem: Opening 
> 's3://webpt-emr-us-west-2-hadoop/edw/master/fdbk_ideavote/20160907/part-m-0.avro'
>  for reading
> 16/09/14 18:59:13 INFO AmazonHttpClient: Unable to execute HTTP request: 
> Timeout waiting for connection from pool
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException:
>  Timeout waiting for connection from pool
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195)
>   at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.$Proxy37.getConnection(Unknown
>  Source)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:837)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991)
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:212)
>   at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy36.retrieveMetadata(Unknown Source)
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:780)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1428)
>   at 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:313)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:289)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSourc

[jira] [Commented] (SPARK-17114) Adding a 'GROUP BY 1' where first column is literal results in wrong answer

2016-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491437#comment-15491437
 ] 

Apache Spark commented on SPARK-17114:
--

User 'hvanhovell' has created a pull request for this issue:
https://github.com/apache/spark/pull/15101

> Adding a 'GROUP BY 1' where first column is literal results in wrong answer
> ---
>
> Key: SPARK-17114
> URL: https://issues.apache.org/jira/browse/SPARK-17114
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.2, 2.0.0
>Reporter: Josh Rosen
>  Labels: correctness
>
> Consider the following example:
> {code}
> sc.parallelize(Seq(128, 256)).toDF("int_col").registerTempTable("mytable")
> // The following query should return an empty result set because the WHERE 
> // filter condition is always false for this table.
> val withoutGroupBy = sqlContext.sql("""
>   SELECT 'foo'
>   FROM mytable
>   WHERE int_col == 0
> """)
> assert(withoutGroupBy.collect().isEmpty, "original query returned wrong 
> answer")
> // After adding a 'GROUP BY 1' the query result should still be empty because 
> we'd be grouping an empty table:
> val withGroupBy = sqlContext.sql("""
>   SELECT 'foo'
>   FROM mytable
>   WHERE int_col == 0
>   GROUP BY 1
> """)
> assert(withGroupBy.collect().isEmpty, "adding GROUP BY resulted in wrong 
> answer")
> {code}
> Here, this fails the second assertion by returning a single row. It appears 
> that running {{group by 1}} where column 1 is a constant causes filter 
> conditions to be ignored.
> Both PostgreSQL and SQLite return empty result sets for the query containing 
> the {{GROUP BY}}. 
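
For reference, a minimal sketch of the same check against the Spark 2.0 session 
API (not part of the original report; it assumes an existing SparkSession named 
{{spark}}, e.g. in spark-shell):

{code}
// Re-expression of the repro above with the 2.0 API; both queries should be empty,
// but on affected versions the GROUP BY variant returns a row.
import spark.implicits._

Seq(128, 256).toDF("int_col").createOrReplaceTempView("mytable")

val withoutGroupBy = spark.sql("SELECT 'foo' FROM mytable WHERE int_col = 0")
val withGroupBy    = spark.sql("SELECT 'foo' FROM mytable WHERE int_col = 0 GROUP BY 1")

assert(withoutGroupBy.collect().isEmpty, "original query returned wrong answer")
assert(withGroupBy.collect().isEmpty, "adding GROUP BY resulted in wrong answer")
{code}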



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17465) Inappropriate memory management in `org.apache.spark.storage.MemoryStore` may lead to memory leak

2016-09-14 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-17465:
---
Assignee: Xing Shi

> Inappropriate memory management in `org.apache.spark.storage.MemoryStore` may 
> lead to memory leak
> -
>
> Key: SPARK-17465
> URL: https://issues.apache.org/jira/browse/SPARK-17465
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Xing Shi
>Assignee: Xing Shi
> Fix For: 1.6.3
>
>
> After updating Spark from 1.5.0 to 1.6.0, I found what seems to be a memory 
> leak in my Spark Streaming application.
> Here is the head of the heap histogram of my application, which has been 
> running for about 160 hours:
> {code:borderStyle=solid}
>  num #instances #bytes  class name
> --
>1: 28094   71753976  [B
>2:   1188086   28514064  java.lang.Long
>3:   1183844   28412256  scala.collection.mutable.DefaultEntry
>4:102242   13098768  
>5:102242   12421000  
>6:  81849199032  
>7:388391584  [Lscala.collection.mutable.HashEntry;
>8:  81847514288  
>9:  66514874080  
>   10: 371973438040  [C
>   11:  64232445640  
>   12:  87731044808  java.lang.Class
>   13: 36869 884856  java.lang.String
>   14: 15715 848368  [[I
>   15: 13690 782808  [S
>   16: 18903 604896  
> java.util.concurrent.ConcurrentHashMap$HashEntry
>   17:13 426192  [Lscala.concurrent.forkjoin.ForkJoinTask;
> {code}
> It shows that *scala.collection.mutable.DefaultEntry* and *java.lang.Long* 
> have unexpectedly large numbers of instances. In fact, the counts started 
> growing when the streaming process began, and they keep growing in proportion 
> to the total number of tasks.
> After some further investigation, I found that the problem is caused by 
> inappropriate memory management in the _releaseUnrollMemoryForThisTask_ and 
> _unrollSafely_ methods of class 
> [org.apache.spark.storage.MemoryStore|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala].
> In Spark 1.6.x, a _releaseUnrollMemoryForThisTask_ operation is processed only 
> when the parameter _memoryToRelease_ is greater than 0:
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L530-L537
> But if a task successfully unrolled all its blocks in memory via the 
> _unrollSafely_ method, the memory recorded in _unrollMemoryMap_ is set to 
> zero:
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L322
> As a result, the memory recorded in _unrollMemoryMap_ is released, but the 
> corresponding key is never removed from the hash map. The map keeps growing 
> as new tasks arrive. Although the growth is comparatively slow (roughly 
> dozens of bytes per task), it can eventually result in an OOM after weeks or 
> months.
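
A minimal sketch of the bookkeeping this description implies (illustrative 
only, not the actual Spark patch; the map name mirrors MemoryStore but 
everything else is simplified): drop a task's entry once its tracked unroll 
memory reaches zero, instead of only decrementing the value.

{code}
// Illustrative sketch, not the real fix: release unroll memory for a task and
// remove the task's entry once nothing is tracked for it any more, so the map
// cannot grow without bound as task attempts come and go.
import scala.collection.mutable

val unrollMemoryMap = mutable.HashMap[Long, Long]()   // taskAttemptId -> bytes

def releaseUnrollMemoryForTask(taskAttemptId: Long, memoryToRelease: Long): Unit =
  unrollMemoryMap.synchronized {
    val current  = unrollMemoryMap.getOrElse(taskAttemptId, 0L)
    val released = math.min(memoryToRelease, current)
    if (released > 0) {
      unrollMemoryMap(taskAttemptId) = current - released
      // hand `released` bytes back to the memory manager here
    }
    if (unrollMemoryMap.getOrElse(taskAttemptId, 0L) == 0L) {
      unrollMemoryMap.remove(taskAttemptId)   // removing the key prevents the leak described above
    }
  }
{code}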



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17465) Inappropriate memory management in `org.apache.spark.storage.MemoryStore` may lead to memory leak

2016-09-14 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-17465.

   Resolution: Fixed
Fix Version/s: 1.6.3

Issue resolved by pull request 15022
[https://github.com/apache/spark/pull/15022]

> Inappropriate memory management in `org.apache.spark.storage.MemoryStore` may 
> lead to memory leak
> -
>
> Key: SPARK-17465
> URL: https://issues.apache.org/jira/browse/SPARK-17465
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0, 1.6.1, 1.6.2
>Reporter: Xing Shi
> Fix For: 1.6.3
>
>
> After updating Spark from 1.5.0 to 1.6.0, I found what seems to be a memory 
> leak in my Spark Streaming application.
> Here is the head of the heap histogram of my application, which has been 
> running for about 160 hours:
> {code:borderStyle=solid}
>  num #instances #bytes  class name
> --
>1: 28094   71753976  [B
>2:   1188086   28514064  java.lang.Long
>3:   1183844   28412256  scala.collection.mutable.DefaultEntry
>4:102242   13098768  
>5:102242   12421000  
>6:  81849199032  
>7:388391584  [Lscala.collection.mutable.HashEntry;
>8:  81847514288  
>9:  66514874080  
>   10: 371973438040  [C
>   11:  64232445640  
>   12:  87731044808  java.lang.Class
>   13: 36869 884856  java.lang.String
>   14: 15715 848368  [[I
>   15: 13690 782808  [S
>   16: 18903 604896  
> java.util.concurrent.ConcurrentHashMap$HashEntry
>   17:13 426192  [Lscala.concurrent.forkjoin.ForkJoinTask;
> {code}
> It shows that *scala.collection.mutable.DefaultEntry* and *java.lang.Long* 
> have unexpectedly large numbers of instances. In fact, the counts started 
> growing when the streaming process began, and they keep growing in proportion 
> to the total number of tasks.
> After some further investigation, I found that the problem is caused by 
> inappropriate memory management in the _releaseUnrollMemoryForThisTask_ and 
> _unrollSafely_ methods of class 
> [org.apache.spark.storage.MemoryStore|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala].
> In Spark 1.6.x, a _releaseUnrollMemoryForThisTask_ operation is processed only 
> when the parameter _memoryToRelease_ is greater than 0:
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L530-L537
> But if a task successfully unrolled all its blocks in memory via the 
> _unrollSafely_ method, the memory recorded in _unrollMemoryMap_ is set to 
> zero:
> https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L322
> As a result, the memory recorded in _unrollMemoryMap_ is released, but the 
> corresponding key is never removed from the hash map. The map keeps growing 
> as new tasks arrive. Although the growth is comparatively slow (roughly 
> dozens of bytes per task), it can eventually result in an OOM after weeks or 
> months.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17546) start-* scripts should use hostname --fqdn

2016-09-14 Thread Kevin Burton (JIRA)
Kevin Burton created SPARK-17546:


 Summary: start-* scripts should use hostname --fqdn 
 Key: SPARK-17546
 URL: https://issues.apache.org/jira/browse/SPARK-17546
 Project: Spark
  Issue Type: Bug
Affects Versions: 2.0.0
Reporter: Kevin Burton


The ./sbin/start-slaves.sh and ./sbin/start-master.sh scripts use just 
'hostname', and if /etc/hostname doesn't contain an FQDN you don't get the 
fully qualified domain name.

If the scripts are changed to use 'hostname --fqdn', the problem is solved.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams

2016-09-14 Thread Jeff Nadler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491412#comment-15491412
 ] 

Jeff Nadler commented on SPARK-17510:
-

Well...  both streams use updateStateByKey.   The session one is simple+fast, 
the event one is complex+slow.

I have a branch now where I experiment with mapWithState.   I don't expect huge 
benefits because we need expiration, and I understand from the docs that this 
negates a lot of the benefit.   I'd still love to get some perf gains of course 
:)

Don't have a Kafka 0.10 cluster right now but could stand one up pretty quick.

> Set Streaming MaxRate Independently For Multiple Streams
> 
>
> Key: SPARK-17510
> URL: https://issues.apache.org/jira/browse/SPARK-17510
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Jeff Nadler
>
> We use multiple DStreams coming from different Kafka topics in a Streaming 
> application.
> Some settings like maxrate and backpressure enabled/disabled would be better 
> passed as config to KafkaUtils.createStream and 
> KafkaUtils.createDirectStream, instead of setting them in SparkConf.
> Being able to set a different maxrate for different streams is an important 
> requirement for us; we currently work around the problem by using one 
> receiver-based stream and one direct stream.   
> We would like to be able to turn on backpressure for only one of the streams 
> as well.   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17545) Spark SQL Catalyst doesn't handle ISO 8601 date with colon in offset

2016-09-14 Thread Nathan Beyer (JIRA)
Nathan Beyer created SPARK-17545:


 Summary: Spark SQL Catalyst doesn't handle ISO 8601 date with 
colon in offset
 Key: SPARK-17545
 URL: https://issues.apache.org/jira/browse/SPARK-17545
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Nathan Beyer


When parsing a CSV with a date/time column that contains an ISO 8601 variant 
that doesn't include a colon in the offset, casting to Timestamp fails.

Here's some simple example CSV content.
{quote}
time
"2015-07-20T15:09:23.736-0500"
"2015-07-20T15:10:51.687-0500"
"2015-11-21T23:15:01.499-0600"
{quote}

Here's the stack trace that results from processing this data.
{quote}
16/09/14 15:22:59 ERROR Utils: Aborting task
java.lang.IllegalArgumentException: 2015-11-21T23:15:01.499-0600
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown 
Source)
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown 
Source)
at 
org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.<init>(Unknown Source)
at 
org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown
 Source)
at 
javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422)
at 
javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417)
at 
javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327)
at 
org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140)
at 
org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:287)
{quote}

Somewhat related, I believe Python standard libraries can produce this form of 
zone offset. The system I got the data from is written in Python.
https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior
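
A possible user-side workaround (not from the original report; the column name 
and usage below are assumptions): pre-parse the string with java.time, whose 
pattern letter {{Z}} accepts offsets without a colon such as {{-0500}}, and 
hand Spark a proper timestamp, e.g. from a UDF, instead of relying on the CSV 
cast.

{code}
// Sketch of the workaround; java.time's pattern letter Z parses offsets like -0500.
import java.sql.Timestamp
import java.time.OffsetDateTime
import java.time.format.DateTimeFormatter

val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSZ")

def parseTs(s: String): Timestamp =
  Timestamp.from(OffsetDateTime.parse(s, fmt).toInstant)

// parseTs("2015-11-21T23:15:01.499-0600") yields a java.sql.Timestamp that a UDF
// can return, bypassing the failing cast.
{code}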



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17472) Better error message for serialization failures of large objects in Python

2016-09-14 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-17472.

   Resolution: Fixed
Fix Version/s: 2.1.0

Issue resolved by pull request 15026
[https://github.com/apache/spark/pull/15026]

> Better error message for serialization failures of large objects in Python
> --
>
> Key: SPARK-17472
> URL: https://issues.apache.org/jira/browse/SPARK-17472
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Eric Liang
>Priority: Minor
> Fix For: 2.1.0
>
>
> {code}
> def run():
>   import numpy.random as nr
>   b = nr.bytes(8 * 10)
>   sc.parallelize(range(1000), 1000).map(lambda x: len(b)).count()
> run()
> {code}
> Gives you the following error from pickle
> {code}
> error: 'i' format requires -2147483648 <= number <= 2147483647
> ---
> error Traceback (most recent call last)
>  in ()
>   4   sc.parallelize(range(1000), 1000).map(lambda x: len(b)).count()
>   5 
> > 6 run()
>  in run()
>   2   import numpy.random as nr
>   3   b = nr.bytes(8 * 10)
> > 4   sc.parallelize(range(1000), 1000).map(lambda x: len(b)).count()
>   5 
>   6 run()
> /databricks/spark/python/pyspark/rdd.pyc in count(self)
>1002 3
>1003 """
> -> 1004 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
>1005 
>1006 def stats(self):
> /databricks/spark/python/pyspark/rdd.pyc in sum(self)
> 993 6.0
> 994 """
> --> 995 return self.mapPartitions(lambda x: [sum(x)]).fold(0, 
> operator.add)
> 996 
> 997 def count(self):
> /databricks/spark/python/pyspark/rdd.pyc in fold(self, zeroValue, op)
> 867 # zeroValue provided to each partition is unique from the one 
> provided
> 868 # to the final reduce call
> --> 869 vals = self.mapPartitions(func).collect()
> 870 return reduce(op, vals, zeroValue)
> 871 
> /databricks/spark/python/pyspark/rdd.pyc in collect(self)
> 769 """
> 770 with SCCallSiteSync(self.context) as css:
> --> 771 port = 
> self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
> 772 return list(_load_from_socket(port, self._jrdd_deserializer))
> 773 
> /databricks/spark/python/pyspark/rdd.pyc in _jrdd(self)
>2377 command = (self.func, profiler, self._prev_jrdd_deserializer,
>2378self._jrdd_deserializer)
> -> 2379 pickled_cmd, bvars, env, includes = 
> _prepare_for_python_RDD(self.ctx, command, self)
>2380 python_rdd = self.ctx._jvm.PythonRDD(self._prev_jrdd.rdd(),
>2381  bytearray(pickled_cmd),
> /databricks/spark/python/pyspark/rdd.pyc in _prepare_for_python_RDD(sc, 
> command, obj)
>2297 # the serialized command will be compressed by broadcast
>2298 ser = CloudPickleSerializer()
> -> 2299 pickled_command = ser.dumps(command)
>2300 if len(pickled_command) > (1 << 20):  # 1M
>2301 # The broadcast will have same life cycle as created PythonRDD
> /databricks/spark/python/pyspark/serializers.pyc in dumps(self, obj)
> 426 
> 427 def dumps(self, obj):
> --> 428 return cloudpickle.dumps(obj, 2)
> 429 
> 430 
> /databricks/spark/python/pyspark/cloudpickle.pyc in dumps(obj, protocol)
> 655 
> 656 cp = CloudPickler(file,protocol)
> --> 657 cp.dump(obj)
> 658 
> 659 return file.getvalue()
> /databricks/spark/python/pyspark/cloudpickle.pyc in dump(self, obj)
> 105 self.inject_addons()
> 106 try:
> --> 107 return Pickler.dump(self, obj)
> 108 except RuntimeError as e:
> 109 if 'recursion' in e.args[0]:
> /usr/lib/python2.7/pickle.pyc in dump(self, obj)
> 222 if self.proto >= 2:
> 223 self.write(PROTO + chr(self.proto))
> --> 224 self.save(obj)
> 225 self.write(STOP)
> 226 
> /usr/lib/python2.7/pickle.pyc in save(self, obj)
> 284 f = self.dispatch.get(t)
> 285 if f:
> --> 286 f(self, obj) # Call unbound method with explicit self
> 287 return
> 288 
> /usr/lib/python2.7/pickle.pyc in save_tuple(self, obj)
> 560 write(MARK)
> 561 for element in obj:
> --> 562 save(element)
> 563 
> 564 if id(obj) in memo:
> /usr/lib/python2.7/pickle.pyc in save(self, obj)
> 284 f = self.dispatch.get(t)
> 285 if f:
> --> 286 f(self, obj) # C

[jira] [Commented] (SPARK-17544) Timeout waiting for connection from pool, DataFrame Reader's not closing S3 connections?

2016-09-14 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491397#comment-15491397
 ] 

Davies Liu commented on SPARK-17544:


Could you post some code to reproduce the issue?

> Timeout waiting for connection from pool, DataFrame Reader's not closing S3 
> connections?
> 
>
> Key: SPARK-17544
> URL: https://issues.apache.org/jira/browse/SPARK-17544
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.0.0
> Environment: Amazon EMR, S3, Scala
>Reporter: Brady Auen
>  Labels: newbie
>
> I have an application that loops through a text file to find files in S3 and 
> then reads them in, performs some ETL processes, and then writes them out. 
> This works for around 80 loops until I get this:  
>   
> 16/09/14 18:58:23 INFO S3NativeFileSystem: Opening 
> 's3://webpt-emr-us-west-2-hadoop/edw/master/fdbk_ideavote/20160907/part-m-0.avro'
>  for reading
> 16/09/14 18:59:13 INFO AmazonHttpClient: Unable to execute HTTP request: 
> Timeout waiting for connection from pool
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException:
>  Timeout waiting for connection from pool
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195)
>   at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.$Proxy37.getConnection(Unknown
>  Source)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:837)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015)
>   at 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991)
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:212)
>   at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>   at com.sun.proxy.$Proxy36.retrieveMetadata(Unknown Source)
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:780)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1428)
>   at 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:313)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:289)
>   at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:324)
>   at 
> org.apache.spark.sql.

[jira] [Resolved] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe

2016-09-14 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-17463.

   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 15063
[https://github.com/apache/spark/pull/15063]

> Serialization of accumulators in heartbeats is not thread-safe
> --
>
> Key: SPARK-17463
> URL: https://issues.apache.org/jira/browse/SPARK-17463
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: Josh Rosen
>Assignee: Shixiong Zhu
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Check out the following {{ConcurrentModificationException}}:
> {code}
> 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = 
> Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 
> attempts
> org.apache.spark.SparkException: Exception thrown in awaitResult
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
> at 
> scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at 
> org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
> at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
> at 
> org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at 
> org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862)
> at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.ConcurrentModificationException
> at java.util.ArrayList.writeObject(ArrayList.java:766)
> at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at 
> java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at 
> java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
> at 
> java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
> at 
> java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
> at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
> at java.io.ObjectOutputStream.writeArray(ObjectOut
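
The trace above is the classic symptom of serializing a collection while 
another thread is mutating it. As an illustration of the general remedy (not 
the actual Spark patch), serializing a snapshot taken under the same lock that 
guards mutation avoids the {{ConcurrentModificationException}}:

{code}
// Illustration only: serialize a defensive copy taken under the lock, never the live list.
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.util.{ArrayList => JArrayList}

def serializeSnapshot[T](live: JArrayList[T], lock: AnyRef): Array[Byte] = {
  val snapshot = lock.synchronized(new JArrayList[T](live)) // copy while holding the lock
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  try out.writeObject(snapshot) finally out.close()
  bytes.toByteArray
}
{code}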

[jira] [Commented] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams

2016-09-14 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491389#comment-15491389
 ] 

Cody Koeninger commented on SPARK-17510:


Just for clarity's sake, compute time is far higher on the stream that is using 
updateStateByKey?  Have you tried mapWithState?

Changing max rate to be per-partition isn't actually a big change in terms of 
the number of lines; the calculations are already done per partition because of 
backpressure.  It's more a question of whether it's worth adding more surface 
area to the creation API.  If I make a branch, are you in a position to test it 
with a Kafka 0.10 cluster, or not?

> Set Streaming MaxRate Independently For Multiple Streams
> 
>
> Key: SPARK-17510
> URL: https://issues.apache.org/jira/browse/SPARK-17510
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Jeff Nadler
>
> We use multiple DStreams coming from different Kafka topics in a Streaming 
> application.
> Some settings like maxrate and backpressure enabled/disabled would be better 
> passed as config to KafkaUtils.createStream and 
> KafkaUtils.createDirectStream, instead of setting them in SparkConf.
> Being able to set a different maxrate for different streams is an important 
> requirement for us; we currently work around the problem by using one 
> receiver-based stream and one direct stream.   
> We would like to be able to turn on backpressure for only one of the streams 
> as well.   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17511) Dynamic allocation race condition: Containers getting marked failed while releasing

2016-09-14 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-17511.
---
   Resolution: Fixed
 Assignee: Kishor Patil
Fix Version/s: 2.1.0
   2.0.1

> Dynamic allocation race condition: Containers getting marked failed while 
> releasing
> ---
>
> Key: SPARK-17511
> URL: https://issues.apache.org/jira/browse/SPARK-17511
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
>Reporter: Kishor Patil
>Assignee: Kishor Patil
> Fix For: 2.0.1, 2.1.0
>
>
> While trying to launch multiple containers from the pool, if the running 
> executor count reaches or goes beyond the target number of running executors, 
> the container is released and marked as failed. Enough of these failures can 
> cause overall job failure.
> I will have a patch up soon after completing testing.
> {panel:title=Typical exception found in the Driver marking the container as 
> Failed}
> {code}
> java.lang.AssertionError: assertion failed
> at scala.Predef$.assert(Predef.scala:156)
> at 
> org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1.org$apache$spark$deploy$yarn$YarnAllocator$$anonfun$$updateInternalState$1(YarnAllocator.scala:489)
> at 
> org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1$$anon$1.run(YarnAllocator.scala:519)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {code}
> {panel}
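
A purely illustrative sketch of the kind of guard the description calls for; 
all names below are hypothetical and this is not the actual YarnAllocator code:

{code}
// Hypothetical, simplified guard: only record a launched container while still
// below the target; otherwise hand the container back without counting a failure.
def handleAllocatedContainer(
    containerId: String,
    runningExecutors: Int,
    targetExecutors: Int,
    launch: String => Unit,       // launch an executor in this container and record it
    release: String => Unit): Unit = {
  if (runningExecutors < targetExecutors) {
    launch(containerId)
  } else {
    // Target already reached (e.g. dynamic allocation lowered it while requests
    // were in flight): release the container instead of marking an executor failed.
    release(containerId)
  }
}
{code}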



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16424) Add support for Structured Streaming to the ML Pipeline API

2016-09-14 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491250#comment-15491250
 ] 

holdenk commented on SPARK-16424:
-

Just an update - we have a really early proof-of-concept branch showing that 
something like this approach can work. After the Strata talk later this month 
we're planning on collecting any additional feedback and getting that proof of 
concept into a shape where we can make an early draft PR for people who are 
more interested in reviewing code than looking at design documents.

While I've reached out to [~marmbrus] & [~rams] over e-mail, I'm pinging the 
JIRA as well in case you have a chance to take a look. Also cc [~tdas], who was 
helpful in answering some of my questions; if you have any bandwidth to look at 
this, that would be greatly appreciated.

> Add support for Structured Streaming to the ML Pipeline API
> ---
>
> Key: SPARK-16424
> URL: https://issues.apache.org/jira/browse/SPARK-16424
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, SQL, Streaming
>Reporter: holdenk
>
> For Spark 2.1 we should consider adding support for machine learning on top 
> of the structured streaming API.
> Early work in progress design outline: 
> https://docs.google.com/document/d/1snh7x7b0dQIlTsJNHLr-IxIFgP43RfRV271YK2qGiFQ/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17544) Timeout waiting for connection from pool, DataFrame Reader's not closing S3 connections?

2016-09-14 Thread Brady Auen (JIRA)
Brady Auen created SPARK-17544:
--

 Summary: Timeout waiting for connection from pool, DataFrame 
Reader's not closing S3 connections?
 Key: SPARK-17544
 URL: https://issues.apache.org/jira/browse/SPARK-17544
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.0.0
 Environment: Amazon EMR, S3, Scala
Reporter: Brady Auen


I have an application that loops through a text file to find files in S3 and 
then reads them in, performs some ETL processes, and then writes them out. This 
works for around 80 loops until I get this:  
  
16/09/14 18:58:23 INFO S3NativeFileSystem: Opening 
's3://webpt-emr-us-west-2-hadoop/edw/master/fdbk_ideavote/20160907/part-m-0.avro'
 for reading
16/09/14 18:59:13 INFO AmazonHttpClient: Unable to execute HTTP request: 
Timeout waiting for connection from pool
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException:
 Timeout waiting for connection from pool
at 
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226)
at 
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195)
at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.$Proxy37.getConnection(Unknown
 Source)
at 
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423)
at 
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
at 
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at 
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:837)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015)
at 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991)
at 
com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:212)
at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy36.retrieveMetadata(Unknown Source)
at 
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:780)
at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1428)
at 
com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:313)
at 
org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:289)
at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:324)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$18.apply(DataSource.scala:452)
at 
org.apache.spark.sql.execution.datasources.DataSource$$anonfun$18.apply(DataSource.scala:458)
at scala.util.Try$.apply(Try.scala:192)
at 
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:451)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211)
at org.apache

[jira] [Resolved] (SPARK-10747) add support for NULLS FIRST|LAST in ORDER BY clause

2016-09-14 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-10747.
---
   Resolution: Fixed
 Assignee: Xin Wu
Fix Version/s: 2.1.0

> add support for NULLS FIRST|LAST in ORDER BY clause
> ---
>
> Key: SPARK-10747
> URL: https://issues.apache.org/jira/browse/SPARK-10747
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.0
>Reporter: N Campbell
>Assignee: Xin Wu
> Fix For: 2.1.0
>
>
> You cannot express how NULLS are to be sorted in the window order 
> specification and have to use a compensating expression to simulate it.
> Error: org.apache.spark.sql.AnalysisException: line 1:76 missing ) at 'nulls' 
> near 'nulls'
> line 1:82 missing EOF at 'last' near 'nulls';
> SQLState:  null
> (Same limitation as Hive, reported in Apache JIRA HIVE-9535.)
> This fails:
> select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by c3 desc 
> nulls last) from tolap
> while this compensating expression is the workaround:
> select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by case when 
> c3 is null then 1 else 0 end) from tolap
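
With this resolved for 2.1.0, a query like the following sketch should parse 
directly instead of needing the CASE workaround (table and column names are 
taken from the examples above; {{spark}} is an assumed SparkSession):

{code}
// Sketch against the NULLS FIRST|LAST syntax this issue adds in 2.1.0.
val ranked = spark.sql(
  """SELECT rnum, c1, c2, c3,
    |       dense_rank() OVER (PARTITION BY c1 ORDER BY c3 DESC NULLS LAST) AS rk
    |  FROM tolap""".stripMargin)
ranked.show()
{code}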



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17508) Setting weightCol to None in ML library causes an error

2016-09-14 Thread Evan Zamir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491186#comment-15491186
 ] 

Evan Zamir edited comment on SPARK-17508 at 9/14/16 6:53 PM:
-

Honestly, if the documentation was just more explicit, users wouldn't be so 
confused. But when it says {{weightCol=None}}, there's only one way we can 
interpret that in Python, and it happens to produce an error. Why doesn't 
someone just change the docstring to read {{weightCol=""}} (which apparently is 
the way one has to write the code to run without error)?


was (Author: zamir.e...@gmail.com):
Honestly, if the documentation was just more explicit, users wouldn't be so 
confused. But when it says {{weightCol=None}}, there's only one way we can 
interpret that in Python, and it happens to produce an error. Why doesn't 
someone just change the docstring to read {{weightCol=""}}?

> Setting weightCol to None in ML library causes an error
> ---
>
> Key: SPARK-17508
> URL: https://issues.apache.org/jira/browse/SPARK-17508
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Evan Zamir
>Priority: Minor
>
> The following code runs without error:
> {code}
> spark = SparkSession.builder.appName('WeightBug').getOrCreate()
> df = spark.createDataFrame(
> [
> (1.0, 1.0, Vectors.dense(1.0)),
> (0.0, 1.0, Vectors.dense(-1.0))
> ],
> ["label", "weight", "features"])
> lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight")
> model = lr.fit(df)
> {code}
> My expectation from reading the documentation is that setting weightCol=None 
> should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors:
> Traceback (most recent call last):
>   File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in 
> 
> model = lr.fit(df)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line 
> 64, in fit
> return self._fit(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 213, in _fit
> java_model = self._fit_java(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
> return self._java_obj.fit(dataset._jdf)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>  line 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit.
> : java.lang.NullPointerException
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:211)
>   at java.lang.Thread.run(Thread.java:745)
> Process finished with exit code 1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17508) Setting weightCol to None in ML library causes an error

2016-09-14 Thread Evan Zamir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491186#comment-15491186
 ] 

Evan Zamir edited comment on SPARK-17508 at 9/14/16 6:52 PM:
-

Honestly, if the documentation was just more explicit, users wouldn't be so 
confused. But when it says {{weightCol=None}}, there's only one way we can 
interpret that in Python, and it happens to produce an error. Why doesn't 
someone just change the docstring to read {{weightCol=""}}?


was (Author: zamir.e...@gmail.com):
Honestly, if the documentation was just more explicit, users wouldn't be so 
confused. But when it says `weightCol=None`, there's only one way we can 
interpret that in Python, and it happens to produce an error. Why doesn't 
someone just change the docstring to read `weightCol=""`?

> Setting weightCol to None in ML library causes an error
> ---
>
> Key: SPARK-17508
> URL: https://issues.apache.org/jira/browse/SPARK-17508
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Evan Zamir
>Priority: Minor
>
> The following code runs without error:
> {code}
> spark = SparkSession.builder.appName('WeightBug').getOrCreate()
> df = spark.createDataFrame(
> [
> (1.0, 1.0, Vectors.dense(1.0)),
> (0.0, 1.0, Vectors.dense(-1.0))
> ],
> ["label", "weight", "features"])
> lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight")
> model = lr.fit(df)
> {code}
> My expectation from reading the documentation is that setting weightCol=None 
> should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors:
> Traceback (most recent call last):
>   File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in 
> 
> model = lr.fit(df)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line 
> 64, in fit
> return self._fit(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 213, in _fit
> java_model = self._fit_java(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
> return self._java_obj.fit(dataset._jdf)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>  line 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit.
> : java.lang.NullPointerException
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:211)
>   at java.lang.Thread.run(Thread.java:745)
> Process finished with exit code 1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error

2016-09-14 Thread Evan Zamir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491186#comment-15491186
 ] 

Evan Zamir commented on SPARK-17508:


Honestly, if the documentation was just more explicit, users wouldn't be so 
confused. But when it says `weightCol=None`, there's only one way we can 
interpret that in Python, and it happens to produce an error. Why doesn't 
someone just change the docstring to read `weightCol=""`?

> Setting weightCol to None in ML library causes an error
> ---
>
> Key: SPARK-17508
> URL: https://issues.apache.org/jira/browse/SPARK-17508
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Evan Zamir
>Priority: Minor
>
> The following code runs without error:
> {code}
> spark = SparkSession.builder.appName('WeightBug').getOrCreate()
> df = spark.createDataFrame(
> [
> (1.0, 1.0, Vectors.dense(1.0)),
> (0.0, 1.0, Vectors.dense(-1.0))
> ],
> ["label", "weight", "features"])
> lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight")
> model = lr.fit(df)
> {code}
> My expectation from reading the documentation is that setting weightCol=None 
> should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors:
> Traceback (most recent call last):
>   File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in 
> 
> model = lr.fit(df)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line 
> 64, in fit
> return self._fit(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 213, in _fit
> java_model = self._fit_java(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
> return self._java_obj.fit(dataset._jdf)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>  line 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit.
> : java.lang.NullPointerException
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:211)
>   at java.lang.Thread.run(Thread.java:745)
> Process finished with exit code 1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error

2016-09-14 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491163#comment-15491163
 ] 

Bryan Cutler commented on SPARK-17508:
--

I had a similar discussion in this PR 
https://github.com/apache/spark/pull/12790 about supporting a param value of 
{{None}}.  I think people mostly just wanted to keep PySpark in sync with the 
Scala API and not add additional support for {{None}}.

> Setting weightCol to None in ML library causes an error
> ---
>
> Key: SPARK-17508
> URL: https://issues.apache.org/jira/browse/SPARK-17508
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.0.0
>Reporter: Evan Zamir
>Priority: Minor
>
> The following code runs without error:
> {code}
> spark = SparkSession.builder.appName('WeightBug').getOrCreate()
> df = spark.createDataFrame(
> [
> (1.0, 1.0, Vectors.dense(1.0)),
> (0.0, 1.0, Vectors.dense(-1.0))
> ],
> ["label", "weight", "features"])
> lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight")
> model = lr.fit(df)
> {code}
> My expectation from reading the documentation is that setting weightCol=None 
> should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors:
> Traceback (most recent call last):
>   File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in 
> 
> model = lr.fit(df)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line 
> 64, in fit
> return self._fit(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 213, in _fit
> java_model = self._fit_java(dataset)
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
> return self._java_obj.fit(dataset._jdf)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", 
> line 63, in deco
> return f(*a, **kw)
>   File 
> "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py",
>  line 312, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit.
> : java.lang.NullPointerException
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259)
>   at 
> org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:280)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:211)
>   at java.lang.Thread.run(Thread.java:745)
> Process finished with exit code 1



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams

2016-09-14 Thread Jeff Nadler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491165#comment-15491165
 ] 

Jeff Nadler commented on SPARK-17510:
-


Yes, you're right - it's partly about differing rates, but the big issue is that 
the compute time is far higher on one stream vs the other. That's the only 
reason we need rate limiting at all, really. Overprovisioning would be 
expensive in this case.

We have run at a higher maxRate and let backpressure manage it. It works, but 
it's suboptimal. During surges of data, the rate limiter fluctuates widely up 
and down as scheduling delay builds up and is burned off. The algorithm is such 
that it never levels out, though that's a separate opportunity for improvement 
in the backpressure implementation. When these big fluctuations in inbound rate 
happen, the total throughput over a longer time period (per hour) is far lower 
than it would be if we just calibrated a maxRate that reflects (or comes close 
to) what our cluster is capable of handling.

Also, just to be clear, I understand that this is probably a long-haul change, 
and nothing that's going to solve our issues at this time. It seems to me like 
this would be the right long-term direction, but you all who are committers may 
not agree. For sure I am starting to feel like I'm stacking up workarounds to 
make Spark Streaming viable for our particular use.

In my little fantasyland, there would be a separate "SparkConf" for global 
settings and a "StreamConf" that could be passed to the various KafkaUtils.* 
functions to set the spark.streaming.* settings independently for each stream.
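For reference, a PySpark sketch of the existing application-wide SparkConf keys that this ticket would effectively make per-stream (the per-stream "StreamConf" above is hypothetical and does not exist today):
{code}
from pyspark import SparkConf

# Today these settings are global: they apply to every stream created from the
# same SparkContext, which is the limitation being discussed here.
conf = (SparkConf()
        .setAppName("multi-topic-streaming")
        # rate cap for receiver-based streams, in records/sec per receiver
        .set("spark.streaming.receiver.maxRate", "10000")
        # rate cap for the Kafka direct stream, in records/sec per partition
        .set("spark.streaming.kafka.maxRatePerPartition", "10000")
        # backpressure can only be switched on or off for the whole application
        .set("spark.streaming.backpressure.enabled", "true"))
{code}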

> Set Streaming MaxRate Independently For Multiple Streams
> 
>
> Key: SPARK-17510
> URL: https://issues.apache.org/jira/browse/SPARK-17510
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Jeff Nadler
>
> We use multiple DStreams coming from different Kafka topics in a Streaming 
> application.
> Some settings like maxrate and backpressure enabled/disabled would be better 
> passed as config to KafkaUtils.createStream and 
> KafkaUtils.createDirectStream, instead of setting them in SparkConf.
> Being able to set a different maxrate for different streams is an important 
> requirement for us; we currently work around the problem by using one 
> receiver-based stream and one direct stream.   
> We would like to be able to turn on backpressure for only one of the streams 
> as well.   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16534) Kafka 0.10 Python support

2016-09-14 Thread Maciej Bryński (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491107#comment-15491107
 ] 

Maciej Bryński commented on SPARK-16534:


[~rxin]
Could you explain your decision?

I think that dropping Python support is a very bad sign for the Spark community.
In my case I started with batch jobs in Python.
Having the ability to use the same code in a streaming job was the main reason 
to choose Spark.
Eventually I'm here with production code written in Python supporting both 
batch and streaming.
And I won't be able to use new Kafka features like SSL. Never.
Am I right? Or maybe I missed something.
As far as I understand, your point is that support for Python in Spark 
Streaming is wrong and we shouldn't develop new features.
I don't agree with that.

> Kafka 0.10 Python support
> -
>
> Key: SPARK-16534
> URL: https://issues.apache.org/jira/browse/SPARK-16534
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams

2016-09-14 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491087#comment-15491087
 ] 

Cody Koeninger edited comment on SPARK-17510 at 9/14/16 6:12 PM:
-

I use direct stream for multiple topic jobs where the rate varies by a factor 
of 1000x or more across topics, so I don't agree that case was "not really 
considered"  :)

It sounds more like the issue you're running into is that the amount of work 
per item varies widely across topics, not the rate across topics? 

Have you actually tried setting max rate at a level that's just sufficient to 
prevent a crash on initial startup, then letting backpressure take over from 
there?

Edit - Also, you mentioned using updateStateByKey. updateStateByKey does not 
scale particularly well; have you investigated mapWithState?


was (Author: c...@koeninger.org):
I use direct stream for multiple topic jobs where the rate varies by a factor 
of 1000x or more across topics, so I don't agree that case was "not really 
considered"  :)

It sounds more like the issue you're running into is that the amount of work 
per item varies widely across topics, not the rate across topics? 

Have you actually tried setting max rate at a level that's just sufficient to 
prevent a crash on initial startup, then letting backpressure take over from 
there?



> Set Streaming MaxRate Independently For Multiple Streams
> 
>
> Key: SPARK-17510
> URL: https://issues.apache.org/jira/browse/SPARK-17510
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Jeff Nadler
>
> We use multiple DStreams coming from different Kafka topics in a Streaming 
> application.
> Some settings like maxrate and backpressure enabled/disabled would be better 
> passed as config to KafkaUtils.createStream and 
> KafkaUtils.createDirectStream, instead of setting them in SparkConf.
> Being able to set a different maxrate for different streams is an important 
> requirement for us; we currently work around the problem by using one 
> receiver-based stream and one direct stream.   
> We would like to be able to turn on backpressure for only one of the streams 
> as well.   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams

2016-09-14 Thread Cody Koeninger (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491087#comment-15491087
 ] 

Cody Koeninger commented on SPARK-17510:


I use direct stream for multiple topic jobs where the rate varies by a factor 
of 1000x or more across topics, so I don't agree that case was "not really 
considered"  :)

It sounds more like the issue you're running into is that the amount of work 
per item varies widely across topics, not the rate across topics? 

Have you actually tried setting max rate at a level that's just sufficient to 
prevent a crash on initial startup, then letting backpressure take over from 
there?



> Set Streaming MaxRate Independently For Multiple Streams
> 
>
> Key: SPARK-17510
> URL: https://issues.apache.org/jira/browse/SPARK-17510
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Jeff Nadler
>
> We use multiple DStreams coming from different Kafka topics in a Streaming 
> application.
> Some settings like maxrate and backpressure enabled/disabled would be better 
> passed as config to KafkaUtils.createStream and 
> KafkaUtils.createDirectStream, instead of setting them in SparkConf.
> Being able to set a different maxrate for different streams is an important 
> requirement for us; we currently work around the problem by using one 
> receiver-based stream and one direct stream.   
> We would like to be able to turn on backpressure for only one of the streams 
> as well.   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17542) Compiler warning in UnsafeInMemorySorter class

2016-09-14 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-17542:
--
Issue Type: Improvement  (was: Bug)

There are unfortunately a number of warnings, and I don't think we want a JIRA 
for each of them. Somebody whacks a lot of them periodically. So, sure, they're 
good to resolve but I'd be more interested in sweeps that clean up as many 
warnings as possible. Some are un-suppressable and un-resolvable, like tests 
for deprecated methods in Scala. This isn't a bug though.

> Compiler warning in UnsafeInMemorySorter class
> --
>
> Key: SPARK-17542
> URL: https://issues.apache.org/jira/browse/SPARK-17542
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Frederick Reiss
>Priority: Trivial
>  Labels: starter
>
> *This is a small starter task to help new contributors practice the pull 
> request and code review process.*
> When building Spark with Java 8, there is a compiler warning in the class 
> UnsafeInMemorySorter:
> {noformat}
> [WARNING] 
> .../core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:284:
>  warning: [static] 
> static method should be qualified by type name, TaskMemoryManager, instead of
>  by an expression
> {noformat}
> Spark should compile without these kinds of warnings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17317) Add package vignette to SparkR

2016-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15490995#comment-15490995
 ] 

Apache Spark commented on SPARK-17317:
--

User 'junyangq' has created a pull request for this issue:
https://github.com/apache/spark/pull/15100

> Add package vignette to SparkR
> --
>
> Key: SPARK-17317
> URL: https://issues.apache.org/jira/browse/SPARK-17317
> Project: Spark
>  Issue Type: Improvement
>Reporter: Junyang Qian
>Assignee: Junyang Qian
> Fix For: 2.1.0
>
>
> In publishing SparkR to CRAN, it would be nice to have a vignette as a user 
> guide that
> * describes the big picture
> * introduces the use of various methods
> This is important for new users because they may not even know which method 
> to look up.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17514) df.take(1) and df.limit(1).collect() perform differently in Python

2016-09-14 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu resolved SPARK-17514.

   Resolution: Fixed
Fix Version/s: 2.1.0
   2.0.1

Issue resolved by pull request 15068
[https://github.com/apache/spark/pull/15068]

> df.take(1) and df.limit(1).collect() perform differently in Python
> --
>
> Key: SPARK-17514
> URL: https://issues.apache.org/jira/browse/SPARK-17514
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Josh Rosen
>Assignee: Josh Rosen
> Fix For: 2.0.1, 2.1.0
>
>
> In PySpark, {{df.take(1)}} ends up running a single-stage job which computes 
> only one partition of {{df}}, while {{df.limit(1).collect()}} ends up 
> computing all partitions of {{df}} and runs a two-stage job. This difference 
> in performance is confusing, so I think that we should generalize the fix 
> from SPARK-10731 so that {{Dataset.collect()}} can be implemented efficiently 
> in Python.
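A minimal PySpark illustration of the two call shapes being compared (a sketch, assuming an active SparkSession named {{spark}}):
{code}
# Both expressions return the same single row, but in 2.0 df.take(1) runs a
# single-stage job that computes only one partition, while
# df.limit(1).collect() computes all partitions in a two-stage job.
df = spark.range(0, 1000000, 1, numPartitions=100)

first_via_take = df.take(1)
first_via_limit = df.limit(1).collect()
{code}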



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17543) Missing log4j config file for tests in common/network-shuffle

2016-09-14 Thread Frederick Reiss (JIRA)
Frederick Reiss created SPARK-17543:
---

 Summary: Missing log4j config file for tests in 
common/network-shuffle
 Key: SPARK-17543
 URL: https://issues.apache.org/jira/browse/SPARK-17543
 Project: Spark
  Issue Type: Bug
Reporter: Frederick Reiss
Priority: Trivial


*This is a small starter task to help new contributors practice the pull 
request and code review process.*

The Maven module {{common/network-shuffle}} does not have a log4j configuration 
file for its test cases. Usually these configuration files are located inside 
each module, in the directory {{src/test/resources}}. The missing configuration 
file leads to a scary-looking but harmless series of errors and stack traces in 
Spark build logs:
{noformat}
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further 
details.
log4j:ERROR Could not read configuration file from URL 
[file:src/test/resources/log4j.properties].
java.io.FileNotFoundException: src/test/resources/log4j.properties (No such 
file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.<init>(FileInputStream.java:146)
at java.io.FileInputStream.<init>(FileInputStream.java:101)
at 
sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
at 
sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
at 
org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
at 
org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
at org.apache.log4j.Logger.getLogger(Logger.java:104)
at 
io.netty.util.internal.logging.Log4JLoggerFactory.newInstance(Log4JLoggerFactory.java:29)
at 
io.netty.util.internal.logging.InternalLoggerFactory.newDefaultFactory(InternalLoggerFactory.java:46)
at 
io.netty.util.internal.logging.InternalLoggerFactory.<clinit>(InternalLoggerFactory.java:34)
...
{noformat}
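A minimal sketch of such a test configuration file (assuming the usual log4j 1.x properties syntax and logging to a file under target/; the exact contents are for the eventual pull request to decide):
{noformat}
# src/test/resources/log4j.properties -- minimal sketch
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.append=true
log4j.appender.file.file=target/unit-tests.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n
{noformat}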



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17542) Compiler warning in UnsafeInMemorySorter class

2016-09-14 Thread Frederick Reiss (JIRA)
Frederick Reiss created SPARK-17542:
---

 Summary: Compiler warning in UnsafeInMemorySorter class
 Key: SPARK-17542
 URL: https://issues.apache.org/jira/browse/SPARK-17542
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Frederick Reiss
Priority: Trivial


*This is a small starter task to help new contributors practice the pull 
request and code review process.*

When building Spark with Java 8, there is a compiler warning in the class 
UnsafeInMemorySorter:
{noformat}
[WARNING] 
.../core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:284:
 warning: [static] 
static method should be qualified by type name, TaskMemoryManager, instead of
 by an expression
{noformat}

Spark should compile without these kinds of warnings.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17541) fix some DDL bugs about table management when same-name temp view exists

2016-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15490922#comment-15490922
 ] 

Apache Spark commented on SPARK-17541:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/15099

> fix some DDL bugs about table management when same-name temp view exists
> 
>
> Key: SPARK-17541
> URL: https://issues.apache.org/jira/browse/SPARK-17541
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17541) fix some DDL bugs about table management when same-name temp view exists

2016-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17541:


Assignee: Wenchen Fan  (was: Apache Spark)

> fix some DDL bugs about table management when same-name temp view exists
> 
>
> Key: SPARK-17541
> URL: https://issues.apache.org/jira/browse/SPARK-17541
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17541) fix some DDL bugs about table management when same-name temp view exists

2016-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17541:


Assignee: Apache Spark  (was: Wenchen Fan)

> fix some DDL bugs about table management when same-name temp view exists
> 
>
> Key: SPARK-17541
> URL: https://issues.apache.org/jira/browse/SPARK-17541
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17541) fix some DDL bugs about table management when same-name temp view exists

2016-09-14 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-17541:
---

 Summary: fix some DDL bugs about table management when same-name 
temp view exists
 Key: SPARK-17541
 URL: https://issues.apache.org/jira/browse/SPARK-17541
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17536) Minor performance improvement to JDBC batch inserts

2016-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17536:


Assignee: (was: Apache Spark)

> Minor performance improvement to JDBC batch inserts
> ---
>
> Key: SPARK-17536
> URL: https://issues.apache.org/jira/browse/SPARK-17536
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: John Muller
>Priority: Trivial
>  Labels: perfomance
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The JDBC batch insert path currently re-evaluates the number of fields for 
> every row inside the row iterator loop:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L598
> val numFields = rddSchema.fields.length
> This value does not change, so it can be computed once before the loop.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17536) Minor performance improvement to JDBC batch inserts

2016-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15490849#comment-15490849
 ] 

Apache Spark commented on SPARK-17536:
--

User 'blue666man' has created a pull request for this issue:
https://github.com/apache/spark/pull/15098

> Minor performance improvement to JDBC batch inserts
> ---
>
> Key: SPARK-17536
> URL: https://issues.apache.org/jira/browse/SPARK-17536
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: John Muller
>Priority: Trivial
>  Labels: perfomance
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The JDBC batch insert path currently re-evaluates the number of fields for 
> every row inside the row iterator loop:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L598
> val numFields = rddSchema.fields.length
> This value does not change, so it can be computed once before the loop.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17536) Minor performance improvement to JDBC batch inserts

2016-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17536:


Assignee: Apache Spark

> Minor performance improvement to JDBC batch inserts
> ---
>
> Key: SPARK-17536
> URL: https://issues.apache.org/jira/browse/SPARK-17536
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: John Muller
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: perfomance
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The JDBC batch insert path currently re-evaluates the number of fields for 
> every row inside the row iterator loop:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L598
> val numFields = rddSchema.fields.length
> This value does not change, so it can be computed once before the loop.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15835) The read path of json doesn't support write path when schema contains Options

2016-09-14 Thread Chris Horn (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15490836#comment-15490836
 ] 

Chris Horn commented on SPARK-15835:


You can work around this issue by providing the schema StructType up front to 
the JSON reader.

This also has the added benefit of not eagerly scanning the entire JSON data 
set to derive a schema.

{code}
scala> 
spark.read.schema(org.apache.spark.sql.Encoders.product[Bug].schema).json(path).as[Bug].collect
res8: Array[Bug] = Array(Bug(abc,None))

scala> 
spark.read.schema(org.apache.spark.sql.Encoders.product[Bug].schema).json(path).collect
res9: Array[org.apache.spark.sql.Row] = Array([abc,null])
{code}

> The read path of json doesn't support write path when schema contains Options
> -
>
> Key: SPARK-15835
> URL: https://issues.apache.org/jira/browse/SPARK-15835
> Project: Spark
>  Issue Type: Bug
>Reporter: Burak Yavuz
>
> My schema contains optional fields. When these fields are written as JSON 
> (and the field is None in every record), the field is omitted from the output. 
> When reading, the field can't be found and this throws an exception.
> Either the writes should include the field as `null`, or the Dataset 
> should not require the field to exist in the DataFrame when the field is an 
> Option (which may be the better solution).
> {code}
> case class Bug(field1: String, field2: Option[String])
> Seq(Bug("abc", None)).toDS.write.json("/tmp/sqlBug")
> spark.read.json("/tmp/sqlBug").as[Bug]
> {code}
> stack trace:
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '`field2`' given input 
> columns: [field1]
> at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:62)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:59)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17540) SparkR array serde cannot work correctly when array length == 0

2016-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17540:


Assignee: Apache Spark

> SparkR array serde cannot work correctly when array length == 0
> ---
>
> Key: SPARK-17540
> URL: https://issues.apache.org/jira/browse/SPARK-17540
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SparkR
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>Assignee: Apache Spark
>
> SparkR cannot handle array serde correctly when the array length == 0.
> When the length is 0, the R side sets the element type as class("somestring"), 
> so the Scala side receives it as a string array. But the array being 
> transferred may have another element type, which causes problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17540) SparkR array serde cannot work correctly when array length == 0

2016-09-14 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-17540:


Assignee: (was: Apache Spark)

> SparkR array serde cannot work correctly when array length == 0
> ---
>
> Key: SPARK-17540
> URL: https://issues.apache.org/jira/browse/SPARK-17540
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SparkR
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>
> SparkR cannot handle array serde correctly when the array length == 0.
> When the length is 0, the R side sets the element type as class("somestring"), 
> so the Scala side receives it as a string array. But the array being 
> transferred may have another element type, which causes problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17540) SparkR array serde cannot work correctly when array length == 0

2016-09-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15490790#comment-15490790
 ] 

Apache Spark commented on SPARK-17540:
--

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/15097

> SparkR array serde cannot work correctly when array length == 0
> ---
>
> Key: SPARK-17540
> URL: https://issues.apache.org/jira/browse/SPARK-17540
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SparkR
>Affects Versions: 2.1.0
>Reporter: Weichen Xu
>
> SparkR cannot handle array serde correctly when the array length == 0.
> When the length is 0, the R side sets the element type as class("somestring"), 
> so the Scala side receives it as a string array. But the array being 
> transferred may have another element type, which causes problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-17540) SparkR array serde cannot work correctly when array length == 0

2016-09-14 Thread Weichen Xu (JIRA)
Weichen Xu created SPARK-17540:
--

 Summary: SparkR array serde cannot work correctly when array 
length == 0
 Key: SPARK-17540
 URL: https://issues.apache.org/jira/browse/SPARK-17540
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SparkR
Affects Versions: 2.1.0
Reporter: Weichen Xu


SparkR cannot handle array serde correctly when the array length == 0.
When the length is 0, the R side sets the element type as class("somestring"), 
so the Scala side receives it as a string array. But the array being 
transferred may have another element type, which causes problems.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17535) Performance Improvement of Singleton pattern in SparkContext

2016-09-14 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15490737#comment-15490737
 ] 

Sean Owen commented on SPARK-17535:
---

I think this happens to work out because the context is held in an 
AtomicReference, and it's set as the very last statement in all paths. I am not 
sure it's worth changing and making this more complex unless there's a 
compelling reason though, like evidence this is heavily contended. For example, 
if later something else occurred in the synchronized block after the context 
was set, you'd be able to get into strange situations where the reference is 
available but the synchronized init hasn't completed.

> Performance Improvement of Singleton pattern in SparkContext
> 
>
> Key: SPARK-17535
> URL: https://issues.apache.org/jira/browse/SPARK-17535
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.0.0
>Reporter: WangJianfei
>  Labels: easyfix, performance
>
> I think the singleton pattern of SparkContext is inefficient if there are 
> many requests to get the SparkContext, so we can write the singleton pattern 
> as below. The second way is more efficient when there are many requests to 
> get the SparkContext.
> {code}
>   // the current version
>   def getOrCreate(): SparkContext = {
>     SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized {
>       if (activeContext.get() == null) {
>         setActiveContext(new SparkContext(), allowMultipleContexts = false)
>       }
>       activeContext.get()
>     }
>   }
>   // by myself (double-checked locking: only lock when no context exists yet)
>   def getOrCreate(): SparkContext = {
>     if (activeContext.get() == null) {
>       SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized {
>         if (activeContext.get() == null) {
>           setActiveContext(new SparkContext(), allowMultipleContexts = false)
>         }
>       }
>     }
>     activeContext.get()
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17538) sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0

2016-09-14 Thread Srinivas Rishindra Pothireddi (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Srinivas Rishindra Pothireddi updated SPARK-17538:
--
Description: 
I have a production job in spark 1.6.2 that registers four dataframes as 
tables. After testing the job in spark 2.0.0 one of the dataframes is not 
getting registered as a table.

output of sqlContext.tableNames() just after registering the fourth dataframe 
in spark 1.6.2 is

temp1,temp2,temp3,temp4

output of sqlContext.tableNames() just after registering the fourth dataframe 
in spark 2.0.0 is
temp1,temp2,temp3

so when the table 'temp4' is used by the job at a later stage an 
AnalysisException is raised in spark 2.0.0

There are no changes in the code whatsoever. 


 

 

  was:
I have a production job in spark 1.6.2 that registers four dataframes as 
tables. After testing the job in spark 2.0.0 one of the dataframes is not 
getting registered as a table.

output of sqlContext.tableNames() just after registering the fourth dataframe 
in spark 1.6.2 is

temp1,temp2,temp3,temp4

output of sqlContext.tableNames() just after registering the fourth dataframe 
in spark 2.0.0 is
temp1,temp2,temp3

so when the table temp4 is used by the job at a later stage an 
AnalysisException is raised in spark 2.0.0

There are no changes in the code whatsoever. 


 

 


> sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0
> -
>
> Key: SPARK-17538
> URL: https://issues.apache.org/jira/browse/SPARK-17538
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0, 2.0.1, 2.1.0
> Environment: os - linux
> cluster -> yarn and local
>Reporter: Srinivas Rishindra Pothireddi
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> I have a production job in spark 1.6.2 that registers four dataframes as 
> tables. After testing the job in spark 2.0.0 one of the dataframes is not 
> getting registered as a table.
> output of sqlContext.tableNames() just after registering the fourth dataframe 
> in spark 1.6.2 is
> temp1,temp2,temp3,temp4
> output of sqlContext.tableNames() just after registering the fourth dataframe 
> in spark 2.0.0 is
> temp1,temp2,temp3
> so when the table 'temp4' is used by the job at a later stage an 
> AnalysisException is raised in spark 2.0.0
> There are no changes in the code whatsoever. 
>  
>  
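For reference, a PySpark sketch of the call pattern being described (illustration only; the real job and data that trigger the missing table are not part of this report):
{code}
from pyspark.sql import SparkSession, SQLContext

spark = SparkSession.builder.appName("RegisterTablesCheck").getOrCreate()
sqlContext = SQLContext(spark.sparkContext)

# Register four small placeholder frames the same way the job does.
for i in range(1, 5):
    df = spark.createDataFrame([(i,)], ["value"])
    sqlContext.registerDataFrameAsTable(df, "temp%d" % i)

# All four names are expected here; the report says one is missing in 2.0.0.
print(sqlContext.tableNames())
{code}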



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams

2016-09-14 Thread Jeff Nadler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15490746#comment-15490746
 ] 

Jeff Nadler commented on SPARK-17510:
-

I filed SPARK-17539 for the backpressure bug. 

> Set Streaming MaxRate Independently For Multiple Streams
> 
>
> Key: SPARK-17510
> URL: https://issues.apache.org/jira/browse/SPARK-17510
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 2.0.0
>Reporter: Jeff Nadler
>
> We use multiple DStreams coming from different Kafka topics in a Streaming 
> application.
> Some settings like maxrate and backpressure enabled/disabled would be better 
> passed as config to KafkaUtils.createStream and 
> KafkaUtils.createDirectStream, instead of setting them in SparkConf.
> Being able to set a different maxrate for different streams is an important 
> requirement for us; we currently work around the problem by using one 
> receiver-based stream and one direct stream.   
> We would like to be able to turn on backpressure for only one of the streams 
> as well.   



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


