[jira] [Updated] (SPARK-17441) Issue Exceptions when ALTER TABLE RENAME PARTITION tries to alter a data source table
[ https://issues.apache.org/jira/browse/SPARK-17441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-17441: Assignee: Xiao Li > Issue Exceptions when ALTER TABLE RENAME PARTITION tries to alter a data > source table > - > > Key: SPARK-17441 > URL: https://issues.apache.org/jira/browse/SPARK-17441 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.1.0 > > > `ALTER TABLE RENAME PARTITION` is unable to handle data source tables, just > like the other `ALTER PARTITION` commands. We should issue an exception > instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17440) Issue Exception when ALTER TABLE commands try to alter a VIEW
[ https://issues.apache.org/jira/browse/SPARK-17440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-17440. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15004 [https://github.com/apache/spark/pull/15004] > Issue Exception when ALTER TABLE commands try to alter a VIEW > - > > Key: SPARK-17440 > URL: https://issues.apache.org/jira/browse/SPARK-17440 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.1.0 > > > For the following `ALTER TABLE` DDL, we should issue an exception when the > target table is a `VIEW`: > {code} > ALTER TABLE viewName SET LOCATION '/path/to/your/lovely/heart' > ALTER TABLE viewName SET SERDE 'whatever' > ALTER TABLE viewName SET SERDEPROPERTIES ('x' = 'y') > ALTER TABLE viewName PARTITION (a=1, b=2) SET SERDEPROPERTIES ('x' = 'y') > ALTER TABLE viewName ADD IF NOT EXISTS PARTITION (a='4', b='8') > ALTER TABLE viewName DROP IF EXISTS PARTITION (a='2') > ALTER TABLE viewName RECOVER PARTITIONS > ALTER TABLE viewName PARTITION (a='1', b='q') RENAME TO PARTITION (a='100', > b='p') > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17441) Issue Exceptions when ALTER TABLE RENAME PARTITION tries to alter a data source table
[ https://issues.apache.org/jira/browse/SPARK-17441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-17441. - Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15004 [https://github.com/apache/spark/pull/15004] > Issue Exceptions when ALTER TABLE RENAME PARTITION tries to alter a data > source table > - > > Key: SPARK-17441 > URL: https://issues.apache.org/jira/browse/SPARK-17441 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.1.0 > > > `ALTER TABLE RENAME PARTITION` is unable to handle data source tables, just > like the other `ALTER PARTITION` commands. We should issue an exception > instead. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17440) Issue Exception when ALTER TABLE commands try to alter a VIEW
[ https://issues.apache.org/jira/browse/SPARK-17440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-17440: Assignee: Xiao Li > Issue Exception when ALTER TABLE commands try to alter a VIEW > - > > Key: SPARK-17440 > URL: https://issues.apache.org/jira/browse/SPARK-17440 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xiao Li >Assignee: Xiao Li > > For the following `ALTER TABLE` DDL, we should issue an exception when the > target table is a `VIEW`: > {code} > ALTER TABLE viewName SET LOCATION '/path/to/your/lovely/heart' > ALTER TABLE viewName SET SERDE 'whatever' > ALTER TABLE viewName SET SERDEPROPERTIES ('x' = 'y') > ALTER TABLE viewName PARTITION (a=1, b=2) SET SERDEPROPERTIES ('x' = 'y') > ALTER TABLE viewName ADD IF NOT EXISTS PARTITION (a='4', b='8') > ALTER TABLE viewName DROP IF EXISTS PARTITION (a='2') > ALTER TABLE viewName RECOVER PARTITIONS > ALTER TABLE viewName PARTITION (a='1', b='q') RENAME TO PARTITION (a='100', > b='p') > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17543) Missing log4j config file for tests in common/network-shuffle
[ https://issues.apache.org/jira/browse/SPARK-17543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15492494#comment-15492494 ] Apache Spark commented on SPARK-17543: -- User 'jagadeesanas2' has created a pull request for this issue: https://github.com/apache/spark/pull/15108 > Missing log4j config file for tests in common/network-shuffle > - > > Key: SPARK-17543 > URL: https://issues.apache.org/jira/browse/SPARK-17543 > Project: Spark > Issue Type: Bug >Reporter: Frederick Reiss >Priority: Trivial > Labels: starter > > *This is a small starter task to help new contributors practice the pull > request and code review process.* > The Maven module {{common/network-shuffle}} does not have a log4j > configuration file for its test cases. Usually these configuration files are > located inside each module, in the directory {{src/test/resources}}. The > missing configuration file leads to a scary-looking but harmless series of > errors and stack traces in Spark build logs: > {noformat} > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > log4j:ERROR Could not read configuration file from URL > [file:src/test/resources/log4j.properties].
> java.io.FileNotFoundException: src/test/resources/log4j.properties (No such > file or directory) > at java.io.FileInputStream.open(Native Method) > at java.io.FileInputStream.(FileInputStream.java:146) > at java.io.FileInputStream.(FileInputStream.java:101) > at > sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90) > at > sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188) > at > org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557) > at > org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526) > at org.apache.log4j.LogManager.(LogManager.java:127) > at org.apache.log4j.Logger.getLogger(Logger.java:104) > at > io.netty.util.internal.logging.Log4JLoggerFactory.newInstance(Log4JLoggerFactory.java:29) > at > io.netty.util.internal.logging.InternalLoggerFactory.newDefaultFactory(InternalLoggerFactory.java:46) > at > io.netty.util.internal.logging.InternalLoggerFactory.(InternalLoggerFactory.java:34) > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
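[Editor's note] The starter task above amounts to adding a test-scoped log4j configuration file at {{common/network-shuffle/src/test/resources/log4j.properties}}. A minimal sketch, modeled on the log4j test configurations that other Spark modules carry (the exact appender settings and log file name here are an assumption, not the content of the merged pull request):

```properties
# Route all test logging to a file so test console output stays clean.
log4j.rootCategory=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.append=true
log4j.appender.file.file=target/unit-tests.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n
```

With a file like this on the test classpath, SLF4J finds a concrete binding configuration and the "Could not read configuration file" errors in the build logs disappear.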
[jira] [Assigned] (SPARK-17543) Missing log4j config file for tests in common/network-shuffle
[ https://issues.apache.org/jira/browse/SPARK-17543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17543: Assignee: (was: Apache Spark) > Missing log4j config file for tests in common/network-shuffle > - > > Key: SPARK-17543 > URL: https://issues.apache.org/jira/browse/SPARK-17543 > Project: Spark > Issue Type: Bug >Reporter: Frederick Reiss >Priority: Trivial > Labels: starter > > *This is a small starter task to help new contributors practice the pull > request and code review process.* > The Maven module {{common/network-shuffle}} does not have a log4j > configuration file for its test cases. Usually these configuration files are > located inside each module, in the directory {{src/test/resources}}. The > missing configuration file leads to a scary-looking but harmless series of > errors and stack traces in Spark build logs: > {noformat} > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > log4j:ERROR Could not read configuration file from URL > [file:src/test/resources/log4j.properties].
> java.io.FileNotFoundException: src/test/resources/log4j.properties (No such > file or directory) > at java.io.FileInputStream.open(Native Method) > at java.io.FileInputStream.(FileInputStream.java:146) > at java.io.FileInputStream.(FileInputStream.java:101) > at > sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90) > at > sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188) > at > org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557) > at > org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526) > at org.apache.log4j.LogManager.(LogManager.java:127) > at org.apache.log4j.Logger.getLogger(Logger.java:104) > at > io.netty.util.internal.logging.Log4JLoggerFactory.newInstance(Log4JLoggerFactory.java:29) > at > io.netty.util.internal.logging.InternalLoggerFactory.newDefaultFactory(InternalLoggerFactory.java:46) > at > io.netty.util.internal.logging.InternalLoggerFactory.(InternalLoggerFactory.java:34) > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17543) Missing log4j config file for tests in common/network-shuffle
[ https://issues.apache.org/jira/browse/SPARK-17543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17543: Assignee: Apache Spark > Missing log4j config file for tests in common/network-shuffle > - > > Key: SPARK-17543 > URL: https://issues.apache.org/jira/browse/SPARK-17543 > Project: Spark > Issue Type: Bug >Reporter: Frederick Reiss >Assignee: Apache Spark >Priority: Trivial > Labels: starter > > *This is a small starter task to help new contributors practice the pull > request and code review process.* > The Maven module {{common/network-shuffle}} does not have a log4j > configuration file for its test cases. Usually these configuration files are > located inside each module, in the directory {{src/test/resources}}. The > missing configuration file leads to a scary-looking but harmless series of > errors and stack traces in Spark build logs: > {noformat} > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > log4j:ERROR Could not read configuration file from URL > [file:src/test/resources/log4j.properties].
> java.io.FileNotFoundException: src/test/resources/log4j.properties (No such > file or directory) > at java.io.FileInputStream.open(Native Method) > at java.io.FileInputStream.(FileInputStream.java:146) > at java.io.FileInputStream.(FileInputStream.java:101) > at > sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90) > at > sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188) > at > org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557) > at > org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526) > at org.apache.log4j.LogManager.(LogManager.java:127) > at org.apache.log4j.Logger.getLogger(Logger.java:104) > at > io.netty.util.internal.logging.Log4JLoggerFactory.newInstance(Log4JLoggerFactory.java:29) > at > io.netty.util.internal.logging.InternalLoggerFactory.newDefaultFactory(InternalLoggerFactory.java:46) > at > io.netty.util.internal.logging.InternalLoggerFactory.(InternalLoggerFactory.java:34) > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17537) Improve performance for reading parquet schema
[ https://issues.apache.org/jira/browse/SPARK-17537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15492475#comment-15492475 ] Hyukjin Kwon commented on SPARK-17537: -- This should be a duplicate of SPARK-17071. > Improve performance for reading parquet schema > -- > > Key: SPARK-17537 > URL: https://issues.apache.org/jira/browse/SPARK-17537 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 1.6.2, 2.0.0 >Reporter: Yang Wang >Priority: Minor > > spark.read.parquet issues a Spark job to read the parquet schema. When > spark.sql.parquet.mergeSchema is set to false (the default value), there is > often only one file to read, so there is no need to issue a Spark job to do > it. In this case, we can read the schema from the driver directly instead of > issuing a Spark job. This could reduce the schema-inference latency from > several hundred milliseconds to around ten milliseconds in my environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17543) Missing log4j config file for tests in common/network-shuffle
[ https://issues.apache.org/jira/browse/SPARK-17543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15492406#comment-15492406 ] Jagadeesan A S commented on SPARK-17543: Started working on this. > Missing log4j config file for tests in common/network-shuffle > - > > Key: SPARK-17543 > URL: https://issues.apache.org/jira/browse/SPARK-17543 > Project: Spark > Issue Type: Bug >Reporter: Frederick Reiss >Priority: Trivial > Labels: starter > > *This is a small starter task to help new contributors practice the pull > request and code review process.* > The Maven module {{common/network-shuffle}} does not have a log4j > configuration file for its test cases. Usually these configuration files are > located inside each module, in the directory {{src/test/resources}}. The > missing configuration file leads to a scary-looking but harmless series of > errors and stack traces in Spark build logs: > {noformat} > SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder". > SLF4J: Defaulting to no-operation (NOP) logger implementation > SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further > details. > log4j:ERROR Could not read configuration file from URL > [file:src/test/resources/log4j.properties].
> java.io.FileNotFoundException: src/test/resources/log4j.properties (No such > file or directory) > at java.io.FileInputStream.open(Native Method) > at java.io.FileInputStream.(FileInputStream.java:146) > at java.io.FileInputStream.(FileInputStream.java:101) > at > sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90) > at > sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188) > at > org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557) > at > org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526) > at org.apache.log4j.LogManager.(LogManager.java:127) > at org.apache.log4j.Logger.getLogger(Logger.java:104) > at > io.netty.util.internal.logging.Log4JLoggerFactory.newInstance(Log4JLoggerFactory.java:29) > at > io.netty.util.internal.logging.InternalLoggerFactory.newDefaultFactory(InternalLoggerFactory.java:46) > at > io.netty.util.internal.logging.InternalLoggerFactory.(InternalLoggerFactory.java:34) > ... > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17551) support null ordering for DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-17551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17551: Assignee: Apache Spark > support null ordering for DataFrame API > --- > > Key: SPARK-17551 > URL: https://issues.apache.org/jira/browse/SPARK-17551 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xin Wu >Assignee: Apache Spark > > SPARK-10747 added support for NULLS FIRST | LAST in the ORDER BY clause for > the SQL interface. This JIRA is to complete this feature by adding the same > support for the DataFrame/Dataset APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17551) support null ordering for DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-17551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15492342#comment-15492342 ] Apache Spark commented on SPARK-17551: -- User 'xwu0226' has created a pull request for this issue: https://github.com/apache/spark/pull/15107 > support null ordering for DataFrame API > --- > > Key: SPARK-17551 > URL: https://issues.apache.org/jira/browse/SPARK-17551 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xin Wu > > SPARK-10747 added support for NULLS FIRST | LAST in the ORDER BY clause for > the SQL interface. This JIRA is to complete this feature by adding the same > support for the DataFrame/Dataset APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17551) support null ordering for DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-17551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17551: Assignee: (was: Apache Spark) > support null ordering for DataFrame API > --- > > Key: SPARK-17551 > URL: https://issues.apache.org/jira/browse/SPARK-17551 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Xin Wu > > SPARK-10747 added support for NULLS FIRST | LAST in the ORDER BY clause for > the SQL interface. This JIRA is to complete this feature by adding the same > support for the DataFrame/Dataset APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17551) support null ordering for DataFrame API
Xin Wu created SPARK-17551: -- Summary: support null ordering for DataFrame API Key: SPARK-17551 URL: https://issues.apache.org/jira/browse/SPARK-17551 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.1.0 Reporter: Xin Wu SPARK-10747 added support for NULLS FIRST | LAST in the ORDER BY clause for the SQL interface. This JIRA is to complete this feature by adding the same support for the DataFrame/Dataset APIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
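[Editor's note] For readers unfamiliar with NULLS FIRST | LAST, the ordering semantics being requested can be sketched outside Spark with {{java.util.Comparator}}. This is an illustration of the semantics only, not the DataFrame/Dataset API that the pull request adds; the class and method names below are hypothetical:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class NullOrdering {
    // Mirrors "ORDER BY col ASC NULLS FIRST": nulls sort before all values.
    static List<Integer> ascNullsFirst(List<Integer> xs) {
        return xs.stream()
                .sorted(Comparator.nullsFirst(Comparator.<Integer>naturalOrder()))
                .collect(Collectors.toList());
    }

    // Mirrors "ORDER BY col DESC NULLS LAST": nulls sort after all values.
    static List<Integer> descNullsLast(List<Integer> xs) {
        return xs.stream()
                .sorted(Comparator.nullsLast(Comparator.<Integer>naturalOrder().reversed()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(3, null, 1, 2);
        System.out.println(ascNullsFirst(data));  // [null, 1, 2, 3]
        System.out.println(descNullsLast(data));  // [3, 2, 1, null]
    }
}
```

The SQL side (SPARK-10747) already implements exactly these two placements; the DataFrame API work is about exposing the same choice on sort expressions.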
[jira] [Issue Comment Deleted] (SPARK-17535) Performance Improvement of Singleton pattern in SparkContext
[ https://issues.apache.org/jira/browse/SPARK-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] WangJianfei updated SPARK-17535: Comment: was deleted (was: First: this logic and code is simple too. Second: since JDK 1.5, volatile can avoid the situation you describe. We can't get into a strange state where the reference is visible but the synchronized init hasn't completed, because volatile tells the JVM not to reorder instructions around the init of the SparkContext.) > Performance Improvement of Singleton pattern in SparkContext > > > Key: SPARK-17535 > URL: https://issues.apache.org/jira/browse/SPARK-17535 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: WangJianfei > Labels: easyfix, performance > > I think the singleton pattern of SparkContext is inefficient if there are > many requests to get the SparkContext, > so we can write the singleton pattern as below. The second way is more > efficient when there are many requests to get the SparkContext. > {code} > // the current version > def getOrCreate(): SparkContext = { > SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized { > if (activeContext.get() == null) { > setActiveContext(new SparkContext(), allowMultipleContexts = false) > } > activeContext.get() > } > } > // by myself > def getOrCreate(): SparkContext = { > if (activeContext.get() == null) { > SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized { > if(activeContext.get == null) { > @volatile val sparkContext = new SparkContext() > setActiveContext(sparkContext, allowMultipleContexts = false) > } > } > activeContext.get() > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17535) Performance Improvement of Singleton pattern in SparkContext
[ https://issues.apache.org/jira/browse/SPARK-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15492084#comment-15492084 ] WangJianfei commented on SPARK-17535: - First: this logic and code is simple too. Second: since JDK 1.5, volatile can avoid the situation you describe. We can't get into a strange state where the reference is visible but the synchronized init hasn't completed, because volatile tells the JVM not to reorder instructions around the init of the SparkContext. > Performance Improvement of Singleton pattern in SparkContext > > > Key: SPARK-17535 > URL: https://issues.apache.org/jira/browse/SPARK-17535 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: WangJianfei > Labels: easyfix, performance > > I think the singleton pattern of SparkContext is inefficient if there are > many requests to get the SparkContext, > so we can write the singleton pattern as below. The second way is more > efficient when there are many requests to get the SparkContext. > {code} > // the current version > def getOrCreate(): SparkContext = { > SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized { > if (activeContext.get() == null) { > setActiveContext(new SparkContext(), allowMultipleContexts = false) > } > activeContext.get() > } > } > // by myself > def getOrCreate(): SparkContext = { > if (activeContext.get() == null) { > SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized { > if(activeContext.get == null) { > @volatile val sparkContext = new SparkContext() > setActiveContext(sparkContext, allowMultipleContexts = false) > } > } > activeContext.get() > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17535) Performance Improvement of Singleton pattern in SparkContext
[ https://issues.apache.org/jira/browse/SPARK-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15492046#comment-15492046 ] WangJianfei edited comment on SPARK-17535 at 9/15/16 2:10 AM: -- First: this logic and code is simple too. Second: since JDK 1.5, volatile can avoid the situation you describe. We can't get into a strange state where the reference is visible but the synchronized init hasn't completed, because volatile tells the JVM not to reorder instructions around the init of the SparkContext. was (Author: codlife): Since JDK 1.5, volatile can avoid the situation you describe. We can't get into a strange state where the reference is visible but the synchronized init hasn't completed, because volatile tells the JVM not to reorder instructions around the init of the SparkContext. > Performance Improvement of Singleton pattern in SparkContext > > > Key: SPARK-17535 > URL: https://issues.apache.org/jira/browse/SPARK-17535 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: WangJianfei > Labels: easyfix, performance > > I think the singleton pattern of SparkContext is inefficient if there are > many requests to get the SparkContext, > so we can write the singleton pattern as below. The second way is more > efficient when there are many requests to get the SparkContext.
> {code} > // the current version > def getOrCreate(): SparkContext = { > SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized { > if (activeContext.get() == null) { > setActiveContext(new SparkContext(), allowMultipleContexts = false) > } > activeContext.get() > } > } > // by myself > def getOrCreate(): SparkContext = { > if (activeContext.get() == null) { > SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized { > if(activeContext.get == null) { > @volatile val sparkContext = new SparkContext() > setActiveContext(sparkContext, allowMultipleContexts = false) > } > } > activeContext.get() > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-17535) Performance Improvement of Singleton pattern in SparkContext
[ https://issues.apache.org/jira/browse/SPARK-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] WangJianfei updated SPARK-17535: Comment: was deleted (was: Since JDK 1.5, volatile can avoid the situation you describe. We can't get into a strange state where the reference is visible but the synchronized init hasn't completed, because volatile tells the JVM not to reorder instructions around the init of the SparkContext.) > Performance Improvement of Singleton pattern in SparkContext > > > Key: SPARK-17535 > URL: https://issues.apache.org/jira/browse/SPARK-17535 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: WangJianfei > Labels: easyfix, performance > > I think the singleton pattern of SparkContext is inefficient if there are > many requests to get the SparkContext, > so we can write the singleton pattern as below. The second way is more > efficient when there are many requests to get the SparkContext. > {code} > // the current version > def getOrCreate(): SparkContext = { > SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized { > if (activeContext.get() == null) { > setActiveContext(new SparkContext(), allowMultipleContexts = false) > } > activeContext.get() > } > } > // by myself > def getOrCreate(): SparkContext = { > if (activeContext.get() == null) { > SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized { > if(activeContext.get == null) { > @volatile val sparkContext = new SparkContext() > setActiveContext(sparkContext, allowMultipleContexts = false) > } > } > activeContext.get() > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17535) Performance Improvement of Singleton pattern in SparkContext
[ https://issues.apache.org/jira/browse/SPARK-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15492046#comment-15492046 ] WangJianfei commented on SPARK-17535: - Since JDK 1.5, volatile can avoid the situation you describe. We can't get into a strange state where the reference is visible but the synchronized init hasn't completed, because volatile tells the JVM not to reorder instructions around the init of the SparkContext. > Performance Improvement of Singleton pattern in SparkContext > > > Key: SPARK-17535 > URL: https://issues.apache.org/jira/browse/SPARK-17535 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: WangJianfei > Labels: easyfix, performance > > I think the singleton pattern of SparkContext is inefficient if there are > many requests to get the SparkContext, > so we can write the singleton pattern as below. The second way is more > efficient when there are many requests to get the SparkContext. > {code} > // the current version > def getOrCreate(): SparkContext = { > SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized { > if (activeContext.get() == null) { > setActiveContext(new SparkContext(), allowMultipleContexts = false) > } > activeContext.get() > } > } > // by myself > def getOrCreate(): SparkContext = { > if (activeContext.get() == null) { > SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized { > if(activeContext.get == null) { > @volatile val sparkContext = new SparkContext() > setActiveContext(sparkContext, allowMultipleContexts = false) > } > } > activeContext.get() > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17535) Performance Improvement of Singleton pattern in SparkContext
[ https://issues.apache.org/jira/browse/SPARK-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15492047#comment-15492047 ] WangJianfei commented on SPARK-17535: - Since JDK 1.5, volatile can avoid the situation you describe. We can't get into a strange state where the reference is visible but the synchronized init hasn't completed, because volatile tells the JVM not to reorder instructions around the init of the SparkContext. > Performance Improvement of Singleton pattern in SparkContext > > > Key: SPARK-17535 > URL: https://issues.apache.org/jira/browse/SPARK-17535 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: WangJianfei > Labels: easyfix, performance > > I think the singleton pattern of SparkContext is inefficient if there are > many requests to get the SparkContext, > so we can write the singleton pattern as below. The second way is more > efficient when there are many requests to get the SparkContext. > {code} > // the current version > def getOrCreate(): SparkContext = { > SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized { > if (activeContext.get() == null) { > setActiveContext(new SparkContext(), allowMultipleContexts = false) > } > activeContext.get() > } > } > // by myself > def getOrCreate(): SparkContext = { > if (activeContext.get() == null) { > SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized { > if(activeContext.get == null) { > @volatile val sparkContext = new SparkContext() > setActiveContext(sparkContext, allowMultipleContexts = false) > } > } > activeContext.get() > } > } > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
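[Editor's note] One caution on the proposal above: in the proposed Scala snippet, {{@volatile}} annotates a local val, which does not by itself establish the happens-before edge the comment relies on; for double-checked locking to be safe under the Java Memory Model (JDK 1.5+), the shared field itself must be volatile. A minimal Java sketch of the standard idiom, where {{Holder}} is an illustrative stand-in for the SparkContext holder, not Spark's actual code:

```java
public class Holder {
    // The volatile field is what prevents publishing the reference before the
    // constructor's writes are visible to other threads (JMM, JDK 1.5+).
    private static volatile Holder instance;
    private static final Object LOCK = new Object();

    private Holder() { }

    public static Holder getOrCreate() {
        Holder h = instance;          // single volatile read on the fast path
        if (h == null) {
            synchronized (LOCK) {
                h = instance;         // re-check under the lock
                if (h == null) {
                    h = new Holder();
                    instance = h;     // volatile write publishes safely
                }
            }
        }
        return h;
    }
}
```

The fast path avoids acquiring the lock once the instance exists, which is the performance benefit the JIRA is after; the volatile field and the re-check under the lock are what keep it correct.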
[jira] [Commented] (SPARK-15573) Backwards-compatible persistence for spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491974#comment-15491974 ] yuhao yang commented on SPARK-15573: This sounds feasible. Two primary work items as I see it: 1. Find a standardized way to save models for different versions. 2. Ensure model-loading correctness for all versions (parameter values might change across versions). > Backwards-compatible persistence for spark.ml > - > > Key: SPARK-15573 > URL: https://issues.apache.org/jira/browse/SPARK-15573 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > This JIRA is for imposing backwards-compatible persistence for the > DataFrames-based API for MLlib. I.e., we want to be able to load models > saved in previous versions of Spark. We will not require loading models > saved in later versions of Spark. > This requires: > * Putting unit tests in place to check loading models from previous versions > * Notifying all committers active on MLlib to be aware of this requirement in > the future > The unit tests could be written as in spark.mllib, where we essentially > copied and pasted the save() code every time it changed. This happens > rarely, so it should be acceptable, though other designs are fine. > Subtasks of this JIRA should cover checking and adding tests for existing > cases, such as KMeansModel (whose format changed between 1.6 and 2.0). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17544) Timeout waiting for connection from pool, DataFrame Reader's not closing S3 connections?
[ https://issues.apache.org/jira/browse/SPARK-17544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491938#comment-15491938 ] Josh Rosen commented on SPARK-17544: I found a similar issue from the {{spark-avro}} repository: https://github.com/databricks/spark-avro/issues/156 > Timeout waiting for connection from pool, DataFrame Reader's not closing S3 > connections? > > > Key: SPARK-17544 > URL: https://issues.apache.org/jira/browse/SPARK-17544 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 > Environment: Amazon EMR, S3, Scala >Reporter: Brady Auen > Labels: newbie > > I have an application that loops through a text file to find files in S3 and > then reads them in, performs some ETL processes, and then writes them out. > This works for around 80 loops until I get this: > > 16/09/14 18:58:23 INFO S3NativeFileSystem: Opening > 's3://webpt-emr-us-west-2-hadoop/edw/master/fdbk_ideavote/20160907/part-m-0.avro' > for reading > 16/09/14 18:59:13 INFO AmazonHttpClient: Unable to execute HTTP request: > Timeout waiting for connection from pool > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException: > Timeout waiting for connection from pool > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195) > at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70) > at > 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.$Proxy37.getConnection(Unknown > Source) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:837) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991) > at > com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:212) > at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy36.retrieveMetadata(Unknown Source) > at > com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:780) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1428) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:313) > at > org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.resolve
[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491841#comment-15491841 ] Bryan Cutler commented on SPARK-17508: -- To respond to [~srowen] question, I think it's reasonable to have {{None}} act the same as not setting the param. I couldn't find any current params where this would cause a problem. > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir >Priority: Minor > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. 
> : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
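The behavior suggested in the comment above — treating {{None}} the same as never setting the param — boils down to filtering {{None}} values out of the keyword arguments before they are forwarded to the JVM-side setters. A tiny sketch of that filtering idea in plain Python (illustrative only, not PySpark's actual wrapper code):

```python
def effective_params(**kwargs):
    """Drop params whose value is None so they behave like 'never set'."""
    return {k: v for k, v in kwargs.items() if v is not None}
```

With this, `effective_params(maxIter=5, regParam=0.0, weightCol=None)` keeps `maxIter` and `regParam` (note that falsy-but-meaningful values like `0.0` survive, since the test is `is not None`) while `weightCol` is dropped, so the JVM side never sees a null column name.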
[jira] [Updated] (SPARK-17549) InMemoryRelation doesn't scale to large tables
[ https://issues.apache.org/jira/browse/SPARK-17549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-17549: --- Attachment: spark-2.0.patch Attaching a Spark 2 patch that silences the error (looks like a Janino bug and is in an area that should not affect functionality from what I understand). It'd be great if someone more familiar with this area could take a look at whether my changes are sane, before I go out and file a PR. > InMemoryRelation doesn't scale to large tables > -- > > Key: SPARK-17549 > URL: https://issues.apache.org/jira/browse/SPARK-17549 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Marcelo Vanzin > Attachments: create_parquet.scala, example_1.6_post_patch.png, > example_1.6_pre_patch.png, spark-1.6-2.patch, spark-1.6.patch, spark-2.0.patch > > > An {{InMemoryRelation}} is created when you cache a table; but if the table > is large, defined by either having a really large amount of columns, or a > really large amount of partitions (in the file split sense, not the "table > partition" sense), or both, it causes an immense amount of memory to be used > in the driver. > The reason is that it uses an accumulator to collect statistics about each > partition, and instead of summarizing the data in the driver, it keeps *all* > entries in memory. > I'm attaching a script I used to create a parquet file with 20,000 columns > and a single row, which I then copied 500 times so I'd have 500 partitions. > When doing the following: > {code} > sqlContext.read.parquet(...).count() > {code} > Everything works fine, both in Spark 1.6 and 2.0. (It's super slow with the > settings I used, but it works.) 
> I ran spark-shell like this: > {code} > ./bin/spark-shell --master 'local-cluster[4,1,4096]' --driver-memory 2g > --conf spark.executor.memory=2g > {code} > And ran: > {code} > sqlContext.read.parquet(...).cache().count() > {code} > You'll see the results in screenshot {{example_1.6_pre_patch.png}}. After 40 > partitions were processed, there were 40 GenericInternalRow objects with > 100,000 items each (5 stat info fields * 20,000 columns). So, memory usage > was: > {code} > 40 * 100000 * (4 * 20 + 24) = 416000000 =~ 400MB > {code} > (Note: Integer = 20 bytes, Long = 24 bytes.) > If I waited until the end, there would be 500 partitions, so ~ 5GB of memory > to hold the stats. > I'm also attaching a patch I made on top of 1.6 that uses just a long > accumulator to capture the table size; with that patch memory usage on the > driver doesn't keep growing. Also note in the patch that I'm multiplying the > column size by the row count, which I think is a different bug in the > existing code (those stats should be for the whole batch, not just a single > row, right?). I also added {{example_1.6_post_patch.png}} to show the > {{InMemoryRelation}} with the patch. > I also applied a very similar patch on top of Spark 2.0. But there things > blow up even more spectacularly when I try to run the count on the cached > table. It starts with this error: > {noformat} > 14:19:43 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, > vanzin-st1-3.gce.cloudera.com): java.util.concurrent.ExecutionException: > java.lang.Exception: failed to compile: java.lang.IndexOutOfBoundsException: > Index: 63235, Size: 1 > (lots of generated code here...) 
> Caused by: java.lang.IndexOutOfBoundsException: Index: 63235, Size: 1 > at java.util.ArrayList.rangeCheck(ArrayList.java:635) > at java.util.ArrayList.get(ArrayList.java:411) > at > org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556) > at > org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572) > at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513) > at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644) > at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623) > at org.codehaus.janino.util.ClassFile.(ClassFile.java:280) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:913) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:911) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:911) > at > org.apac
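The memory estimate in the description is easy to re-derive: 5 stat fields per column times 20,000 columns gives 100,000 entries per partition, at the per-entry byte cost the report assumes:

```python
# Reproducing the report's arithmetic for driver-side stat accumulation.
partitions_seen = 40
items_per_row = 5 * 20_000       # 5 stat info fields * 20,000 columns
bytes_per_item = 4 * 20 + 24     # report's costs: Integer = 20 B, Long = 24 B
usage = partitions_seen * items_per_row * bytes_per_item
print(usage)                     # 416000000 bytes, i.e. roughly 400 MB
# Extrapolated to all 500 partitions:
total = 500 * items_per_row * bytes_per_item
print(round(total / 2**30, 1))   # 4.8 GiB, matching the "~5 GB" estimate
```

The growth is linear in partitions because the accumulator keeps every partition's row of stats instead of folding them into a running summary — which is exactly the fix the attached patch makes.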
[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders
[ https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491819#comment-15491819 ] Michael Armbrust commented on SPARK-16407: -- I think it is likely that we will want to make it easy for users to instantiate Sources / Sinks programmatically. I have two concerns with merging this particular PR at this time: - The source/sink APIs are in a lot of flux. The fact that they are not labeled experimental in 2.0.0 was an unfortunate oversight that we'll correct in 2.0.1. For example, [~freiss] pointed out that it was impossible to know when to flush state from the source. Another example is that we'll probably have to make changes to allow for rate limiting. Until we have a better idea of what the final API will look like, I would not want to expose it in the end-user APIs like DataStreamWriter. - When we added the {{foreach}} API, it was a conscious choice to avoid exposing dataframes / batches to the user. The idea is we could adapt code to a non-microbatch model eventually if we wanted. This API would break that. It's not out of the question, but that's a larger shift. > Allow users to supply custom StreamSinkProviders > > > Key: SPARK-16407 > URL: https://issues.apache.org/jira/browse/SPARK-16407 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: holdenk > > The current DataStreamWriter allows users to specify a class name as format, > however it could be easier for people to directly pass in a specific provider > instance - e.g. for user equivalent of ForeachSink or other sink with > non-string parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17549) InMemoryRelation doesn't scale to large tables
[ https://issues.apache.org/jira/browse/SPARK-17549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-17549: --- Attachment: spark-1.6-2.patch Just noticed there's already a more accurate count of the batch size in the code, so uploading an updated patch. I'll try to figure out what's wrong in Spark 2 but not really familiar with that area of the code where the exceptions are coming from. > InMemoryRelation doesn't scale to large tables > -- > > Key: SPARK-17549 > URL: https://issues.apache.org/jira/browse/SPARK-17549 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Marcelo Vanzin > Attachments: create_parquet.scala, example_1.6_post_patch.png, > example_1.6_pre_patch.png, spark-1.6-2.patch, spark-1.6.patch > > > An {{InMemoryRelation}} is created when you cache a table; but if the table > is large, defined by either having a really large amount of columns, or a > really large amount of partitions (in the file split sense, not the "table > partition" sense), or both, it causes an immense amount of memory to be used > in the driver. > The reason is that it uses an accumulator to collect statistics about each > partition, and instead of summarizing the data in the driver, it keeps *all* > entries in memory. > I'm attaching a script I used to create a parquet file with 20,000 columns > and a single row, which I then copied 500 times so I'd have 500 partitions. > When doing the following: > {code} > sqlContext.read.parquet(...).count() > {code} > Everything works fine, both in Spark 1.6 and 2.0. (It's super slow with the > settings I used, but it works.) > I ran spark-shell like this: > {code} > ./bin/spark-shell --master 'local-cluster[4,1,4096]' --driver-memory 2g > --conf spark.executor.memory=2g > {code} > And ran: > {code} > sqlContext.read.parquet(...).cache().count() > {code} > You'll see the results in screenshot {{example_1.6_pre_patch.png}}. 
After 40 > partitions were processed, there were 40 GenericInternalRow objects with > 100,000 items each (5 stat info fields * 20,000 columns). So, memory usage > was: > {code} > 40 * 100000 * (4 * 20 + 24) = 416000000 =~ 400MB > {code} > (Note: Integer = 20 bytes, Long = 24 bytes.) > If I waited until the end, there would be 500 partitions, so ~ 5GB of memory > to hold the stats. > I'm also attaching a patch I made on top of 1.6 that uses just a long > accumulator to capture the table size; with that patch memory usage on the > driver doesn't keep growing. Also note in the patch that I'm multiplying the > column size by the row count, which I think is a different bug in the > existing code (those stats should be for the whole batch, not just a single > row, right?). I also added {{example_1.6_post_patch.png}} to show the > {{InMemoryRelation}} with the patch. > I also applied a very similar patch on top of Spark 2.0. But there things > blow up even more spectacularly when I try to run the count on the cached > table. It starts with this error: > {noformat} > 14:19:43 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, > vanzin-st1-3.gce.cloudera.com): java.util.concurrent.ExecutionException: > java.lang.Exception: failed to compile: java.lang.IndexOutOfBoundsException: > Index: 63235, Size: 1 > (lots of generated code here...) 
> Caused by: java.lang.IndexOutOfBoundsException: Index: 63235, Size: 1 > at java.util.ArrayList.rangeCheck(ArrayList.java:635) > at java.util.ArrayList.get(ArrayList.java:411) > at > org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556) > at > org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572) > at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513) > at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644) > at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623) > at org.codehaus.janino.util.ClassFile.(ClassFile.java:280) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:913) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:911) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:911) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$
[jira] [Created] (SPARK-17550) DataFrameWriter.partitionBy() should throw exception if column is not present in Dataframe
Aniket Kulkarni created SPARK-17550: --- Summary: DataFrameWriter.partitionBy() should throw exception if column is not present in Dataframe Key: SPARK-17550 URL: https://issues.apache.org/jira/browse/SPARK-17550 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Aniket Kulkarni Priority: Minor I have a Spark job which performs certain computations on event data and eventually persists it to Hive. I was trying to write to Hive using the code snippet shown below: dataframe.write.format("orc").partitionBy(col1,col2).options(options).mode(SaveMode.Append).saveAsTable(hiveTable) The write to Hive was not working, as col2 in the above example was not present in the DataFrame. It was a little tedious to debug this, as no exception or message showed up in the logs; I was constantly seeing executor lost failures in the logs and nothing more. I think an exception should be thrown when one tries to partition a Hive write on a column that does not exist. If this is indeed something that needs to be fixed, I would like to volunteer to fix this in the spark-core code base. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
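The proposed check could run before any job is launched: compare the requested partition columns against the DataFrame's schema and fail fast with an explicit message. A plain-Python sketch of that validation (the helper name and error message are illustrative, not Spark's API):

```python
def validate_partition_columns(df_columns, partition_columns):
    """Fail fast if any requested partition column is absent from the schema."""
    available = set(df_columns)
    missing = [c for c in partition_columns if c not in available]
    if missing:
        raise ValueError(
            f"partitionBy columns not found in DataFrame: {missing}; "
            f"available columns: {sorted(available)}"
        )
```

An eager error like this would surface the typo at the driver immediately, instead of the executor-lost failures described above.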
[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders
[ https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491787#comment-15491787 ] holdenk commented on SPARK-16407: - That's part of why I decided to just use the ForeachRDD sink as testing artifact and not expose it. > Allow users to supply custom StreamSinkProviders > > > Key: SPARK-16407 > URL: https://issues.apache.org/jira/browse/SPARK-16407 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: holdenk > > The current DataStreamWriter allows users to specify a class name as format, > however it could be easier for people to directly pass in a specific provider > instance - e.g. for user equivalent of ForeachSink or other sink with > non-string parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders
[ https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491784#comment-15491784 ] holdenk commented on SPARK-16407: - It's true it doesn't work in SQL, but I don't think the current stream writer interface works so well in SQL anyway. While it's true this doesn't yet expose a Python API, that's generally true for structured streaming. Once we do add a Python API, it's possible one could wrap Scala sinks that do custom callbacks to Python (similar to how the current Python streaming API works) or otherwise provide wrappers for JVM sinks, as is the general process for a lot of PySpark. I'm not advocating for this to replace the string-based API but to complement it, to allow more flexibility for users (as demonstrated in the provided test). > Allow users to supply custom StreamSinkProviders > > > Key: SPARK-16407 > URL: https://issues.apache.org/jira/browse/SPARK-16407 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: holdenk > > The current DataStreamWriter allows users to specify a class name as format, > however it could be easier for people to directly pass in a specific provider > instance - e.g. for user equivalent of ForeachSink or other sink with > non-string parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders
[ https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491779#comment-15491779 ] Reynold Xin commented on SPARK-16407: - Actually I spoke too soon. I only read the code change and didn't read the description on the pr. I think the high level idea to pass in a sink object makes sense, but I don't think the goal should be to support foreachRdd, because that goes directly against the spirit of one of the high level goals of structured streaming: decouple the logical plan from the physical execution, which includes not exposing batches to the end-user. > Allow users to supply custom StreamSinkProviders > > > Key: SPARK-16407 > URL: https://issues.apache.org/jira/browse/SPARK-16407 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: holdenk > > The current DataStreamWriter allows users to specify a class name as format, > however it could be easier for people to directly pass in a specific provider > instance - e.g. for user equivalent of ForeachSink or other sink with > non-string parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders
[ https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491772#comment-15491772 ] holdenk commented on SPARK-16407: - Right the simplest example where you need to use the typed API is with the provided ForeachSink that is used for testing. > Allow users to supply custom StreamSinkProviders > > > Key: SPARK-16407 > URL: https://issues.apache.org/jira/browse/SPARK-16407 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: holdenk > > The current DataStreamWriter allows users to specify a class name as format, > however it could be easier for people to directly pass in a specific provider > instance - e.g. for user equivalent of ForeachSink or other sink with > non-string parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16439) Incorrect information in SQL Query details
[ https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16439: Assignee: Apache Spark (was: Maciej BryĆski) > Incorrect information in SQL Query details > -- > > Key: SPARK-16439 > URL: https://issues.apache.org/jira/browse/SPARK-16439 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.0.0 >Reporter: Maciej BryĆski >Assignee: Apache Spark >Priority: Minor > Fix For: 2.0.0 > > Attachments: sample.png, spark.jpg > > > One picture is worth a thousand words. > Please see attachment > Incorrect values are in fields: > * data size > * number of output rows > * time to collect -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16439) Incorrect information in SQL Query details
[ https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16439: Assignee: Maciej BryĆski (was: Apache Spark) > Incorrect information in SQL Query details > -- > > Key: SPARK-16439 > URL: https://issues.apache.org/jira/browse/SPARK-16439 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.0.0 >Reporter: Maciej BryĆski >Assignee: Maciej BryĆski >Priority: Minor > Fix For: 2.0.0 > > Attachments: sample.png, spark.jpg > > > One picture is worth a thousand words. > Please see attachment > Incorrect values are in fields: > * data size > * number of output rows > * time to collect -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16439) Incorrect information in SQL Query details
[ https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491766#comment-15491766 ] Apache Spark commented on SPARK-16439: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/15106 > Incorrect information in SQL Query details > -- > > Key: SPARK-16439 > URL: https://issues.apache.org/jira/browse/SPARK-16439 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.0.0 >Reporter: Maciej BryĆski >Assignee: Maciej BryĆski >Priority: Minor > Fix For: 2.0.0 > > Attachments: sample.png, spark.jpg > > > One picture is worth a thousand words. > Please see attachment > Incorrect values are in fields: > * data size > * number of output rows > * time to collect -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17549) InMemoryRelation doesn't scale to large tables
[ https://issues.apache.org/jira/browse/SPARK-17549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin updated SPARK-17549: --- Attachment: spark-1.6.patch example_1.6_pre_patch.png example_1.6_post_patch.png create_parquet.scala > InMemoryRelation doesn't scale to large tables > -- > > Key: SPARK-17549 > URL: https://issues.apache.org/jira/browse/SPARK-17549 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.6.0, 2.0.0 >Reporter: Marcelo Vanzin > Attachments: create_parquet.scala, example_1.6_post_patch.png, > example_1.6_pre_patch.png, spark-1.6.patch > > > An {{InMemoryRelation}} is created when you cache a table; but if the table > is large, defined by either having a really large amount of columns, or a > really large amount of partitions (in the file split sense, not the "table > partition" sense), or both, it causes an immense amount of memory to be used > in the driver. > The reason is that it uses an accumulator to collect statistics about each > partition, and instead of summarizing the data in the driver, it keeps *all* > entries in memory. > I'm attaching a script I used to create a parquet file with 20,000 columns > and a single row, which I then copied 500 times so I'd have 500 partitions. > When doing the following: > {code} > sqlContext.read.parquet(...).count() > {code} > Everything works fine, both in Spark 1.6 and 2.0. (It's super slow with the > settings I used, but it works.) > I ran spark-shell like this: > {code} > ./bin/spark-shell --master 'local-cluster[4,1,4096]' --driver-memory 2g > --conf spark.executor.memory=2g > {code} > And ran: > {code} > sqlContext.read.parquet(...).cache().count() > {code} > You'll see the results in screenshot {{example_1.6_pre_patch.png}}. After 40 > partitions were processed, there were 40 GenericInternalRow objects with > 100,000 items each (5 stat info fields * 20,000 columns). 
So, memory usage > was: > {code} > 40 * 100000 * (4 * 20 + 24) = 416000000 =~ 400MB > {code} > (Note: Integer = 20 bytes, Long = 24 bytes.) > If I waited until the end, there would be 500 partitions, so ~ 5GB of memory > to hold the stats. > I'm also attaching a patch I made on top of 1.6 that uses just a long > accumulator to capture the table size; with that patch memory usage on the > driver doesn't keep growing. Also note in the patch that I'm multiplying the > column size by the row count, which I think is a different bug in the > existing code (those stats should be for the whole batch, not just a single > row, right?). I also added {{example_1.6_post_patch.png}} to show the > {{InMemoryRelation}} with the patch. > I also applied a very similar patch on top of Spark 2.0. But there things > blow up even more spectacularly when I try to run the count on the cached > table. It starts with this error: > {noformat} > 14:19:43 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, > vanzin-st1-3.gce.cloudera.com): java.util.concurrent.ExecutionException: > java.lang.Exception: failed to compile: java.lang.IndexOutOfBoundsException: > Index: 63235, Size: 1 > (lots of generated code here...) 
> Caused by: java.lang.IndexOutOfBoundsException: Index: 63235, Size: 1 > at java.util.ArrayList.rangeCheck(ArrayList.java:635) > at java.util.ArrayList.get(ArrayList.java:411) > at > org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556) > at > org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572) > at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513) > at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644) > at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623) > at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:280) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:913) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:911) > at scala.collection.Iterator$class.foreach(Iterator.scala:893) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at scala.collection.AbstractIterable.foreach(Iterable.scala:54) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:911) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:883) > ... 54 more > {noformat} > And basically a lot of that going on making the output unreadable, so I just > killed the shell. > Anyway, I believe the same fix should work there, but I can't be sure because > the test doesn't work for different reasons, it seems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders
[ https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491752#comment-15491752 ] Reynold Xin commented on SPARK-16407: - This doesn't work in SQL, Python, etc. I like the general idea of providing a way to make the complex things possible, but we shouldn't make this the default way and allow developers to use this as an excuse to build APIs that are not compatible across languages. > Allow users to supply custom StreamSinkProviders > > > Key: SPARK-16407 > URL: https://issues.apache.org/jira/browse/SPARK-16407 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: holdenk > > The current DataStreamWriter allows users to specify a class name as format, > however it could be easier for people to directly pass in a specific provider > instance - e.g. for user equivalent of ForeachSink or other sink with > non-string parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-16439) Incorrect information in SQL Query details
[ https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reopened SPARK-16439: We could bring the separator back for better readability. > Incorrect information in SQL Query details > -- > > Key: SPARK-16439 > URL: https://issues.apache.org/jira/browse/SPARK-16439 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Assignee: Maciej Bryński >Priority: Minor > Fix For: 2.0.0 > > Attachments: sample.png, spark.jpg > > > One picture is worth a thousand words. > Please see attachment > Incorrect values are in fields: > * data size > * number of output rows > * time to collect -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17549) InMemoryRelation doesn't scale to large tables
Marcelo Vanzin created SPARK-17549: -- Summary: InMemoryRelation doesn't scale to large tables Key: SPARK-17549 URL: https://issues.apache.org/jira/browse/SPARK-17549 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0, 1.6.0 Reporter: Marcelo Vanzin An {{InMemoryRelation}} is created when you cache a table; but if the table is large, defined by either having a really large amount of columns, or a really large amount of partitions (in the file split sense, not the "table partition" sense), or both, it causes an immense amount of memory to be used in the driver. The reason is that it uses an accumulator to collect statistics about each partition, and instead of summarizing the data in the driver, it keeps *all* entries in memory. I'm attaching a script I used to create a parquet file with 20,000 columns and a single row, which I then copied 500 times so I'd have 500 partitions. When doing the following: {code} sqlContext.read.parquet(...).count() {code} Everything works fine, both in Spark 1.6 and 2.0. (It's super slow with the settings I used, but it works.) I ran spark-shell like this: {code} ./bin/spark-shell --master 'local-cluster[4,1,4096]' --driver-memory 2g --conf spark.executor.memory=2g {code} And ran: {code} sqlContext.read.parquet(...).cache().count() {code} You'll see the results in screenshot {{example_1.6_pre_patch.png}}. After 40 partitions were processed, there were 40 GenericInternalRow objects with 100,000 items each (5 stat info fields * 20,000 columns). So, memory usage was: {code} 40 * 100000 * (4 * 20 + 24) = 416000000 =~ 400MB {code} (Note: Integer = 20 bytes, Long = 24 bytes.) If I waited until the end, there would be 500 partitions, so ~ 5GB of memory to hold the stats. I'm also attaching a patch I made on top of 1.6 that uses just a long accumulator to capture the table size; with that patch memory usage on the driver doesn't keep growing. 
Also note in the patch that I'm multiplying the column size by the row count, which I think is a different bug in the existing code (those stats should be for the whole batch, not just a single row, right?). I also added {{example_1.6_post_patch.png}} to show the {{InMemoryRelation}} with the patch. I also applied a very similar patch on top of Spark 2.0. But there things blow up even more spectacularly when I try to run the count on the cached table. It starts with this error: {noformat} 14:19:43 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 1.0 (TID 2, vanzin-st1-3.gce.cloudera.com): java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: java.lang.IndexOutOfBoundsException: Index: 63235, Size: 1 (lots of generated code here...) Caused by: java.lang.IndexOutOfBoundsException: Index: 63235, Size: 1 at java.util.ArrayList.rangeCheck(ArrayList.java:635) at java.util.ArrayList.get(ArrayList.java:411) at org.codehaus.janino.util.ClassFile.getConstantPoolInfo(ClassFile.java:556) at org.codehaus.janino.util.ClassFile.getConstantUtf8(ClassFile.java:572) at org.codehaus.janino.util.ClassFile.loadAttribute(ClassFile.java:1513) at org.codehaus.janino.util.ClassFile.loadAttributes(ClassFile.java:644) at org.codehaus.janino.util.ClassFile.loadFields(ClassFile.java:623) at org.codehaus.janino.util.ClassFile.<init>(ClassFile.java:280) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:913) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$recordCompilationStats$1.apply(CodeGenerator.scala:911) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) at scala.collection.AbstractIterable.foreach(Iterable.scala:54) at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.recordCompilationStats(CodeGenerator.scala:911) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:883) ... 54 more {noformat} And basically a lot of that going on making the output unreadable, so I just killed the shell. Anyway, I believe the same fix should work there, but I can't be sure because the test doesn't work for different reasons, it seems. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
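The driver-memory figures in the report above can be sanity-checked with a quick back-of-the-envelope script. This mirrors the report's own arithmetic (5 stat fields x 20,000 columns = 100,000 entries per partition, costed at 4 boxed Integers of 20 bytes plus one boxed Long of 24 bytes); the constant names are illustrative, not Spark internals.

```python
# Rough estimate of driver memory held by accumulated per-partition stats,
# using the issue's assumptions: 100,000 stat entries per partition,
# each costed at 4 * 20 + 24 = 104 bytes of boxed values.
ENTRIES_PER_PARTITION = 5 * 20_000   # 5 stat fields * 20,000 columns
BYTES_PER_ENTRY = 4 * 20 + 24        # 104 bytes

def stats_bytes(partitions: int) -> int:
    """Bytes retained on the driver after `partitions` partitions cached."""
    return partitions * ENTRIES_PER_PARTITION * BYTES_PER_ENTRY

print(stats_bytes(40))   # 416000000 bytes, roughly 400 MB
print(stats_bytes(500))  # 5200000000 bytes, roughly 5 GB
```

This reproduces both numbers in the report: ~400 MB after 40 partitions, and ~5 GB had all 500 partitions been processed.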
[jira] [Commented] (SPARK-16439) Incorrect information in SQL Query details
[ https://issues.apache.org/jira/browse/SPARK-16439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491744#comment-15491744 ] Davies Liu commented on SPARK-16439: The separator was added on purpose, otherwise it's very difficult to read the numbers (especially the number of rows), we need to count the number of digits to realize how large the number is. I think we should still keep that and fix the locale issue (using English should be enough), I will send a PR to add it back. > Incorrect information in SQL Query details > -- > > Key: SPARK-16439 > URL: https://issues.apache.org/jira/browse/SPARK-16439 > Project: Spark > Issue Type: Bug > Components: SQL, Web UI >Affects Versions: 2.0.0 >Reporter: Maciej Bryński >Assignee: Maciej Bryński >Priority: Minor > Fix For: 2.0.0 > > Attachments: sample.png, spark.jpg > > > One picture is worth a thousand words. > Please see attachment > Incorrect values are in fields: > * data size > * number of output rows > * time to collect -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
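The "fix the locale, keep the separator" idea above can be sketched in a locale-independent way; for example, Python's `","` format specifier always groups digits with commas regardless of the platform locale (Spark's actual fix lives in Scala UI code; this is just an illustration of the approach).

```python
# Thousands separators for UI metrics without consulting the platform
# locale: the "," format spec always groups with commas, matching the
# "using English should be enough" suggestion.
def format_metric(n: int) -> str:
    """Render a large count with comma grouping, locale-independent."""
    return f"{n:,}"

print(format_metric(1234567))    # 1,234,567
print(format_metric(987654321))  # 987,654,321
```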
[jira] [Commented] (SPARK-16407) Allow users to supply custom StreamSinkProviders
[ https://issues.apache.org/jira/browse/SPARK-16407?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491739#comment-15491739 ] Shixiong Zhu commented on SPARK-16407: -- Right now we don't want to add such typed API to DataStreamWriter as it's not easy to port to other languages. Could you give us an example that you must use a typed API? Also cc [~rxin] [~marmbrus] > Allow users to supply custom StreamSinkProviders > > > Key: SPARK-16407 > URL: https://issues.apache.org/jira/browse/SPARK-16407 > Project: Spark > Issue Type: Improvement > Components: Streaming >Reporter: holdenk > > The current DataStreamWriter allows users to specify a class name as format, > however it could be easier for people to directly pass in a specific provider > instance - e.g. for user equivalent of ForeachSink or other sink with > non-string parameters. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17548) Word2VecModel.findSynonyms can spuriously reject the best match when invoked with a vector
[ https://issues.apache.org/jira/browse/SPARK-17548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17548: Assignee: (was: Apache Spark) > Word2VecModel.findSynonyms can spuriously reject the best match when invoked > with a vector > -- > > Key: SPARK-17548 > URL: https://issues.apache.org/jira/browse/SPARK-17548 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.4.1, 1.5.2, 1.6.2, 2.0.0 > Environment: any >Reporter: William Benton >Priority: Minor > > The `findSynonyms` method in `Word2VecModel` currently rejects the best match > a priori. When `findSynonyms` is invoked with a word, the best match is > almost certain to be that word, but `findSynonyms` can also be invoked with a > vector, which might not correspond to any of the words in the model's > vocabulary. In the latter case, rejecting the best match is spurious. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17548) Word2VecModel.findSynonyms can spuriously reject the best match when invoked with a vector
[ https://issues.apache.org/jira/browse/SPARK-17548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491626#comment-15491626 ] Apache Spark commented on SPARK-17548: -- User 'willb' has created a pull request for this issue: https://github.com/apache/spark/pull/15105 > Word2VecModel.findSynonyms can spuriously reject the best match when invoked > with a vector > -- > > Key: SPARK-17548 > URL: https://issues.apache.org/jira/browse/SPARK-17548 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.4.1, 1.5.2, 1.6.2, 2.0.0 > Environment: any >Reporter: William Benton >Priority: Minor > > The `findSynonyms` method in `Word2VecModel` currently rejects the best match > a priori. When `findSynonyms` is invoked with a word, the best match is > almost certain to be that word, but `findSynonyms` can also be invoked with a > vector, which might not correspond to any of the words in the model's > vocabulary. In the latter case, rejecting the best match is spurious. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17548) Word2VecModel.findSynonyms can spuriously reject the best match when invoked with a vector
[ https://issues.apache.org/jira/browse/SPARK-17548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17548: Assignee: Apache Spark > Word2VecModel.findSynonyms can spuriously reject the best match when invoked > with a vector > -- > > Key: SPARK-17548 > URL: https://issues.apache.org/jira/browse/SPARK-17548 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.4.1, 1.5.2, 1.6.2, 2.0.0 > Environment: any >Reporter: William Benton >Assignee: Apache Spark >Priority: Minor > > The `findSynonyms` method in `Word2VecModel` currently rejects the best match > a priori. When `findSynonyms` is invoked with a word, the best match is > almost certain to be that word, but `findSynonyms` can also be invoked with a > vector, which might not correspond to any of the words in the model's > vocabulary. In the latter case, rejecting the best match is spurious. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17548) Word2VecModel.findSynonyms can spuriously reject the best match when invoked with a vector
William Benton created SPARK-17548: -- Summary: Word2VecModel.findSynonyms can spuriously reject the best match when invoked with a vector Key: SPARK-17548 URL: https://issues.apache.org/jira/browse/SPARK-17548 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 2.0.0, 1.6.2, 1.5.2, 1.4.1 Environment: any Reporter: William Benton Priority: Minor The `findSynonyms` method in `Word2VecModel` currently rejects the best match a priori. When `findSynonyms` is invoked with a word, the best match is almost certain to be that word, but `findSynonyms` can also be invoked with a vector, which might not correspond to any of the words in the model's vocabulary. In the latter case, rejecting the best match is spurious. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
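The behavior described above can be illustrated with a toy nearest-neighbor ranking. The vocabulary, vectors, and function name below are made up for illustration (this is not `Word2VecModel`'s real data or API); `skip_first` mimics the a-priori drop of the best match, which is only justified when the query is itself a vocabulary word.

```python
import math

# Toy 2-D word vectors (illustrative values, not a trained model).
vocab = {
    "king":  (1.0, 0.1),
    "queen": (0.9, 0.3),
    "apple": (0.0, 1.0),
}

def cosine(a, b):
    """Cosine similarity of two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def find_synonyms(query, num, skip_first):
    """Rank vocabulary words by similarity to `query`; `skip_first`
    mimics the buggy unconditional drop of the top-ranked word."""
    ranked = sorted(vocab, key=lambda w: cosine(vocab[w], query), reverse=True)
    return ranked[1:num + 1] if skip_first else ranked[:num]

query = (1.0, 0.0)  # an arbitrary vector, not a vocabulary word
print(find_synonyms(query, 1, skip_first=True))   # spuriously drops 'king'
print(find_synonyms(query, 1, skip_first=False))  # returns the true best match
```

With a word query, the top hit is (almost always) the word itself, so dropping it is harmless; with a vector query it silently discards the genuinely best synonym.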
[jira] [Assigned] (SPARK-17547) Temporary shuffle data files may be leaked following exception in write
[ https://issues.apache.org/jira/browse/SPARK-17547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17547: Assignee: Josh Rosen (was: Apache Spark) > Temporary shuffle data files may be leaked following exception in write > --- > > Key: SPARK-17547 > URL: https://issues.apache.org/jira/browse/SPARK-17547 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.5.3, 1.6.0, 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > SPARK-8029 modified shuffle writers to first stage their data to a temporary > file in the same directory as the final destination file and then to > atomically rename the file at the end of the write job. However, this change > introduced the potential for the temporary output file to be leaked if an > exception occurs during the write because the shuffle writers' existing error > cleanup code doesn't handle this new temp file. > This is easy to fix: we just need to add a {{finally}} block to ensure that > the temporary file is guaranteed to be either moved or deleted before > exiting the shuffle write method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17547) Temporary shuffle data files may be leaked following exception in write
[ https://issues.apache.org/jira/browse/SPARK-17547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491548#comment-15491548 ] Apache Spark commented on SPARK-17547: -- User 'JoshRosen' has created a pull request for this issue: https://github.com/apache/spark/pull/15104 > Temporary shuffle data files may be leaked following exception in write > --- > > Key: SPARK-17547 > URL: https://issues.apache.org/jira/browse/SPARK-17547 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.5.3, 1.6.0, 2.0.0 >Reporter: Josh Rosen >Assignee: Josh Rosen > > SPARK-8029 modified shuffle writers to first stage their data to a temporary > file in the same directory as the final destination file and then to > atomically rename the file at the end of the write job. However, this change > introduced the potential for the temporary output file to be leaked if an > exception occurs during the write because the shuffle writers' existing error > cleanup code doesn't handle this new temp file. > This is easy to fix: we just need to add a {{finally}} block to ensure that > the temporary file is guaranteed to be either moved or deleted before > exiting the shuffle write method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17547) Temporary shuffle data files may be leaked following exception in write
[ https://issues.apache.org/jira/browse/SPARK-17547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17547: Assignee: Apache Spark (was: Josh Rosen) > Temporary shuffle data files may be leaked following exception in write > --- > > Key: SPARK-17547 > URL: https://issues.apache.org/jira/browse/SPARK-17547 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.5.3, 1.6.0, 2.0.0 >Reporter: Josh Rosen >Assignee: Apache Spark > > SPARK-8029 modified shuffle writers to first stage their data to a temporary > file in the same directory as the final destination file and then to > atomically rename the file at the end of the write job. However, this change > introduced the potential for the temporary output file to be leaked if an > exception occurs during the write because the shuffle writers' existing error > cleanup code doesn't handle this new temp file. > This is easy to fix: we just need to add a {{finally}} block to ensure that > the temporary file is guaranteed to be either moved or deleted before > exiting the shuffle write method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17547) Temporary shuffle data files may be leaked following exception in write
Josh Rosen created SPARK-17547: -- Summary: Temporary shuffle data files may be leaked following exception in write Key: SPARK-17547 URL: https://issues.apache.org/jira/browse/SPARK-17547 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 2.0.0, 1.6.0, 1.5.3 Reporter: Josh Rosen Assignee: Josh Rosen SPARK-8029 modified shuffle writers to first stage their data to a temporary file in the same directory as the final destination file and then to atomically rename the file at the end of the write job. However, this change introduced the potential for the temporary output file to be leaked if an exception occurs during the write because the shuffle writers' existing error cleanup code doesn't handle this new temp file. This is easy to fix: we just need to add a {{finally}} block to ensure that the temporary file is guaranteed to be either moved or deleted before exiting the shuffle write method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
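The stage-then-rename pattern and its {{finally}} fix can be sketched in a few lines (a generic illustration of the technique, not Spark's actual shuffle writer; the function name is made up):

```python
import os
import tempfile

def write_shuffle_output(dest_path: str, data: bytes) -> None:
    """Stage data in a temp file, then atomically rename it into place.

    The finally block guarantees the temp file is either renamed or
    removed before the function exits, so a failed write never leaks it.
    """
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(dest_path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        os.rename(tmp_path, dest_path)  # atomic within one filesystem
    finally:
        if os.path.exists(tmp_path):    # write or rename failed: clean up
            os.remove(tmp_path)
```

Without the `finally`, an exception between `mkstemp` and `rename` leaves the staged file behind, which is exactly the leak described above.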
[jira] [Commented] (SPARK-17100) pyspark filter on a udf column after join gives java.lang.UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SPARK-17100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491518#comment-15491518 ] Apache Spark commented on SPARK-17100: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/15103 > pyspark filter on a udf column after join gives > java.lang.UnsupportedOperationException > --- > > Key: SPARK-17100 > URL: https://issues.apache.org/jira/browse/SPARK-17100 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 > Environment: spark-2.0.0-bin-hadoop2.7. Python2 and Python3. >Reporter: Tim Sell >Assignee: Davies Liu > Attachments: bug.py, test_bug.py > > > In pyspark, when filtering on a udf derived column after some join types, > the optimized logical plan results in a > java.lang.UnsupportedOperationException. > I could not replicate this in scala code from the shell, just python. It is a > pyspark regression from spark 1.6.2. > This can be replicated with: bin/spark-submit bug.py > {code:python:title=bug.py} > import pyspark.sql.functions as F > from pyspark.sql import Row, SparkSession > if __name__ == '__main__': > spark = SparkSession.builder.appName("test").getOrCreate() > left = spark.createDataFrame([Row(a=1)]) > right = spark.createDataFrame([Row(a=1)]) > df = left.join(right, on='a', how='left_outer') > df = df.withColumn('b', F.udf(lambda x: 'x')(df.a)) > df = df.filter('b = "x"') > df.explain(extended=True) > {code} > The output is: > {code} > == Parsed Logical Plan == > 'Filter ('b = x) > +- Project [a#0L, <lambda>(a#0L) AS b#8] >+- Project [a#0L] > +- Join LeftOuter, (a#0L = a#3L) > :- LogicalRDD [a#0L] > +- LogicalRDD [a#3L] > == Analyzed Logical Plan == > a: bigint, b: string > Filter (b#8 = x) > +- Project [a#0L, <lambda>(a#0L) AS b#8] >+- Project [a#0L] > +- Join LeftOuter, (a#0L = a#3L) > :- LogicalRDD [a#0L] > +- LogicalRDD [a#3L] > == Optimized Logical Plan == > java.lang.UnsupportedOperationException: Cannot 
evaluate expression: > <lambda>(input[0, bigint, true]) > == Physical Plan == > java.lang.UnsupportedOperationException: Cannot evaluate expression: > <lambda>(input[0, bigint, true]) > {code} > It fails when the join is: > * how='outer', on=column expression > * how='left_outer', on=string or column expression > * how='right_outer', on=string or column expression > It passes when the join is: > * how='inner', on=string or column expression > * how='outer', on=string > I made some tests to demonstrate each of these. > Run with bin/spark-submit test_bug.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17100) pyspark filter on a udf column after join gives java.lang.UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SPARK-17100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17100: Assignee: Davies Liu (was: Apache Spark) > pyspark filter on a udf column after join gives > java.lang.UnsupportedOperationException > --- > > Key: SPARK-17100 > URL: https://issues.apache.org/jira/browse/SPARK-17100 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 > Environment: spark-2.0.0-bin-hadoop2.7. Python2 and Python3. >Reporter: Tim Sell >Assignee: Davies Liu > Attachments: bug.py, test_bug.py > > > In pyspark, when filtering on a udf derived column after some join types, > the optimized logical plan results in a > java.lang.UnsupportedOperationException. > I could not replicate this in scala code from the shell, just python. It is a > pyspark regression from spark 1.6.2. > This can be replicated with: bin/spark-submit bug.py > {code:python:title=bug.py} > import pyspark.sql.functions as F > from pyspark.sql import Row, SparkSession > if __name__ == '__main__': > spark = SparkSession.builder.appName("test").getOrCreate() > left = spark.createDataFrame([Row(a=1)]) > right = spark.createDataFrame([Row(a=1)]) > df = left.join(right, on='a', how='left_outer') > df = df.withColumn('b', F.udf(lambda x: 'x')(df.a)) > df = df.filter('b = "x"') > df.explain(extended=True) > {code} > The output is: > {code} > == Parsed Logical Plan == > 'Filter ('b = x) > +- Project [a#0L, <lambda>(a#0L) AS b#8] >+- Project [a#0L] > +- Join LeftOuter, (a#0L = a#3L) > :- LogicalRDD [a#0L] > +- LogicalRDD [a#3L] > == Analyzed Logical Plan == > a: bigint, b: string > Filter (b#8 = x) > +- Project [a#0L, <lambda>(a#0L) AS b#8] >+- Project [a#0L] > +- Join LeftOuter, (a#0L = a#3L) > :- LogicalRDD [a#0L] > +- LogicalRDD [a#3L] > == Optimized Logical Plan == > java.lang.UnsupportedOperationException: Cannot evaluate expression: > <lambda>(input[0, bigint, true]) > == Physical Plan == > java.lang.UnsupportedOperationException: 
Cannot evaluate expression: > <lambda>(input[0, bigint, true]) > {code} > It fails when the join is: > * how='outer', on=column expression > * how='left_outer', on=string or column expression > * how='right_outer', on=string or column expression > It passes when the join is: > * how='inner', on=string or column expression > * how='outer', on=string > I made some tests to demonstrate each of these. > Run with bin/spark-submit test_bug.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17100) pyspark filter on a udf column after join gives java.lang.UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SPARK-17100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17100: Assignee: Apache Spark (was: Davies Liu) > pyspark filter on a udf column after join gives > java.lang.UnsupportedOperationException > --- > > Key: SPARK-17100 > URL: https://issues.apache.org/jira/browse/SPARK-17100 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 > Environment: spark-2.0.0-bin-hadoop2.7. Python2 and Python3. >Reporter: Tim Sell >Assignee: Apache Spark > Attachments: bug.py, test_bug.py > > > In pyspark, when filtering on a udf derived column after some join types, > the optimized logical plan results in a > java.lang.UnsupportedOperationException. > I could not replicate this in scala code from the shell, just python. It is a > pyspark regression from spark 1.6.2. > This can be replicated with: bin/spark-submit bug.py > {code:python:title=bug.py} > import pyspark.sql.functions as F > from pyspark.sql import Row, SparkSession > if __name__ == '__main__': > spark = SparkSession.builder.appName("test").getOrCreate() > left = spark.createDataFrame([Row(a=1)]) > right = spark.createDataFrame([Row(a=1)]) > df = left.join(right, on='a', how='left_outer') > df = df.withColumn('b', F.udf(lambda x: 'x')(df.a)) > df = df.filter('b = "x"') > df.explain(extended=True) > {code} > The output is: > {code} > == Parsed Logical Plan == > 'Filter ('b = x) > +- Project [a#0L, <lambda>(a#0L) AS b#8] >+- Project [a#0L] > +- Join LeftOuter, (a#0L = a#3L) > :- LogicalRDD [a#0L] > +- LogicalRDD [a#3L] > == Analyzed Logical Plan == > a: bigint, b: string > Filter (b#8 = x) > +- Project [a#0L, <lambda>(a#0L) AS b#8] >+- Project [a#0L] > +- Join LeftOuter, (a#0L = a#3L) > :- LogicalRDD [a#0L] > +- LogicalRDD [a#3L] > == Optimized Logical Plan == > java.lang.UnsupportedOperationException: Cannot evaluate expression: > <lambda>(input[0, bigint, true]) > == Physical Plan == > 
java.lang.UnsupportedOperationException: Cannot evaluate expression: > <lambda>(input[0, bigint, true]) > {code} > It fails when the join is: > * how='outer', on=column expression > * how='left_outer', on=string or column expression > * how='right_outer', on=string or column expression > It passes when the join is: > * how='inner', on=string or column expression > * how='outer', on=string > I made some tests to demonstrate each of these. > Run with bin/spark-submit test_bug.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17346) Kafka 0.10 support in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491487#comment-15491487 ] Apache Spark commented on SPARK-17346: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/15102 > Kafka 0.10 support in Structured Streaming > -- > > Key: SPARK-17346 > URL: https://issues.apache.org/jira/browse/SPARK-17346 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Frederick Reiss > > Implement Kafka 0.10-based sources and sinks for Structured Streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17346) Kafka 0.10 support in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17346: Assignee: Apache Spark > Kafka 0.10 support in Structured Streaming > -- > > Key: SPARK-17346 > URL: https://issues.apache.org/jira/browse/SPARK-17346 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Frederick Reiss >Assignee: Apache Spark > > Implement Kafka 0.10-based sources and sinks for Structured Streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17346) Kafka 0.10 support in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-17346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17346: Assignee: (was: Apache Spark) > Kafka 0.10 support in Structured Streaming > -- > > Key: SPARK-17346 > URL: https://issues.apache.org/jira/browse/SPARK-17346 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Reporter: Frederick Reiss > > Implement Kafka 0.10-based sources and sinks for Structured Streaming. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams
[ https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491460#comment-15491460 ] Jeff Nadler commented on SPARK-17510: - That would be incredible, thank you very much for looking into this. If we were able to set maxRate independently I think we'd just use Direct Streams for both topics, with no maxRate for the session stream, and our normal maxRate for the events. > Set Streaming MaxRate Independently For Multiple Streams > > > Key: SPARK-17510 > URL: https://issues.apache.org/jira/browse/SPARK-17510 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Jeff Nadler > > We use multiple DStreams coming from different Kafka topics in a Streaming > application. > Some settings like maxrate and backpressure enabled/disabled would be better > passed as config to KafkaUtils.createStream and > KafkaUtils.createDirectStream, instead of setting them in SparkConf. > Being able to set a different maxrate for different streams is an important > requirement for us; we currently work-around the problem by using one > receiver-based stream and one direct stream. > We would like to be able to turn on backpressure for only one of the streams > as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams
[ https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491455#comment-15491455 ] Cody Koeninger commented on SPARK-17510: Ok, next time I get some free hacking time I can make a branch for you to try out, to see if it addresses the issue > Set Streaming MaxRate Independently For Multiple Streams > > > Key: SPARK-17510 > URL: https://issues.apache.org/jira/browse/SPARK-17510 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Jeff Nadler > > We use multiple DStreams coming from different Kafka topics in a Streaming > application. > Some settings like maxrate and backpressure enabled/disabled would be better > passed as config to KafkaUtils.createStream and > KafkaUtils.createDirectStream, instead of setting them in SparkConf. > Being able to set a different maxrate for different streams is an important > requirement for us; we currently work-around the problem by using one > receiver-based stream and one direct stream. > We would like to be able to turn on backpressure for only one of the streams > as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17100) pyspark filter on a udf column after join gives java.lang.UnsupportedOperationException
[ https://issues.apache.org/jira/browse/SPARK-17100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-17100: -- Assignee: Davies Liu > pyspark filter on a udf column after join gives > java.lang.UnsupportedOperationException > --- > > Key: SPARK-17100 > URL: https://issues.apache.org/jira/browse/SPARK-17100 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.0.0 > Environment: spark-2.0.0-bin-hadoop2.7. Python2 and Python3. >Reporter: Tim Sell >Assignee: Davies Liu > Attachments: bug.py, test_bug.py > > > In pyspark, when filtering on a udf derived column after some join types, > the optimized logical plan results is a > java.lang.UnsupportedOperationException. > I could not replicate this in scala code from the shell, just python. It is a > pyspark regression from spark 1.6.2. > This can be replicated with: bin/spark-submit bug.py > {code:python:title=bug.py} > import pyspark.sql.functions as F > from pyspark.sql import Row, SparkSession > if __name__ == '__main__': > spark = SparkSession.builder.appName("test").getOrCreate() > left = spark.createDataFrame([Row(a=1)]) > right = spark.createDataFrame([Row(a=1)]) > df = left.join(right, on='a', how='left_outer') > df = df.withColumn('b', F.udf(lambda x: 'x')(df.a)) > df = df.filter('b = "x"') > df.explain(extended=True) > {code} > The output is: > {code} > == Parsed Logical Plan == > 'Filter ('b = x) > +- Project [a#0L, (a#0L) AS b#8] >+- Project [a#0L] > +- Join LeftOuter, (a#0L = a#3L) > :- LogicalRDD [a#0L] > +- LogicalRDD [a#3L] > == Analyzed Logical Plan == > a: bigint, b: string > Filter (b#8 = x) > +- Project [a#0L, (a#0L) AS b#8] >+- Project [a#0L] > +- Join LeftOuter, (a#0L = a#3L) > :- LogicalRDD [a#0L] > +- LogicalRDD [a#3L] > == Optimized Logical Plan == > java.lang.UnsupportedOperationException: Cannot evaluate expression: > (input[0, bigint, true]) > == Physical Plan == > java.lang.UnsupportedOperationException: Cannot evaluate 
expression: > (input[0, bigint, true]) > {code} > It fails when the join is: > * how='outer', on=column expression > * how='left_outer', on=string or column expression > * how='right_outer', on=string or column expression > It passes when the join is: > * how='inner', on=string or column expression > * how='outer', on=string > I made some tests to demonstrate each of these. > Run with bin/spark-submit test_bug.py -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17465) Inappropriate memory management in `org.apache.spark.storage.MemoryStore` may lead to memory leak
[ https://issues.apache.org/jira/browse/SPARK-17465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17465: --- Fix Version/s: 2.1.0 2.0.1 > Inappropriate memory management in `org.apache.spark.storage.MemoryStore` may > lead to memory leak > - > > Key: SPARK-17465 > URL: https://issues.apache.org/jira/browse/SPARK-17465 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 1.6.1, 1.6.2 >Reporter: Xing Shi >Assignee: Xing Shi > Fix For: 1.6.3, 2.0.1, 2.1.0 > > > After updating Spark from 1.5.0 to 1.6.0, I found that it seems to have a > memory leak on my Spark streaming application. > Here is the head of the heap histogram of my application, which has been > running about 160 hours: > {code:borderStyle=solid} > num #instances #bytes class name > -- >1: 28094 71753976 [B >2: 1188086 28514064 java.lang.Long >3: 1183844 28412256 scala.collection.mutable.DefaultEntry >4:102242 13098768 >5:102242 12421000 >6: 81849199032 >7:388391584 [Lscala.collection.mutable.HashEntry; >8: 81847514288 >9: 66514874080 > 10: 371973438040 [C > 11: 64232445640 > 12: 87731044808 java.lang.Class > 13: 36869 884856 java.lang.String > 14: 15715 848368 [[I > 15: 13690 782808 [S > 16: 18903 604896 > java.util.concurrent.ConcurrentHashMap$HashEntry > 17:13 426192 [Lscala.concurrent.forkjoin.ForkJoinTask; > {code} > It shows that *scala.collection.mutable.DefaultEntry* and *java.lang.Long* > have unexpected big numbers of instances. In fact, the numbers started > growing at streaming process began, and keep growing proportional to total > number of tasks. > After some further investigation, I found that the problem is caused by some > inappropriate memory management in _releaseUnrollMemoryForThisTask_ and > _unrollSafely_ method of class > [org.apache.spark.storage.MemoryStore|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala]. 
> In Spark 1.6.x, a _releaseUnrollMemoryForThisTask_ operation will be > processed only with the parameter _memoryToRelease_ > 0: > https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L530-L537 > But in fact, if a task successfully unrolled all its blocks in memory by > _unrollSafely_ method, the memory saved in _unrollMemoryMap_ would be set to > zero: > https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L322 > So the result is, the memory saved in _unrollMemoryMap_ will be released, but > the key of that part of memory will never be removed from the hash map. The > hash table will keep increasing, while new tasks keep incoming. Although the > speed of increase is comparatively slow (about dozens of bytes per task), it > is possible that result into OOM after weeks or months. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
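The bookkeeping flaw described above can be modeled outside Spark. The following toy Python sketch is illustrative only (it is not Spark's `MemoryStore` code; the class and method names are invented for the model): releasing a zero-byte reservation without ever removing its key leaves one dead entry in the map per task, which is exactly the slow growth reported.

```python
# Toy model (NOT Spark code) of the unrollMemoryMap bookkeeping described above.

class MemoryStoreModel:
    def __init__(self):
        self.unroll_memory_map = {}  # taskAttemptId -> bytes reserved for unrolling

    def reserve_unroll_memory(self, task_id, num_bytes):
        self.unroll_memory_map[task_id] = (
            self.unroll_memory_map.get(task_id, 0) + num_bytes)

    def release_unroll_memory_buggy(self, task_id):
        # Mirrors the 1.6.x behavior: only act when memoryToRelease > 0, so a
        # task whose reservation was already set to zero leaves its key behind.
        memory_to_release = self.unroll_memory_map.get(task_id, 0)
        if memory_to_release > 0:
            self.unroll_memory_map[task_id] -= memory_to_release
            # ...the value drops to 0, but the key is never deleted.

    def release_unroll_memory_fixed(self, task_id):
        # The fix: drop the entry outright once the task is finished with it.
        self.unroll_memory_map.pop(task_id, None)


store = MemoryStoreModel()
for task_id in range(1000):
    store.reserve_unroll_memory(task_id, 0)  # unrollSafely left 0 bytes recorded
    store.release_unroll_memory_buggy(task_id)
print(len(store.unroll_memory_map))  # 1000 -- one leaked entry per task
```

This matches the symptom in the heap histogram: the `DefaultEntry`/`Long` counts grow in proportion to the total number of tasks ever run.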
[jira] [Commented] (SPARK-17544) Timeout waiting for connection from pool, DataFrame Reader's not closing S3 connections?
[ https://issues.apache.org/jira/browse/SPARK-17544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491445#comment-15491445 ] Brady Auen commented on SPARK-17544: http://stackoverflow.com/questions/39498492/spark-and-amazon-emr-s3-connections-not-being-closed > Timeout waiting for connection from pool, DataFrame Reader's not closing S3 > connections? > > > Key: SPARK-17544 > URL: https://issues.apache.org/jira/browse/SPARK-17544 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 > Environment: Amazon EMR, S3, Scala >Reporter: Brady Auen > Labels: newbie > > I have an application that loops through a text file to find files in S3 and > then reads them in, performs some ETL processes, and then writes them out. > This works for around 80 loops until I get this: > > 16/09/14 18:58:23 INFO S3NativeFileSystem: Opening > 's3://webpt-emr-us-west-2-hadoop/edw/master/fdbk_ideavote/20160907/part-m-0.avro' > for reading > 16/09/14 18:59:13 INFO AmazonHttpClient: Unable to execute HTTP request: > Timeout waiting for connection from pool > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException: > Timeout waiting for connection from pool > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195) > at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70) > at > 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.$Proxy37.getConnection(Unknown > Source) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:837) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991) > at > com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:212) > at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy36.retrieveMetadata(Unknown Source) > at > com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:780) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1428) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:313) > at > org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSourc
[jira] [Commented] (SPARK-17114) Adding a 'GROUP BY 1' where first column is literal results in wrong answer
[ https://issues.apache.org/jira/browse/SPARK-17114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491437#comment-15491437 ] Apache Spark commented on SPARK-17114: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/15101 > Adding a 'GROUP BY 1' where first column is literal results in wrong answer > --- > > Key: SPARK-17114 > URL: https://issues.apache.org/jira/browse/SPARK-17114 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.2, 2.0.0 >Reporter: Josh Rosen > Labels: correctness > > Consider the following example: > {code} > sc.parallelize(Seq(128, 256)).toDF("int_col").registerTempTable("mytable") > // The following query should return an empty result set because the `IN` > filter condition is always false for this single-row table. > val withoutGroupBy = sqlContext.sql(""" > SELECT 'foo' > FROM mytable > WHERE int_col == 0 > """) > assert(withoutGroupBy.collect().isEmpty, "original query returned wrong > answer") > // After adding a 'GROUP BY 1' the query result should still be empty because > we'd be grouping an empty table: > val withGroupBy = sqlContext.sql(""" > SELECT 'foo' > FROM mytable > WHERE int_col == 0 > GROUP BY 1 > """) > assert(withGroupBy.collect().isEmpty, "adding GROUP BY resulted in wrong > answer") > {code} > Here, this fails the second assertion by returning a single row. It appears > that running {{group by 1}} where column 1 is a constant causes filter > conditions to be ignored. > Both PostgreSQL and SQLite return empty result sets for the query containing > the {{GROUP BY}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
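The expected semantics claimed at the end of the report are easy to verify against SQLite, which ships with Python's standard library: grouping a filtered-to-empty table should yield zero groups, and hence zero rows, whether or not {{GROUP BY 1}} is present.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (int_col INTEGER)")
conn.executemany("INSERT INTO mytable VALUES (?)", [(128,), (256,)])

# WHERE int_col = 0 matches no rows, so both queries should return nothing.
without_group_by = conn.execute(
    "SELECT 'foo' FROM mytable WHERE int_col = 0").fetchall()
with_group_by = conn.execute(
    "SELECT 'foo' FROM mytable WHERE int_col = 0 GROUP BY 1").fetchall()

print(without_group_by, with_group_by)  # [] [] -- both empty
```

The buggy Spark behavior instead returned a single {{'foo'}} row for the second query.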
[jira] [Updated] (SPARK-17465) Inappropriate memory management in `org.apache.spark.storage.MemoryStore` may lead to memory leak
[ https://issues.apache.org/jira/browse/SPARK-17465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-17465: --- Assignee: Xing Shi > Inappropriate memory management in `org.apache.spark.storage.MemoryStore` may > lead to memory leak > - > > Key: SPARK-17465 > URL: https://issues.apache.org/jira/browse/SPARK-17465 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 1.6.1, 1.6.2 >Reporter: Xing Shi >Assignee: Xing Shi > Fix For: 1.6.3 > > > After updating Spark from 1.5.0 to 1.6.0, I found that it seems to have a > memory leak on my Spark streaming application. > Here is the head of the heap histogram of my application, which has been > running about 160 hours: > {code:borderStyle=solid} > num #instances #bytes class name > -- >1: 28094 71753976 [B >2: 1188086 28514064 java.lang.Long >3: 1183844 28412256 scala.collection.mutable.DefaultEntry >4:102242 13098768 >5:102242 12421000 >6: 81849199032 >7:388391584 [Lscala.collection.mutable.HashEntry; >8: 81847514288 >9: 66514874080 > 10: 371973438040 [C > 11: 64232445640 > 12: 87731044808 java.lang.Class > 13: 36869 884856 java.lang.String > 14: 15715 848368 [[I > 15: 13690 782808 [S > 16: 18903 604896 > java.util.concurrent.ConcurrentHashMap$HashEntry > 17:13 426192 [Lscala.concurrent.forkjoin.ForkJoinTask; > {code} > It shows that *scala.collection.mutable.DefaultEntry* and *java.lang.Long* > have unexpected big numbers of instances. In fact, the numbers started > growing at streaming process began, and keep growing proportional to total > number of tasks. > After some further investigation, I found that the problem is caused by some > inappropriate memory management in _releaseUnrollMemoryForThisTask_ and > _unrollSafely_ method of class > [org.apache.spark.storage.MemoryStore|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala]. 
> In Spark 1.6.x, a _releaseUnrollMemoryForThisTask_ operation will be > processed only with the parameter _memoryToRelease_ > 0: > https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L530-L537 > But in fact, if a task successfully unrolled all its blocks in memory by > _unrollSafely_ method, the memory saved in _unrollMemoryMap_ would be set to > zero: > https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L322 > So the result is, the memory saved in _unrollMemoryMap_ will be released, but > the key of that part of memory will never be removed from the hash map. The > hash table will keep increasing, while new tasks keep incoming. Although the > speed of increase is comparatively slow (about dozens of bytes per task), it > is possible that result into OOM after weeks or months. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17465) Inappropriate memory management in `org.apache.spark.storage.MemoryStore` may lead to memory leak
[ https://issues.apache.org/jira/browse/SPARK-17465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-17465. Resolution: Fixed Fix Version/s: 1.6.3 Issue resolved by pull request 15022 [https://github.com/apache/spark/pull/15022] > Inappropriate memory management in `org.apache.spark.storage.MemoryStore` may > lead to memory leak > - > > Key: SPARK-17465 > URL: https://issues.apache.org/jira/browse/SPARK-17465 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.0, 1.6.1, 1.6.2 >Reporter: Xing Shi > Fix For: 1.6.3 > > > After updating Spark from 1.5.0 to 1.6.0, I found that it seems to have a > memory leak on my Spark streaming application. > Here is the head of the heap histogram of my application, which has been > running about 160 hours: > {code:borderStyle=solid} > num #instances #bytes class name > -- >1: 28094 71753976 [B >2: 1188086 28514064 java.lang.Long >3: 1183844 28412256 scala.collection.mutable.DefaultEntry >4:102242 13098768 >5:102242 12421000 >6: 81849199032 >7:388391584 [Lscala.collection.mutable.HashEntry; >8: 81847514288 >9: 66514874080 > 10: 371973438040 [C > 11: 64232445640 > 12: 87731044808 java.lang.Class > 13: 36869 884856 java.lang.String > 14: 15715 848368 [[I > 15: 13690 782808 [S > 16: 18903 604896 > java.util.concurrent.ConcurrentHashMap$HashEntry > 17:13 426192 [Lscala.concurrent.forkjoin.ForkJoinTask; > {code} > It shows that *scala.collection.mutable.DefaultEntry* and *java.lang.Long* > have unexpected big numbers of instances. In fact, the numbers started > growing at streaming process began, and keep growing proportional to total > number of tasks. 
> After some further investigation, I found that the problem is caused by some > inappropriate memory management in _releaseUnrollMemoryForThisTask_ and > _unrollSafely_ method of class > [org.apache.spark.storage.MemoryStore|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala]. > In Spark 1.6.x, a _releaseUnrollMemoryForThisTask_ operation will be > processed only with the parameter _memoryToRelease_ > 0: > https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L530-L537 > But in fact, if a task successfully unrolled all its blocks in memory by > _unrollSafely_ method, the memory saved in _unrollMemoryMap_ would be set to > zero: > https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/storage/MemoryStore.scala#L322 > So the result is, the memory saved in _unrollMemoryMap_ will be released, but > the key of that part of memory will never be removed from the hash map. The > hash table will keep increasing, while new tasks keep incoming. Although the > speed of increase is comparatively slow (about dozens of bytes per task), it > is possible that result into OOM after weeks or months. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17546) start-* scripts should use hostname --fqdn
Kevin Burton created SPARK-17546: Summary: start-* scripts should use hostname --fqdn Key: SPARK-17546 URL: https://issues.apache.org/jira/browse/SPARK-17546 Project: Spark Issue Type: Bug Affects Versions: 2.0.0 Reporter: Kevin Burton The ./sbin/start-slaves.sh and ./sbin/start-master.sh scripts use plain 'hostname', so if /etc/hostname doesn't contain an FQDN, the scripts don't get the fully qualified domain name. Switching the scripts to 'hostname --fqdn' solves the problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
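A minimal sketch of the suggested change (this is not the actual patch): prefer the FQDN, but keep a fallback for systems such as BSD/macOS whose hostname(1) does not support the GNU --fqdn flag.

```shell
# Hypothetical helper for the start-* scripts: try the FQDN first,
# fall back to the plain hostname if --fqdn is unsupported or fails.
HOST="$(hostname --fqdn 2>/dev/null || hostname)"
echo "$HOST"
```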
[jira] [Commented] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams
[ https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491412#comment-15491412 ] Jeff Nadler commented on SPARK-17510: - Well... both streams use updateStateByKey. The session one is simple+fast, the event one is complex+slow. I have a branch now where I experiment with mapWithState. I don't expect huge benefits because we need expiration, and I understand from the docs that this negates a lot of the benefit. I'd still love to get some perf gains of course :) Don't have a Kafka 0.10 cluster right now but could stand one up pretty quick. > Set Streaming MaxRate Independently For Multiple Streams > > > Key: SPARK-17510 > URL: https://issues.apache.org/jira/browse/SPARK-17510 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Jeff Nadler > > We use multiple DStreams coming from different Kafka topics in a Streaming > application. > Some settings like maxrate and backpressure enabled/disabled would be better > passed as config to KafkaUtils.createStream and > KafkaUtils.createDirectStream, instead of setting them in SparkConf. > Being able to set a different maxrate for different streams is an important > requirement for us; we currently work-around the problem by using one > receiver-based stream and one direct stream. > We would like to be able to turn on backpressure for only one of the streams > as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
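Independent of Spark's internals, the per-stream maxRate requested in this ticket amounts to each stream carrying its own rate limiter instead of sharing one SparkConf-wide setting. A sketch of that idea in plain Python (illustrative only; `RateLimiter` and the stream names are invented here, not a Spark API):

```python
import time

class RateLimiter:
    """Token-bucket limiter: at most max_rate records per second.

    max_rate=None disables limiting, like the unthrottled session
    stream described in this ticket."""

    def __init__(self, max_rate):
        self.max_rate = max_rate
        self.tokens = float(max_rate) if max_rate else 0.0
        self.last = time.monotonic()

    def acquire(self, n=1):
        if self.max_rate is None:
            return  # unlimited stream: never blocks
        while True:
            now = time.monotonic()
            # Refill the bucket in proportion to elapsed time, capped at max_rate.
            self.tokens = min(self.max_rate,
                              self.tokens + (now - self.last) * self.max_rate)
            self.last = now
            if self.tokens >= n:
                self.tokens -= n
                return
            time.sleep((n - self.tokens) / self.max_rate)

# One limiter per stream, instead of one global spark.streaming.*.maxRate:
limits = {"sessions": RateLimiter(None), "events": RateLimiter(10000)}
```

The same shape would apply to backpressure: a per-stream flag rather than a single global one.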
[jira] [Created] (SPARK-17545) Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset
Nathan Beyer created SPARK-17545: Summary: Spark SQL Catalyst doesn't handle ISO 8601 date without colon in offset Key: SPARK-17545 URL: https://issues.apache.org/jira/browse/SPARK-17545 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Nathan Beyer When parsing a CSV with a date/time column that contains a variant of ISO 8601 that doesn't include a colon in the offset, casting to Timestamp fails. Here's a simple example of the CSV content: {quote} time "2015-07-20T15:09:23.736-0500" "2015-07-20T15:10:51.687-0500" "2015-11-21T23:15:01.499-0600" {quote} Here's the stack trace that results from processing this data. {quote} 16/09/14 15:22:59 ERROR Utils: Aborting task java.lang.IllegalArgumentException: 2015-11-21T23:15:01.499-0600 at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.skip(Unknown Source) at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl$Parser.parse(Unknown Source) at org.apache.xerces.jaxp.datatype.XMLGregorianCalendarImpl.(Unknown Source) at org.apache.xerces.jaxp.datatype.DatatypeFactoryImpl.newXMLGregorianCalendar(Unknown Source) at javax.xml.bind.DatatypeConverterImpl._parseDateTime(DatatypeConverterImpl.java:422) at javax.xml.bind.DatatypeConverterImpl.parseDateTime(DatatypeConverterImpl.java:417) at javax.xml.bind.DatatypeConverter.parseDateTime(DatatypeConverter.java:327) at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:140) at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:287) {quote} Somewhat related, I believe Python standard libraries can produce this form of zone offset. The system I got the data from is written in Python. https://docs.python.org/2/library/datetime.html#strftime-strptime-behavior -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
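For reference, Python 3's stdlib parses the colon-less offset form directly via the `%z` directive (Python 2's strptime does not support `%z`, even though Python 2 code can produce this form), and can normalize it to the colon form that XML-Schema-style parsers such as `DatatypeConverter.parseDateTime` require. A small sketch:

```python
from datetime import datetime, timedelta

# %z accepts the "-0500" spelling that DatatypeConverter.parseDateTime
# rejects (the latter requires "-05:00"):
ts = datetime.strptime("2015-07-20T15:09:23.736-0500",
                       "%Y-%m-%dT%H:%M:%S.%f%z")
print(ts.utcoffset())   # -1 day, 19:00:00  i.e. UTC-5
assert ts.utcoffset() == timedelta(hours=-5)

# isoformat() re-emits the offset with a colon, the form stricter
# parsers accept:
print(ts.isoformat())   # 2015-07-20T15:09:23.736000-05:00
```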
[jira] [Resolved] (SPARK-17472) Better error message for serialization failures of large objects in Python
[ https://issues.apache.org/jira/browse/SPARK-17472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-17472. Resolution: Fixed Fix Version/s: 2.1.0 Issue resolved by pull request 15026 [https://github.com/apache/spark/pull/15026] > Better error message for serialization failures of large objects in Python > -- > > Key: SPARK-17472 > URL: https://issues.apache.org/jira/browse/SPARK-17472 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Eric Liang >Priority: Minor > Fix For: 2.1.0 > > > {code} > def run(): > import numpy.random as nr > b = nr.bytes(8 * 10) > sc.parallelize(range(1000), 1000).map(lambda x: len(b)).count() > run() > {code} > Gives you the following error from pickle > {code} > error: 'i' format requires -2147483648 <= number <= 2147483647 > --- > error Traceback (most recent call last) > in () > 4 sc.parallelize(range(1000), 1000).map(lambda x: len(b)).count() > 5 > > 6 run() > in run() > 2 import numpy.random as nr > 3 b = nr.bytes(8 * 10) > > 4 sc.parallelize(range(1000), 1000).map(lambda x: len(b)).count() > 5 > 6 run() > /databricks/spark/python/pyspark/rdd.pyc in count(self) >1002 3 >1003 """ > -> 1004 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() >1005 >1006 def stats(self): > /databricks/spark/python/pyspark/rdd.pyc in sum(self) > 993 6.0 > 994 """ > --> 995 return self.mapPartitions(lambda x: [sum(x)]).fold(0, > operator.add) > 996 > 997 def count(self): > /databricks/spark/python/pyspark/rdd.pyc in fold(self, zeroValue, op) > 867 # zeroValue provided to each partition is unique from the one > provided > 868 # to the final reduce call > --> 869 vals = self.mapPartitions(func).collect() > 870 return reduce(op, vals, zeroValue) > 871 > /databricks/spark/python/pyspark/rdd.pyc in collect(self) > 769 """ > 770 with SCCallSiteSync(self.context) as css: > --> 771 port = > self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) > 772 return 
list(_load_from_socket(port, self._jrdd_deserializer)) > 773 > /databricks/spark/python/pyspark/rdd.pyc in _jrdd(self) >2377 command = (self.func, profiler, self._prev_jrdd_deserializer, >2378self._jrdd_deserializer) > -> 2379 pickled_cmd, bvars, env, includes = > _prepare_for_python_RDD(self.ctx, command, self) >2380 python_rdd = self.ctx._jvm.PythonRDD(self._prev_jrdd.rdd(), >2381 bytearray(pickled_cmd), > /databricks/spark/python/pyspark/rdd.pyc in _prepare_for_python_RDD(sc, > command, obj) >2297 # the serialized command will be compressed by broadcast >2298 ser = CloudPickleSerializer() > -> 2299 pickled_command = ser.dumps(command) >2300 if len(pickled_command) > (1 << 20): # 1M >2301 # The broadcast will have same life cycle as created PythonRDD > /databricks/spark/python/pyspark/serializers.pyc in dumps(self, obj) > 426 > 427 def dumps(self, obj): > --> 428 return cloudpickle.dumps(obj, 2) > 429 > 430 > /databricks/spark/python/pyspark/cloudpickle.pyc in dumps(obj, protocol) > 655 > 656 cp = CloudPickler(file,protocol) > --> 657 cp.dump(obj) > 658 > 659 return file.getvalue() > /databricks/spark/python/pyspark/cloudpickle.pyc in dump(self, obj) > 105 self.inject_addons() > 106 try: > --> 107 return Pickler.dump(self, obj) > 108 except RuntimeError as e: > 109 if 'recursion' in e.args[0]: > /usr/lib/python2.7/pickle.pyc in dump(self, obj) > 222 if self.proto >= 2: > 223 self.write(PROTO + chr(self.proto)) > --> 224 self.save(obj) > 225 self.write(STOP) > 226 > /usr/lib/python2.7/pickle.pyc in save(self, obj) > 284 f = self.dispatch.get(t) > 285 if f: > --> 286 f(self, obj) # Call unbound method with explicit self > 287 return > 288 > /usr/lib/python2.7/pickle.pyc in save_tuple(self, obj) > 560 write(MARK) > 561 for element in obj: > --> 562 save(element) > 563 > 564 if id(obj) in memo: > /usr/lib/python2.7/pickle.pyc in save(self, obj) > 284 f = self.dispatch.get(t) > 285 if f: > --> 286 f(self, obj) # C
[jira] [Commented] (SPARK-17544) Timeout waiting for connection from pool, DataFrame Reader's not closing S3 connections?
[ https://issues.apache.org/jira/browse/SPARK-17544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491397#comment-15491397 ] Davies Liu commented on SPARK-17544: Could you post some code to reproduce the issue? > Timeout waiting for connection from pool, DataFrame Reader's not closing S3 > connections? > > > Key: SPARK-17544 > URL: https://issues.apache.org/jira/browse/SPARK-17544 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.0.0 > Environment: Amazon EMR, S3, Scala >Reporter: Brady Auen > Labels: newbie > > I have an application that loops through a text file to find files in S3 and > then reads them in, performs some ETL processes, and then writes them out. > This works for around 80 loops until I get this: > > 16/09/14 18:58:23 INFO S3NativeFileSystem: Opening > 's3://webpt-emr-us-west-2-hadoop/edw/master/fdbk_ideavote/20160907/part-m-0.avro' > for reading > 16/09/14 18:59:13 INFO AmazonHttpClient: Unable to execute HTTP request: > Timeout waiting for connection from pool > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException: > Timeout waiting for connection from pool > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195) > at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.$Proxy37.getConnection(Unknown > Source) > at > 
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) > at > com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:837) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015) > at > com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991) > at > com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:212) > at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) > at > 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) > at com.sun.proxy.$Proxy36.retrieveMetadata(Unknown Source) > at > com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:780) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1428) > at > com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:313) > at > org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:324) > at > org.apache.spark.sql.
[jira] [Resolved] (SPARK-17463) Serialization of accumulators in heartbeats is not thread-safe
[ https://issues.apache.org/jira/browse/SPARK-17463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-17463. Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 15063 [https://github.com/apache/spark/pull/15063] > Serialization of accumulators in heartbeats is not thread-safe > -- > > Key: SPARK-17463 > URL: https://issues.apache.org/jira/browse/SPARK-17463 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 >Reporter: Josh Rosen >Assignee: Shixiong Zhu >Priority: Critical > Fix For: 2.0.1, 2.1.0 > > > Check out the following {{ConcurrentModificationException}}: > {code} > 16/09/06 16:10:29 WARN NettyRpcEndpointRef: Error sending message [message = > Heartbeat(2,[Lscala.Tuple2;@66e7b6e7,BlockManagerId(2, HOST, 57743))] in 1 > attempts > org.apache.spark.SparkException: Exception thrown in awaitResult > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75) > at > scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at > org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59) > at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162) > at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83) > at > org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102) > at > org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$reportHeartBeat(Executor.scala:518) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply$mcV$sp(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at > org.apache.spark.executor.Executor$$anon$1$$anonfun$run$1.apply(Executor.scala:547) > at 
org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1862) > at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:547) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: java.util.ConcurrentModificationException > at java.util.ArrayList.writeObject(ArrayList.java:766) > at sun.reflect.GeneratedMethodAccessor20.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:497) > at > java.io.ObjectStreamClass.invokeWriteObject(ObjectStreamClass.java:1028) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1496) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOutputStream.java:1378) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1174) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > 
java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at > java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548) > at > java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509) > at > java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432) > at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178) > at java.io.ObjectOutputStream.writeArray(ObjectOut
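The {{ConcurrentModificationException}} above comes from a heartbeat thread serializing the executor's accumulator list ({{ArrayList.writeObject}}) while a task thread is still appending to it. The fix pattern is to snapshot shared state under a lock before handing it to the serializer. A minimal sketch of that pattern in Python (the names here are illustrative, not Spark's actual internals):

```python
import copy
import threading

class AccumulatorRegistry:
    """Illustrative stand-in for per-task accumulator state on an executor."""

    def __init__(self):
        self._lock = threading.Lock()
        self._updates = []

    def add(self, value):
        # Task thread: mutate only while holding the lock.
        with self._lock:
            self._updates.append(value)

    def snapshot(self):
        # Heartbeat thread: copy under the lock, then serialize the copy at
        # leisure; the live list can keep changing without racing the copy.
        with self._lock:
            return copy.deepcopy(self._updates)

registry = AccumulatorRegistry()
for v in (1, 2, 3):
    registry.add(v)
snap = registry.snapshot()
registry.add(4)  # later mutations do not affect the earlier snapshot
```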
[jira] [Commented] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams
[ https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491389#comment-15491389 ] Cody Koeninger commented on SPARK-17510: Just for clarity's sake, compute time is far higher on the stream that is using updateStateByKey? Have you tried mapWithState? Changing max rate to be per-partition isn't actually a big change in terms of number of lines; the calculations are already done per partition because of backpressure. It's more a question of whether it's worth adding more surface area to the creation API. If I make a branch, are you in a position to test it with a Kafka 0.10 cluster, or not? > Set Streaming MaxRate Independently For Multiple Streams > > > Key: SPARK-17510 > URL: https://issues.apache.org/jira/browse/SPARK-17510 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Jeff Nadler > > We use multiple DStreams coming from different Kafka topics in a Streaming > application. > Some settings like maxRate and backpressure enabled/disabled would be better > passed as config to KafkaUtils.createStream and > KafkaUtils.createDirectStream, instead of setting them in SparkConf. > Being able to set a different maxRate for different streams is an important > requirement for us; we currently work around the problem by using one > receiver-based stream and one direct stream. > We would like to be able to turn on backpressure for only one of the streams > as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
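As the discussion notes, the rate-limit knobs are application-global SparkConf keys today, which is exactly why they cannot differ per stream. For reference, these are the keys in play (the values below are illustrative):

```shell
# Today's knobs are application-global: every direct stream shares one
# per-partition cap and every receiver-based stream shares one receiver cap.
# That split between the two keys is what makes the reporter's
# one-receiver-stream-plus-one-direct-stream workaround possible.
spark-submit \
  --conf spark.streaming.backpressure.enabled=true \
  --conf spark.streaming.receiver.maxRate=10000 \
  --conf spark.streaming.kafka.maxRatePerPartition=1000 \
  my-streaming-app.jar
```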
[jira] [Resolved] (SPARK-17511) Dynamic allocation race condition: Containers getting marked failed while releasing
[ https://issues.apache.org/jira/browse/SPARK-17511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-17511. --- Resolution: Fixed Assignee: Kishor Patil Fix Version/s: 2.1.0 2.0.1 > Dynamic allocation race condition: Containers getting marked failed while > releasing > --- > > Key: SPARK-17511 > URL: https://issues.apache.org/jira/browse/SPARK-17511 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.0.0, 2.0.1, 2.1.0 >Reporter: Kishor Patil >Assignee: Kishor Patil > Fix For: 2.0.1, 2.1.0 > > > While trying to launch multiple containers from the pool, if the running > executor count reaches or goes beyond the target number of running executors, the > container is released and marked failed. This can cause many jobs to be > marked failed, causing overall job failure. > I will have a patch up soon after completing testing. > {panel:title=Typical Exception found in Driver marking the container to > Failed} > {code} > java.lang.AssertionError: assertion failed > at scala.Predef$.assert(Predef.scala:156) > at > org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1.org$apache$spark$deploy$yarn$YarnAllocator$$anonfun$$updateInternalState$1(YarnAllocator.scala:489) > at > org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1$$anon$1.run(YarnAllocator.scala:519) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > {panel} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16424) Add support for Structured Streaming to the ML Pipeline API
[ https://issues.apache.org/jira/browse/SPARK-16424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491250#comment-15491250 ] holdenk commented on SPARK-16424: - Just an update - we have a really early proof-of-concept branch showing something like this approach can work. After the Strata talk later this month we're planning on collecting any additional feedback and working on getting that proof of concept into a shape where we can make an early draft PR for people who are more interested in reviewing those than looking at design documents. While I've reached out to [~marmbrus] & [~rams] by e-mail, I'm pinging on the JIRA as well in case you have a chance to take a look. Also cc [~tdas], who was helpful in answering some of my questions; if you have any bandwidth to look at this, that would be greatly appreciated. > Add support for Structured Streaming to the ML Pipeline API > --- > > Key: SPARK-16424 > URL: https://issues.apache.org/jira/browse/SPARK-16424 > Project: Spark > Issue Type: Improvement > Components: ML, SQL, Streaming >Reporter: holdenk > > For Spark 2.1 we should consider adding support for machine learning on top > of the structured streaming API. > Early work in progress design outline: > https://docs.google.com/document/d/1snh7x7b0dQIlTsJNHLr-IxIFgP43RfRV271YK2qGiFQ/edit?usp=sharing -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17544) Timeout waiting for connection from pool, DataFrame Reader's not closing S3 connections?
Brady Auen created SPARK-17544: -- Summary: Timeout waiting for connection from pool, DataFrame Reader's not closing S3 connections? Key: SPARK-17544 URL: https://issues.apache.org/jira/browse/SPARK-17544 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.0.0 Environment: Amazon EMR, S3, Scala Reporter: Brady Auen I have an application that loops through a text file to find files in S3 and then reads them in, performs some ETL processes, and then writes them out. This works for around 80 loops until I get this: 16/09/14 18:58:23 INFO S3NativeFileSystem: Opening 's3://webpt-emr-us-west-2-hadoop/edw/master/fdbk_ideavote/20160907/part-m-0.avro' for reading 16/09/14 18:59:13 INFO AmazonHttpClient: Unable to execute HTTP request: Timeout waiting for connection from pool com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectionPoolTimeoutException: Timeout waiting for connection from pool at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager.leaseConnection(PoolingClientConnectionManager.java:226) at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.conn.PoolingClientConnectionManager$1.getConnection(PoolingClientConnectionManager.java:195) at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.ClientConnectionRequestFactory$Handler.invoke(ClientConnectionRequestFactory.java:70) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.conn.$Proxy37.getConnection(Unknown Source) at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:423) at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863) at 
com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) at com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeOneRequest(AmazonHttpClient.java:837) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:607) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.doExecute(AmazonHttpClient.java:376) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.executeWithTimer(AmazonHttpClient.java:338) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:287) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3826) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1015) at com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:991) at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:212) at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) at com.sun.proxy.$Proxy36.retrieveMetadata(Unknown Source) at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.getFileStatus(S3NativeFileSystem.java:780) at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1428) at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.exists(EmrFileSystem.java:313) at 
org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:289) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:324) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$18.apply(DataSource.scala:452) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$18.apply(DataSource.scala:458) at scala.util.Try$.apply(Try.scala:192) at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:451) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:211) at org.apache
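The stack above shows EMRFS (the shaded AWS SDK) timing out while leasing an HTTP connection from its pool; repeated metadata calls such as {{exists()}}/{{getFileStatus()}} across ~80 loop iterations can drain a small pool. A hedged mitigation sketch, assuming EMRFS honors the {{fs.s3.maxConnections}} property from the Hadoop configuration (the key and value are an assumption for illustration, not taken from this report):

```shell
# Hedged sketch: enlarge the EMRFS S3 connection pool for a long-looping
# ETL job. fs.s3.maxConnections is an emrfs-site property; spark.hadoop.*
# settings are copied into the Hadoop Configuration that the filesystem reads.
spark-submit \
  --conf spark.hadoop.fs.s3.maxConnections=500 \
  my-etl-app.jar
```

If the pool still drains after being enlarged, that points at connections being leaked rather than merely undersized, which is what this issue's title suspects.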
[jira] [Resolved] (SPARK-10747) add support for NULLS FIRST|LAST in ORDER BY clause
[ https://issues.apache.org/jira/browse/SPARK-10747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell resolved SPARK-10747. --- Resolution: Fixed Assignee: Xin Wu Fix Version/s: 2.1.0 > add support for NULLS FIRST|LAST in ORDER BY clause > --- > > Key: SPARK-10747 > URL: https://issues.apache.org/jira/browse/SPARK-10747 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.5.0 >Reporter: N Campbell >Assignee: Xin Wu > Fix For: 2.1.0 > > > You cannot express how NULLS are to be sorted in the window order > specification and have to use a compensating expression to simulate. > Error: org.apache.spark.sql.AnalysisException: line 1:76 missing ) at 'nulls' > near 'nulls' > line 1:82 missing EOF at 'last' near 'nulls'; > SQLState: null > Same limitation as Hive reported in Apache JIRA HIVE-9535 ) > This fails > select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by c3 desc > nulls last) from tolap > select rnum, c1, c2, c3, dense_rank() over(partition by c1 order by case when > c3 is null then 1 else 0 end) from tolap -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
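The compensating expression in the report works because it sorts on a computed null flag before the value itself. The same two-key trick, sketched in plain Python to show the ordering it produces (illustrative only, not Spark code):

```python
rows = [3, None, 1, None, 2]

# ORDER BY (CASE WHEN c3 IS NULL THEN 1 ELSE 0 END), c3
# i.e. sort non-null values normally and push NULLs to the end.
nulls_last = sorted(rows, key=lambda v: (v is None, v if v is not None else 0))

# NULLS FIRST is the mirror image: flip the null flag.
nulls_first = sorted(rows, key=lambda v: (v is not None, v if v is not None else 0))
```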
[jira] [Comment Edited] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491186#comment-15491186 ] Evan Zamir edited comment on SPARK-17508 at 9/14/16 6:53 PM: - Honestly, if the documentation was just more explicit, users wouldn't be so confused. But when it says {{weightCol=None}}, there's only one way we can interpret that in Python, and it happens to produce an error. Why doesn't someone just change the docstring to read {{weightCol=""}} (which apparently is the way one has to write the code to run without error)? was (Author: zamir.e...@gmail.com): Honestly, if the documentation was just more explicit, users wouldn't be so confused. But when it says {{weightCol=None}}, there's only one way we can interpret that in Python, and it happens to produce an error. Why doesn't someone just change the docstring to read {{weightCol=""}}? > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir >Priority: Minor > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. 
> : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491186#comment-15491186 ] Evan Zamir edited comment on SPARK-17508 at 9/14/16 6:52 PM: - Honestly, if the documentation was just more explicit, users wouldn't be so confused. But when it says {{weightCol=None}}, there's only one way we can interpret that in Python, and it happens to produce an error. Why doesn't someone just change the docstring to read {{weightCol=""}}? was (Author: zamir.e...@gmail.com): Honestly, if the documentation was just more explicit, users wouldn't be so confused. But when it says `weightCol=None`, there's only one way we can interpret that in Python, and it happens to produce an error. Why doesn't someone just change the docstring to read `weightCol=""`? > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir >Priority: Minor > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. 
> : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491186#comment-15491186 ] Evan Zamir commented on SPARK-17508: Honestly, if the documentation was just more explicit, users wouldn't be so confused. But when it says `weightCol=None`, there's only one way we can interpret that in Python, and it happens to produce an error. Why doesn't someone just change the docstring to read `weightCol=""`? > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir >Priority: Minor > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. 
> : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17508) Setting weightCol to None in ML library causes an error
[ https://issues.apache.org/jira/browse/SPARK-17508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491163#comment-15491163 ] Bryan Cutler commented on SPARK-17508: -- I had a similar discussion in this PR https://github.com/apache/spark/pull/12790 about supporting a param value of {{None}}. I think people mostly just wanted to keep PySpark in sync with the Scala API and not add additional support for {{None}}. > Setting weightCol to None in ML library causes an error > --- > > Key: SPARK-17508 > URL: https://issues.apache.org/jira/browse/SPARK-17508 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.0.0 >Reporter: Evan Zamir >Priority: Minor > > The following code runs without error: > {code} > spark = SparkSession.builder.appName('WeightBug').getOrCreate() > df = spark.createDataFrame( > [ > (1.0, 1.0, Vectors.dense(1.0)), > (0.0, 1.0, Vectors.dense(-1.0)) > ], > ["label", "weight", "features"]) > lr = LogisticRegression(maxIter=5, regParam=0.0, weightCol="weight") > model = lr.fit(df) > {code} > My expectation from reading the documentation is that setting weightCol=None > should treat all weights as 1.0 (regardless of whether a column exists). 
> However, the same code with weightCol set to None causes the following errors: > Traceback (most recent call last): > File "/Users/evanzamir/ams/px-seed-model/scripts/bug.py", line 32, in > > model = lr.fit(df) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/base.py", line > 64, in fit > return self._fit(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 213, in _fit > java_model = self._fit_java(dataset) > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/ml/wrapper.py", > line 210, in _fit_java > return self._java_obj.fit(dataset._jdf) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/usr/local/spark-2.0.0-bin-hadoop2.7/python/pyspark/sql/utils.py", > line 63, in deco > return f(*a, **kw) > File > "/usr/local/spark-2.0.0-bin-hadoop2.7/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", > line 312, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o38.fit. 
> : java.lang.NullPointerException > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:264) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:259) > at > org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:159) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:71) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:280) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at py4j.GatewayConnection.run(GatewayConnection.java:211) > at java.lang.Thread.run(Thread.java:745) > Process finished with exit code 1 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
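The distinction the thread keeps circling can be sketched as a tiny parameter-resolution function (a hypothetical helper for illustration, not PySpark's actual code): {{None}} crosses the Py4J bridge as a JVM null and fails deep inside {{train()}}, while the empty string is the sentinel meaning "no weight column".

```python
def resolve_weight_col(weight_col):
    """Hypothetical sketch of the behavior users hit: reject None up front
    instead of letting it become a JVM null that fails later in train()."""
    if weight_col is None:
        raise TypeError("weightCol must be a string; pass '' for uniform weights")
    if weight_col == "":
        return None  # no weighting: every row implicitly has weight 1.0
    return weight_col
```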
[jira] [Commented] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams
[ https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491165#comment-15491165 ] Jeff Nadler commented on SPARK-17510: - Yes, you're right: it's partly about differing rates, but the big issue is that the compute time is far higher on one stream than on the other. That's the only reason we need rate limiting at all, really. Overprovisioning would be expensive in this case.

We have run at a higher maxRate and let backpressure manage it. It works, but it's suboptimal. During surges of data, the rate limiter fluctuates widely up and down as scheduling delay builds up and is burned off. The algorithm never levels out, though that's a separate opportunity for improvement in the backpressure implementation. When these big fluctuations in inbound rate happen, the total throughput over a longer time period (per hour) is far lower than it would be if we could just calibrate a maxRate that reflects (or comes close to) what our cluster is capable of handling.

Also, just to be clear, I understand that this is probably a long-haul change, and nothing that's going to solve our issues at this time. It seems to me like this would be the right long-term direction, but you all who are committers may not agree. For sure I am starting to feel like I'm stacking up workarounds to make Spark Streaming viable for our particular use. In my little fantasyland, there would be a separate "SparkConf" for global settings and a "StreamConf" that could be passed to the various KafkaUtils.* functions to set the spark.streaming.* settings independently for each stream.

> Set Streaming MaxRate Independently For Multiple Streams
>
> Key: SPARK-17510
> URL: https://issues.apache.org/jira/browse/SPARK-17510
> Project: Spark
> Issue Type: Improvement
> Components: Streaming
> Affects Versions: 2.0.0
> Reporter: Jeff Nadler
>
> We use multiple DStreams coming from different Kafka topics in a Streaming application.
> Some settings like maxrate and backpressure enabled/disabled would be better passed as config to KafkaUtils.createStream and KafkaUtils.createDirectStream, instead of setting them in SparkConf.
> Being able to set a different maxrate for different streams is an important requirement for us; we currently work around the problem by using one receiver-based stream and one direct stream.
> We would like to be able to turn on backpressure for only one of the streams as well.
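For context, the rate controls in question are currently application-global: they are set once in SparkConf (or spark-defaults.conf) and apply to every stream in the application, which is exactly the limitation this issue describes. A rough sketch of the existing global knobs (values here are illustrative, not recommendations):

```
spark.streaming.backpressure.enabled        true
spark.streaming.receiver.maxRate            10000
spark.streaming.kafka.maxRatePerPartition   1000
```

The proposal is to let settings like these be supplied per stream at creation time instead of once per SparkContext.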
[jira] [Commented] (SPARK-16534) Kafka 0.10 Python support
[ https://issues.apache.org/jira/browse/SPARK-16534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491107#comment-15491107 ] Maciej Bryński commented on SPARK-16534: [~rxin] Could you explain your decision? I think that dropping Python support is a very bad sign for the Spark community. In my example, I started with batch jobs in Python. Having the ability to use the same code in a streaming job was the main reason to choose Spark. Eventually I'm here with production code written in Python supporting both batch and streaming. And I won't be able to use new Kafka features like SSL. Never. Am I right? Or maybe I missed something. As far as I understand, your point is that support for Python in Spark Streaming is wrong and we shouldn't develop new features. I don't agree with that.

> Kafka 0.10 Python support
>
> Key: SPARK-16534
> URL: https://issues.apache.org/jira/browse/SPARK-16534
> Project: Spark
> Issue Type: Sub-task
> Components: Streaming
> Reporter: Tathagata Das
[jira] [Comment Edited] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams
[ https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491087#comment-15491087 ] Cody Koeninger edited comment on SPARK-17510 at 9/14/16 6:12 PM: - I use direct stream for multiple topic jobs where the rate varies by a factor of 1000x or more across topics, so I don't agree that case was "not really considered" :) It sounds more like the issue you're running into is that the amount of work per item varies widely across topics, not the rate across topics? Have you actually tried setting max rate at a level that's sufficient just to prevent crash on initial startup, then letting backpressure take over from there? Edit - Also, you mentioned using updateStateByKey. updateStateByKey does not scale particularly well, have you investigated mapWithState? was (Author: c...@koeninger.org): I use direct stream for multiple topic jobs where the rate varies by a factor of 1000x or more across topics, so I don't agree that case was "not really considered" :) It sounds more like the issue you're running into is that the amount of work per item varies widely across topics, not the rate across topics? Have you actually tried setting max rate at a level that's sufficient just to prevent crash on initial startup, then letting backpressure take over from there? > Set Streaming MaxRate Independently For Multiple Streams > > > Key: SPARK-17510 > URL: https://issues.apache.org/jira/browse/SPARK-17510 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Jeff Nadler > > We use multiple DStreams coming from different Kafka topics in a Streaming > application. > Some settings like maxrate and backpressure enabled/disabled would be better > passed as config to KafkaUtils.createStream and > KafkaUtils.createDirectStream, instead of setting them in SparkConf. 
> Being able to set a different maxrate for different streams is an important > requirement for us; we currently work-around the problem by using one > receiver-based stream and one direct stream. > We would like to be able to turn on backpressure for only one of the streams > as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams
[ https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15491087#comment-15491087 ] Cody Koeninger commented on SPARK-17510: I use direct stream for multiple topic jobs where the rate varies by a factor of 1000x or more across topics, so I don't agree that case was "not really considered" :) It sounds more like the issue you're running into is that the amount of work per item varies widely across topics, not the rate across topics? Have you actually tried setting max rate at a level that's sufficient just to prevent crash on initial startup, then letting backpressure take over from there? > Set Streaming MaxRate Independently For Multiple Streams > > > Key: SPARK-17510 > URL: https://issues.apache.org/jira/browse/SPARK-17510 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Jeff Nadler > > We use multiple DStreams coming from different Kafka topics in a Streaming > application. > Some settings like maxrate and backpressure enabled/disabled would be better > passed as config to KafkaUtils.createStream and > KafkaUtils.createDirectStream, instead of setting them in SparkConf. > Being able to set a different maxrate for different streams is an important > requirement for us; we currently work-around the problem by using one > receiver-based stream and one direct stream. > We would like to be able to turn on backpressure for only one of the streams > as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-17542) Compiler warning in UnsafeInMemorySorter class
[ https://issues.apache.org/jira/browse/SPARK-17542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-17542: -- Issue Type: Improvement (was: Bug) There are unfortunately a number of warnings, and I don't think we want a JIRA for each of them. Somebody whacks a lot of them periodically. So, sure, they're good to resolve, but I'd be more interested in sweeps that clean up as many warnings as possible. Some are un-suppressable and un-resolvable, like tests for deprecated methods in Scala. This isn't a bug, though.

> Compiler warning in UnsafeInMemorySorter class
>
> Key: SPARK-17542
> URL: https://issues.apache.org/jira/browse/SPARK-17542
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Frederick Reiss
> Priority: Trivial
> Labels: starter
>
> *This is a small starter task to help new contributors practice the pull request and code review process.*
> When building Spark with Java 8, there is a compiler warning in the class UnsafeInMemorySorter:
> {noformat}
> [WARNING] .../core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:284: warning: [static]
> static method should be qualified by type name, TaskMemoryManager, instead of by an expression
> {noformat}
> Spark should compile without these kinds of warnings.
[jira] [Commented] (SPARK-17317) Add package vignette to SparkR
[ https://issues.apache.org/jira/browse/SPARK-17317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15490995#comment-15490995 ] Apache Spark commented on SPARK-17317: -- User 'junyangq' has created a pull request for this issue: https://github.com/apache/spark/pull/15100 > Add package vignette to SparkR > -- > > Key: SPARK-17317 > URL: https://issues.apache.org/jira/browse/SPARK-17317 > Project: Spark > Issue Type: Improvement >Reporter: Junyang Qian >Assignee: Junyang Qian > Fix For: 2.1.0 > > > In publishing SparkR to CRAN, it would be nice to have a vignette as a user > guide that > * describes the big picture > * introduces the use of various methods > This is important for new users because they may not even know which method > to look up. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17514) df.take(1) and df.limit(1).collect() perform differently in Python
[ https://issues.apache.org/jira/browse/SPARK-17514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-17514. Resolution: Fixed Fix Version/s: 2.1.0 2.0.1 Issue resolved by pull request 15068 [https://github.com/apache/spark/pull/15068] > df.take(1) and df.limit(1).collect() perform differently in Python > -- > > Key: SPARK-17514 > URL: https://issues.apache.org/jira/browse/SPARK-17514 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Josh Rosen >Assignee: Josh Rosen > Fix For: 2.0.1, 2.1.0 > > > In PySpark, {{df.take(1)}} ends up running a single-stage job which computes > only one partition of {{df}}, while {{df.limit(1).collect()}} ends up > computing all partitions of {{df}} and runs a two-stage job. This difference > in performance is confusing, so I think that we should generalize the fix > from SPARK-10731 so that {{Dataset.collect()}} can be implemented efficiently > in Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17543) Missing log4j config file for tests in common/network-shuffle
Frederick Reiss created SPARK-17543: --- Summary: Missing log4j config file for tests in common/network-shuffle Key: SPARK-17543 URL: https://issues.apache.org/jira/browse/SPARK-17543 Project: Spark Issue Type: Bug Reporter: Frederick Reiss Priority: Trivial

*This is a small starter task to help new contributors practice the pull request and code review process.*

The Maven module {{common/network-shuffle}} does not have a log4j configuration file for its test cases. Usually these configuration files are located inside each module, in the directory {{src/test/resources}}. The missing configuration file leads to a scary-looking but harmless series of errors and stack traces in Spark build logs:

{noformat}
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
log4j:ERROR Could not read configuration file from URL [file:src/test/resources/log4j.properties].
java.io.FileNotFoundException: src/test/resources/log4j.properties (No such file or directory)
	at java.io.FileInputStream.open(Native Method)
	at java.io.FileInputStream.<init>(FileInputStream.java:146)
	at java.io.FileInputStream.<init>(FileInputStream.java:101)
	at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
	at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
	at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
	at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
	at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
	at org.apache.log4j.Logger.getLogger(Logger.java:104)
	at io.netty.util.internal.logging.Log4JLoggerFactory.newInstance(Log4JLoggerFactory.java:29)
	at io.netty.util.internal.logging.InternalLoggerFactory.newDefaultFactory(InternalLoggerFactory.java:46)
	at io.netty.util.internal.logging.InternalLoggerFactory.<clinit>(InternalLoggerFactory.java:34)
	...
{noformat}
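The fix is typically a one-file addition. A plausible {{src/test/resources/log4j.properties}} for the module, modeled on the test logging configs Spark's other modules use (the exact appender, file name, and pattern here are assumptions, not the eventual committed file):

```properties
# Route test logging to a file so build output stays quiet
log4j.rootLogger=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.append=true
log4j.appender.file.file=target/unit-tests.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss.SSS} %t %p %c{1}: %m%n
```

With any valid log4j.properties on the test classpath, both the SLF4J "StaticLoggerBinder" noise and the FileNotFoundException above go away.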
[jira] [Created] (SPARK-17542) Compiler warning in UnsafeInMemorySorter class
Frederick Reiss created SPARK-17542: --- Summary: Compiler warning in UnsafeInMemorySorter class Key: SPARK-17542 URL: https://issues.apache.org/jira/browse/SPARK-17542 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Frederick Reiss Priority: Trivial

*This is a small starter task to help new contributors practice the pull request and code review process.*

When building Spark with Java 8, there is a compiler warning in the class UnsafeInMemorySorter:

{noformat}
[WARNING] .../core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:284: warning: [static]
static method should be qualified by type name, TaskMemoryManager, instead of by an expression
{noformat}

Spark should compile without these kinds of warnings.
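The warning in question is javac's [static] lint: it fires when a static method is invoked through an instance expression rather than the type name. A self-contained illustration of the pattern and the fix (the class and method names below are generic stand-ins, not Spark's actual code):

```java
// Demonstrates javac's [static] lint warning: calling a static method
// through an instance expression instead of through the type name.
class MemoryManager {
    static long pageSizeBytes() { return 4096; }
}

public class StaticCallDemo {
    public static void main(String[] args) {
        MemoryManager mgr = new MemoryManager();
        // javac -Xlint:static warns here: "static method should be
        // qualified by type name, MemoryManager, instead of by an expression"
        long viaInstance = mgr.pageSizeBytes();
        // The fix is purely syntactic: qualify by the type name.
        long viaType = MemoryManager.pageSizeBytes();
        if (viaInstance != viaType) throw new AssertionError();
        System.out.println(viaType);
    }
}
```

Both calls dispatch to the same static method; the warning is about readability, since the instance expression misleadingly suggests dynamic dispatch.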
[jira] [Commented] (SPARK-17541) fix some DDL bugs about table management when same-name temp view exists
[ https://issues.apache.org/jira/browse/SPARK-17541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15490922#comment-15490922 ] Apache Spark commented on SPARK-17541: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/15099

> fix some DDL bugs about table management when same-name temp view exists
>
> Key: SPARK-17541
> URL: https://issues.apache.org/jira/browse/SPARK-17541
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Wenchen Fan
> Assignee: Wenchen Fan
[jira] [Assigned] (SPARK-17541) fix some DDL bugs about table management when same-name temp view exists
[ https://issues.apache.org/jira/browse/SPARK-17541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17541: Assignee: Wenchen Fan (was: Apache Spark)

> fix some DDL bugs about table management when same-name temp view exists
>
> Key: SPARK-17541
> URL: https://issues.apache.org/jira/browse/SPARK-17541
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Wenchen Fan
> Assignee: Wenchen Fan
[jira] [Assigned] (SPARK-17541) fix some DDL bugs about table management when same-name temp view exists
[ https://issues.apache.org/jira/browse/SPARK-17541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17541: Assignee: Apache Spark (was: Wenchen Fan)

> fix some DDL bugs about table management when same-name temp view exists
>
> Key: SPARK-17541
> URL: https://issues.apache.org/jira/browse/SPARK-17541
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: Wenchen Fan
> Assignee: Apache Spark
[jira] [Created] (SPARK-17541) fix some DDL bugs about table management when same-name temp view exists
Wenchen Fan created SPARK-17541: --- Summary: fix some DDL bugs about table management when same-name temp view exists Key: SPARK-17541 URL: https://issues.apache.org/jira/browse/SPARK-17541 Project: Spark Issue Type: Bug Components: SQL Reporter: Wenchen Fan Assignee: Wenchen Fan
[jira] [Assigned] (SPARK-17536) Minor performance improvement to JDBC batch inserts
[ https://issues.apache.org/jira/browse/SPARK-17536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17536: Assignee: (was: Apache Spark) > Minor performance improvement to JDBC batch inserts > --- > > Key: SPARK-17536 > URL: https://issues.apache.org/jira/browse/SPARK-17536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: John Muller >Priority: Trivial > Labels: perfomance > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > JDBC batch inserts currently are set to repeatedly retrieve the number of > fields inside the row iterator: > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L598 > val numFields = rddSchema.fields.length > This value does not change and can be set prior to the loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17536) Minor performance improvement to JDBC batch inserts
[ https://issues.apache.org/jira/browse/SPARK-17536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15490849#comment-15490849 ] Apache Spark commented on SPARK-17536: -- User 'blue666man' has created a pull request for this issue: https://github.com/apache/spark/pull/15098 > Minor performance improvement to JDBC batch inserts > --- > > Key: SPARK-17536 > URL: https://issues.apache.org/jira/browse/SPARK-17536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: John Muller >Priority: Trivial > Labels: perfomance > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > JDBC batch inserts currently are set to repeatedly retrieve the number of > fields inside the row iterator: > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L598 > val numFields = rddSchema.fields.length > This value does not change and can be set prior to the loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17536) Minor performance improvement to JDBC batch inserts
[ https://issues.apache.org/jira/browse/SPARK-17536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17536: Assignee: Apache Spark > Minor performance improvement to JDBC batch inserts > --- > > Key: SPARK-17536 > URL: https://issues.apache.org/jira/browse/SPARK-17536 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0 >Reporter: John Muller >Assignee: Apache Spark >Priority: Trivial > Labels: perfomance > Original Estimate: 0.5h > Remaining Estimate: 0.5h > > JDBC batch inserts currently are set to repeatedly retrieve the number of > fields inside the row iterator: > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L598 > val numFields = rddSchema.fields.length > This value does not change and can be set prior to the loop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
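The change proposed in SPARK-17536 is plain loop-invariant hoisting: the field count of a fixed schema never changes, so it should be computed once before the loop rather than re-derived per row. A minimal stand-alone sketch of the idea (illustrative data, not Spark's actual JdbcUtils code):

```java
import java.util.Arrays;
import java.util.List;

// Loop-invariant hoisting: compute a value that cannot change across
// iterations (here, the number of fields per row) once, outside the loop.
public class HoistDemo {
    public static void main(String[] args) {
        List<int[]> rows = Arrays.asList(new int[]{1, 2, 3}, new int[]{4, 5, 6});
        final int numFields = rows.get(0).length; // hoisted: fixed for all rows
        long sum = 0;
        for (int[] row : rows) {
            for (int i = 0; i < numFields; i++) { // no per-row length lookup
                sum += row[i];
            }
        }
        System.out.println(sum);
    }
}
```

The win per iteration is tiny, which is why the JIRA is marked Trivial, but in a batch-insert hot loop it removes a repeated field access for free.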
[jira] [Commented] (SPARK-15835) The read path of json doesn't support write path when schema contains Options
[ https://issues.apache.org/jira/browse/SPARK-15835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15490836#comment-15490836 ] Chris Horn commented on SPARK-15835: You can work around this issue by providing the schema StructType up front to the JSON reader. This also has the added benefit of not eagerly scanning the entire JSON data set to derive a schema.

{code}
scala> spark.read.schema(org.apache.spark.sql.Encoders.product[Bug].schema).json(path).as[Bug].collect
res8: Array[Bug] = Array(Bug(abc,None))

scala> spark.read.schema(org.apache.spark.sql.Encoders.product[Bug].schema).json(path).collect
res9: Array[org.apache.spark.sql.Row] = Array([abc,null])
{code}

> The read path of json doesn't support write path when schema contains Options
>
> Key: SPARK-15835
> URL: https://issues.apache.org/jira/browse/SPARK-15835
> Project: Spark
> Issue Type: Bug
> Reporter: Burak Yavuz
>
> my schema contains optional fields. When these fields are written in json (and all of these records are None), the field will be omitted during writes. When reading, these fields can't be found and this throws an exception.
> Either during writes the fields should be included as `null`, or the Dataset should not require the field to exist in the DataFrame if the field is an Option (which may be a better solution).
> {code}
> case class Bug(field1: String, field2: Option[String])
> Seq(Bug("abc", None)).toDS.write.json("/tmp/sqlBug")
> spark.read.json("/tmp/sqlBug").as[Bug]
> {code}
> stack trace:
> {code}
> org.apache.spark.sql.AnalysisException: cannot resolve '`field2`' given input columns: [field1]
>   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:62)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:59)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:287)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:68)
> {code}
[jira] [Assigned] (SPARK-17540) SparkR array serde cannot work correctly when array length == 0
[ https://issues.apache.org/jira/browse/SPARK-17540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17540: Assignee: Apache Spark > SparkR array serde cannot work correctly when array length == 0 > --- > > Key: SPARK-17540 > URL: https://issues.apache.org/jira/browse/SPARK-17540 > Project: Spark > Issue Type: Bug > Components: Spark Core, SparkR >Affects Versions: 2.1.0 >Reporter: Weichen Xu >Assignee: Apache Spark > > SparkR cannot handle array serde when array length == 0 > when length = 0 > R side set the element type as class("somestring") > so that scala side code receive it as a string array, > but the array we need to transfer may be other types, > it will cause problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-17540) SparkR array serde cannot work correctly when array length == 0
[ https://issues.apache.org/jira/browse/SPARK-17540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-17540: Assignee: (was: Apache Spark) > SparkR array serde cannot work correctly when array length == 0 > --- > > Key: SPARK-17540 > URL: https://issues.apache.org/jira/browse/SPARK-17540 > Project: Spark > Issue Type: Bug > Components: Spark Core, SparkR >Affects Versions: 2.1.0 >Reporter: Weichen Xu > > SparkR cannot handle array serde when array length == 0 > when length = 0 > R side set the element type as class("somestring") > so that scala side code receive it as a string array, > but the array we need to transfer may be other types, > it will cause problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17540) SparkR array serde cannot work correctly when array length == 0
[ https://issues.apache.org/jira/browse/SPARK-17540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15490790#comment-15490790 ] Apache Spark commented on SPARK-17540: -- User 'WeichenXu123' has created a pull request for this issue: https://github.com/apache/spark/pull/15097 > SparkR array serde cannot work correctly when array length == 0 > --- > > Key: SPARK-17540 > URL: https://issues.apache.org/jira/browse/SPARK-17540 > Project: Spark > Issue Type: Bug > Components: Spark Core, SparkR >Affects Versions: 2.1.0 >Reporter: Weichen Xu > > SparkR cannot handle array serde when array length == 0 > when length = 0 > R side set the element type as class("somestring") > so that scala side code receive it as a string array, > but the array we need to transfer may be other types, > it will cause problem. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-17540) SparkR array serde cannot work correctly when array length == 0
Weichen Xu created SPARK-17540: -- Summary: SparkR array serde cannot work correctly when array length == 0 Key: SPARK-17540 URL: https://issues.apache.org/jira/browse/SPARK-17540 Project: Spark Issue Type: Bug Components: Spark Core, SparkR Affects Versions: 2.1.0 Reporter: Weichen Xu

SparkR cannot handle array serde when the array length == 0. When the length is 0, the R side sets the element type to class("somestring"), so the Scala side receives it as a string array; but the array being transferred may have a different element type, which causes problems.
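The pitfall behind this bug is general: inferring an array's element type by peeking at its first element has no answer when the array is empty, so the declared type must travel with the data instead of being guessed. A generic sketch of the distinction (not SparkR's serde code):

```java
import java.util.Collections;
import java.util.List;

// Element-type discovery by inspection works only for non-empty arrays;
// for an empty array the serializer must fall back to declared metadata.
public class EmptyArrayDemo {
    static String elementType(List<?> xs, String declaredType) {
        if (!xs.isEmpty()) {
            return xs.get(0).getClass().getSimpleName(); // inspect first element
        }
        return declaredType; // empty: use the declared type, never a guess
    }

    public static void main(String[] args) {
        System.out.println(elementType(List.of(1, 2), "Integer"));
        System.out.println(elementType(Collections.emptyList(), "Double"));
    }
}
```

SparkR's bug is exactly the missing fallback: with length 0 it hard-codes a string element type rather than carrying the real one.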
[jira] [Commented] (SPARK-17535) Performance Improvement of Singleton pattern in SparkContext
[ https://issues.apache.org/jira/browse/SPARK-17535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15490737#comment-15490737 ] Sean Owen commented on SPARK-17535: --- I think this happens to work out because the context is held in an AtomicReference, and it's set as the very last statement in all paths. I am not sure it's worth changing and making this more complex unless there's a compelling reason, though, like evidence this is heavily contended. For example, if later something else occurred in the synchronized block after the context was set, you'd be able to get into strange situations where the reference is available but the synchronized init hasn't completed.

> Performance Improvement of Singleton pattern in SparkContext
>
> Key: SPARK-17535
> URL: https://issues.apache.org/jira/browse/SPARK-17535
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.0.0
> Reporter: WangJianfei
> Labels: easyfix, performance
>
> I think the singleton pattern of SparkContext is inefficient if there are many requests to get the SparkContext, so we can write the singleton pattern as below. The second way is more efficient when there are many requests to get the SparkContext.
> {code}
> // the current version
> def getOrCreate(): SparkContext = {
>   SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized {
>     if (activeContext.get() == null) {
>       setActiveContext(new SparkContext(), allowMultipleContexts = false)
>     }
>     activeContext.get()
>   }
> }
>
> // by myself
> def getOrCreate(): SparkContext = {
>   if (activeContext.get() == null) {
>     SPARK_CONTEXT_CONSTRUCTOR_LOCK.synchronized {
>       if (activeContext.get == null) {
>         @volatile val sparkContext = new SparkContext()
>         setActiveContext(sparkContext, allowMultipleContexts = false)
>       }
>     }
>   }
>   activeContext.get()
> }
> {code}
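Sean's caveat can be made concrete. A safe double-checked version keeps the value in an AtomicReference, takes the lock only on the slow path, and publishes the instance as the final step of the critical section. This is a generic sketch of the pattern under discussion, not SparkContext's actual code:

```java
import java.util.concurrent.atomic.AtomicReference;

// Double-checked lazy init over an AtomicReference: a lock-free fast path
// once initialized, and publication only after construction completes.
public class LazyContext {
    private static final AtomicReference<LazyContext> ACTIVE = new AtomicReference<>();
    private static final Object LOCK = new Object();

    public static LazyContext getOrCreate() {
        LazyContext ctx = ACTIVE.get();
        if (ctx != null) return ctx;          // fast path: no lock contention
        synchronized (LOCK) {
            if (ACTIVE.get() == null) {
                // Setting the reference must be the last statement in the
                // critical section, or another thread could observe the
                // instance before initialization work after it finishes.
                ACTIVE.set(new LazyContext());
            }
            return ACTIVE.get();
        }
    }

    public static void main(String[] args) {
        if (getOrCreate() != getOrCreate()) throw new AssertionError();
        System.out.println("same instance");
    }
}
```

Note this only buys anything if the uncontended-read cost of the synchronized block is actually measurable, which is the "heavily contended" evidence Sean asks for.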
[jira] [Updated] (SPARK-17538) sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0
[ https://issues.apache.org/jira/browse/SPARK-17538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinivas Rishindra Pothireddi updated SPARK-17538: -- Description: I have a production job in spark 1.6.2 that registers four dataframes as tables. After testing the job in spark 2.0.0 one of the dataframes is not getting registered as a table. output of sqlContext.tableNames() just after registering the fourth dataframe in spark 1.6.2 is temp1,temp2,temp3,temp4 output of sqlContext.tableNames() just after registering the fourth dataframe in spark 2.0.0 is temp1,temp2,temp3 so when the table 'temp4' is used by the job at a later stage an AnalysisException is raised in spark 2.0.0 There are no changes in the code whatsoever. was: I have a production job in spark 1.6.2 that registers four dataframes as tables. After testing the job in spark 2.0.0 one of the dataframes is not getting registered as a table. output of sqlContext.tableNames() just after registering the fourth dataframe in spark 1.6.2 is temp1,temp2,temp3,temp4 output of sqlContext.tableNames() just after registering the fourth dataframe in spark 2.0.0 is temp1,temp2,temp3 so when the table temp4 is used by the job at a later stage an AnalysisException is raised in spark 2.0.0 There are no changes in the code whatsoever. > sqlContext.registerDataFrameAsTable is not working sometimes in pyspark 2.0.0 > - > > Key: SPARK-17538 > URL: https://issues.apache.org/jira/browse/SPARK-17538 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0, 2.0.1, 2.1.0 > Environment: os - linux > cluster -> yarn and local >Reporter: Srinivas Rishindra Pothireddi >Priority: Critical > Fix For: 2.0.1, 2.1.0 > > > I have a production job in spark 1.6.2 that registers four dataframes as > tables. After testing the job in spark 2.0.0 one of the dataframes is not > getting registered as a table. 
> output of sqlContext.tableNames() just after registering the fourth dataframe > in spark 1.6.2 is > temp1,temp2,temp3,temp4 > output of sqlContext.tableNames() just after registering the fourth dataframe > in spark 2.0.0 is > temp1,temp2,temp3 > so when the table 'temp4' is used by the job at a later stage an > AnalysisException is raised in spark 2.0.0 > There are no changes in the code whatsoever. > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17510) Set Streaming MaxRate Independently For Multiple Streams
[ https://issues.apache.org/jira/browse/SPARK-17510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15490746#comment-15490746 ] Jeff Nadler commented on SPARK-17510: - I filed SPARK-17539 for the backpressure bug. > Set Streaming MaxRate Independently For Multiple Streams > > > Key: SPARK-17510 > URL: https://issues.apache.org/jira/browse/SPARK-17510 > Project: Spark > Issue Type: Improvement > Components: Streaming >Affects Versions: 2.0.0 >Reporter: Jeff Nadler > > We use multiple DStreams coming from different Kafka topics in a Streaming > application. > Some settings like maxrate and backpressure enabled/disabled would be better > passed as config to KafkaUtils.createStream and > KafkaUtils.createDirectStream, instead of setting them in SparkConf. > Being able to set a different maxrate for different streams is an important > requirement for us; we currently work-around the problem by using one > receiver-based stream and one direct stream. > We would like to be able to turn on backpressure for only one of the streams > as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org