[jira] [Assigned] (SPARK-19751) Create Data frame API fails with a self referencing bean
[ https://issues.apache.org/jira/browse/SPARK-19751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19751:

    Assignee: (was: Apache Spark)

> Create Data frame API fails with a self referencing bean
> ---------------------------------------------------------
>
>                 Key: SPARK-19751
>                 URL: https://issues.apache.org/jira/browse/SPARK-19751
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Avinash Venkateshaiah
>            Priority: Minor
>
> The createDataset API throws a stack overflow exception when we try to create a Dataset using a bean encoder and the bean is self-referencing.
>
> BEAN:
> public class HierObj implements Serializable {
>     String name;
>     List<HierObj> children;
>     public String getName() {
>         return name;
>     }
>     public void setName(String name) {
>         this.name = name;
>     }
>     public List<HierObj> getChildren() {
>         return children;
>     }
>     public void setChildren(List<HierObj> children) {
>         this.children = children;
>     }
> }
>
> // create an object
> HierObj hierObj = new HierObj();
> hierObj.setName("parent");
> List<HierObj> children = new ArrayList<>();
> HierObj child1 = new HierObj();
> child1.setName("child1");
> HierObj child2 = new HierObj();
> child2.setName("child2");
> children.add(child1);
> children.add(child2);
> hierObj.setChildren(children);
>
> // create a dataset
> Dataset<HierObj> ds = sparkSession().createDataset(Arrays.asList(hierObj),
>     Encoders.bean(HierObj.class));

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19751) Create Data frame API fails with a self referencing bean
[ https://issues.apache.org/jira/browse/SPARK-19751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19751: Assignee: Apache Spark > Create Data frame API fails with a self referencing bean > > > Key: SPARK-19751 > URL: https://issues.apache.org/jira/browse/SPARK-19751 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Avinash Venkateshaiah >Assignee: Apache Spark >Priority: Minor > > createDataset API throws a stack overflow exception when we try creating a > Dataset using a bean encoder. The bean is self referencing > BEAN: > public class HierObj implements Serializable { > String name; > List children; > public String getName() { > return name; > } > public void setName(String name) { > this.name = name; > } > public List getChildren() { > return children; > } > public void setChildren(List children) { > this.children = children; > } > } > // create an object > HierObj hierObj = new HierObj(); > hierObj.setName("parent"); > List children = new ArrayList(); > HierObj child1 = new HierObj(); > child1.setName("child1"); > HierObj child2 = new HierObj(); > child2.setName("child2"); > children.add(child1); > children.add(child2); > hierObj.setChildren(children); > // create a dataset > Dataset ds = sparkSession().createDataset(Arrays.asList(hierObj), > Encoders.bean(HierObj.class)); -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19751) Create Data frame API fails with a self referencing bean
[ https://issues.apache.org/jira/browse/SPARK-19751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898967#comment-15898967 ] Apache Spark commented on SPARK-19751: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/17188 > Create Data frame API fails with a self referencing bean > > > Key: SPARK-19751 > URL: https://issues.apache.org/jira/browse/SPARK-19751 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Avinash Venkateshaiah >Priority: Minor > > createDataset API throws a stack overflow exception when we try creating a > Dataset using a bean encoder. The bean is self referencing > BEAN: > public class HierObj implements Serializable { > String name; > List children; > public String getName() { > return name; > } > public void setName(String name) { > this.name = name; > } > public List getChildren() { > return children; > } > public void setChildren(List children) { > this.children = children; > } > } > // create an object > HierObj hierObj = new HierObj(); > hierObj.setName("parent"); > List children = new ArrayList(); > HierObj child1 = new HierObj(); > child1.setName("child1"); > HierObj child2 = new HierObj(); > child2.setName("child2"); > children.add(child1); > children.add(child2); > hierObj.setChildren(children); > // create a dataset > Dataset ds = sparkSession().createDataset(Arrays.asList(hierObj), > Encoders.bean(HierObj.class)); -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19848) Regex Support in StopWordsRemover
Mohd Suaib Danish created SPARK-19848: - Summary: Regex Support in StopWordsRemover Key: SPARK-19848 URL: https://issues.apache.org/jira/browse/SPARK-19848 Project: Spark Issue Type: Wish Components: ML Affects Versions: 2.1.0 Reporter: Mohd Suaib Danish Can we have regex feature in StopWordsRemover in addition to the provided list of stop words? Use cases can be following: 1. Remove all single or double letter words- [a-zA-Z]{1,2} 2. Remove anything starting with abc - ^abc -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19848) Regex Support in StopWordsRemover
[ https://issues.apache.org/jira/browse/SPARK-19848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mohd Suaib Danish updated SPARK-19848: -- Description: Can we have regex feature in StopWordsRemover in addition to the provided list of stop words? Use cases can be following: 1. Remove all single or double letter words- [a-zA-Z]{1,2} 2. Remove anything starting with abc - ^abc was: Can we have regex feature in StopWordsRemover in addition to the provided list of stop words? Use cases can be following: 1. Remove all single or double letter words- [a-zA-Z]{1,2} 2. Remove anything starting with abc - ^abc > Regex Support in StopWordsRemover > - > > Key: SPARK-19848 > URL: https://issues.apache.org/jira/browse/SPARK-19848 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.1.0 >Reporter: Mohd Suaib Danish > > Can we have regex feature in StopWordsRemover in addition to the provided > list of stop words? > Use cases can be following: > 1. Remove all single or double letter words- [a-zA-Z]{1,2} > 2. Remove anything starting with abc - ^abc -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19829) The log about driver should support rolling like executor
[ https://issues.apache.org/jira/browse/SPARK-19829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899228#comment-15899228 ] Sean Owen commented on SPARK-19829:
-----------------------------------

Why wouldn't log4j be a good solution here? Its purpose is managing logs, and it can define rolling appenders.

> The log about driver should support rolling like executor
> ----------------------------------------------------------
>
>                 Key: SPARK-19829
>                 URL: https://issues.apache.org/jira/browse/SPARK-19829
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: hustfxj
>            Priority: Minor
>
> We should roll the driver's log; otherwise the log may grow very large.
> {code:title=DriverRunner.scala|borderStyle=solid}
> // modify runDriver
> private def runDriver(builder: ProcessBuilder, baseDir: File, supervise: Boolean): Int = {
>   builder.directory(baseDir)
>   def initialize(process: Process): Unit = {
>     // Redirect stdout and stderr to files -- the old code
>     // val stdout = new File(baseDir, "stdout")
>     // CommandUtils.redirectStream(process.getInputStream, stdout)
>     //
>     // val stderr = new File(baseDir, "stderr")
>     // val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", "\"")
>     // val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 40)
>     // Files.append(header, stderr, StandardCharsets.UTF_8)
>     // CommandUtils.redirectStream(process.getErrorStream, stderr)
>
>     // Redirect its stdout and stderr to files -- support rolling
>     val stdout = new File(baseDir, "stdout")
>     stdoutAppender = FileAppender(process.getInputStream, stdout, conf)
>
>     val stderr = new File(baseDir, "stderr")
>     val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", "\"")
>     val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 40)
>     Files.append(header, stderr, StandardCharsets.UTF_8)
>     stderrAppender = FileAppender(process.getErrorStream, stderr, conf)
>   }
>   runCommandWithRetry(ProcessBuilderLike(builder), initialize, supervise)
> }
> {code}

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
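[Editor's note] For reference, Sean's suggestion amounts to pointing the driver at a log4j configuration that uses a rolling appender. A minimal sketch of such a conf/log4j.properties, assuming the log4j 1.x bundled with Spark; the file path, size limit and backup count are placeholders:

{code}
# Roll the driver log by size instead of letting it grow unbounded
log4j.rootCategory=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.File=/var/log/spark/driver.log
log4j.appender.rolling.MaxFileSize=50MB
log4j.appender.rolling.MaxBackupIndex=5
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
{code}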
[jira] [Updated] (SPARK-19810) Remove support for Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-19810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-19810: -- Target Version/s: 2.3.0 > Remove support for Scala 2.10 > - > > Key: SPARK-19810 > URL: https://issues.apache.org/jira/browse/SPARK-19810 > Project: Spark > Issue Type: Task > Components: ML, Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Critical > > This tracks the removal of Scala 2.10 support, as discussed in > http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-td19553.html > and other lists. > The primary motivations are to simplify the code and build, and to enable > Scala 2.12 support later. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-14220: -- Affects Version/s: 2.1.0 Target Version/s: 2.3.0 > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.12 milestone. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19848) Regex Support in StopWordsRemover
[ https://issues.apache.org/jira/browse/SPARK-19848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-19848: -- Priority: Minor (was: Major) Issue Type: Improvement (was: Wish) If it's more of a question, ask on the mailing list. It's not unreasonable, but what's the use case? stopwords are, well, words and not classes of words. You can always filter text in your own code as you like, and this is not a general-purpose tool for removing strings. > Regex Support in StopWordsRemover > - > > Key: SPARK-19848 > URL: https://issues.apache.org/jira/browse/SPARK-19848 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Mohd Suaib Danish >Priority: Minor > > Can we have regex feature in StopWordsRemover in addition to the provided > list of stop words? > Use cases can be following: > 1. Remove all single or double letter words- [a-zA-Z]{1,2} > 2. Remove anything starting with abc - ^abc -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
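[Editor's note] To make the "filter text in your own code" route concrete, a minimal Scala sketch using a UDF over a tokenized column; the sample data, column name and patterns are hypothetical, and an active SparkSession named {{spark}} (as in spark-shell) is assumed:

{code}
import org.apache.spark.sql.functions.{col, udf}
import spark.implicits._  // assumes an active SparkSession named `spark`

// A toy tokenized DataFrame standing in for the output of Tokenizer / StopWordsRemover.
val tokenized = Seq(Seq("an", "abcdef", "example", "of", "token", "filtering")).toDF("tokens")

// Drop tokens matching any unwanted pattern, e.g. one- or two-letter words
// and anything starting with "abc".
val patterns = Seq("^[a-zA-Z]{1,2}$", "^abc.*")
val dropByRegex = udf { tokens: Seq[String] =>
  tokens.filterNot(t => patterns.exists(p => t.matches(p)))
}

tokenized.withColumn("tokens", dropByRegex(col("tokens"))).show(false)
// expected tokens: example, token, filtering
{code}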
[jira] [Comment Edited] (SPARK-19848) Regex Support in StopWordsRemover
[ https://issues.apache.org/jira/browse/SPARK-19848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899250#comment-15899250 ] Nick Pentreath edited comment on SPARK-19848 at 3/7/17 11:06 AM: - This seems more like a "token filter" in a sense. Perhaps the ML pipeline components mentioned [here|https://lucidworks.com/2016/04/13/spark-solr-lucenetextanalyzer/] may be of use? was (Author: mlnick): Perhaps the ML pipeline components mentioned [here|https://lucidworks.com/2016/04/13/spark-solr-lucenetextanalyzer/] may be of use? > Regex Support in StopWordsRemover > - > > Key: SPARK-19848 > URL: https://issues.apache.org/jira/browse/SPARK-19848 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Mohd Suaib Danish >Priority: Minor > > Can we have regex feature in StopWordsRemover in addition to the provided > list of stop words? > Use cases can be following: > 1. Remove all single or double letter words- [a-zA-Z]{1,2} > 2. Remove anything starting with abc - ^abc -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19848) Regex Support in StopWordsRemover
[ https://issues.apache.org/jira/browse/SPARK-19848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899250#comment-15899250 ] Nick Pentreath commented on SPARK-19848: Perhaps the ML pipeline components mentioned [here|https://lucidworks.com/2016/04/13/spark-solr-lucenetextanalyzer/] may be of use? > Regex Support in StopWordsRemover > - > > Key: SPARK-19848 > URL: https://issues.apache.org/jira/browse/SPARK-19848 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Mohd Suaib Danish >Priority: Minor > > Can we have regex feature in StopWordsRemover in addition to the provided > list of stop words? > Use cases can be following: > 1. Remove all single or double letter words- [a-zA-Z]{1,2} > 2. Remove anything starting with abc - ^abc -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19836) Customizable remote repository url for hive versions unit test
[ https://issues.apache.org/jira/browse/SPARK-19836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899292#comment-15899292 ] Song Jun commented on SPARK-19836:
----------------------------------

I have done something similar here: https://github.com/apache/spark/pull/16803

> Customizable remote repository url for hive versions unit test
> ---------------------------------------------------------------
>
>                 Key: SPARK-19836
>                 URL: https://issues.apache.org/jira/browse/SPARK-19836
>             Project: Spark
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 2.1.0
>            Reporter: Elek, Marton
>              Labels: ivy, unittest
>
> When the VersionSuite test runs from sql/hive it downloads different versions of Hive.
> Unfortunately the IsolatedClientClassloader (which is used by the VersionSuite) uses hardcoded, fixed repositories:
> {code}
> val classpath = quietly {
>   SparkSubmitUtils.resolveMavenCoordinates(
>     hiveArtifacts.mkString(","),
>     SparkSubmitUtils.buildIvySettings(
>       Some("http://www.datanucleus.org/downloads/maven2"),
>       ivyPath),
>     exclusions = version.exclusions)
> }
> {code}
> The problem is with the hard-coded repositories:
> 1. it's hard to run unit tests in an environment where only one internal maven repository is available (and central/datanucleus is not)
> 2. it's impossible to run unit tests against custom built hive/hadoop artifacts (which are not available from the central repository)
> VersionSuite already has a specific SPARK_VERSIONS_SUITE_IVY_PATH environment variable to define a custom local repository as the ivy cache.
> I suggest adding an additional environment variable (SPARK_VERSIONS_SUITE_IVY_REPOSITORIES to HiveClientBuilder.scala) to make it possible to add new remote repositories for testing the different hive versions.

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19836) Customizable remote repository url for hive versions unit test
[ https://issues.apache.org/jira/browse/SPARK-19836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-19836. --- Resolution: Duplicate > Customizable remote repository url for hive versions unit test > -- > > Key: SPARK-19836 > URL: https://issues.apache.org/jira/browse/SPARK-19836 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.1.0 >Reporter: Elek, Marton > Labels: ivy, unittest > > When the VersionSuite test runs from sql/hive it downloads different versions > from hive. > Unfortunately the IsolatedClientClassloader (which is used by the > VersionSuite) uses hardcoded fix repositories: > {code} > val classpath = quietly { > SparkSubmitUtils.resolveMavenCoordinates( > hiveArtifacts.mkString(","), > SparkSubmitUtils.buildIvySettings( > Some("http://www.datanucleus.org/downloads/maven2";), > ivyPath), > exclusions = version.exclusions) > } > {code} > The problem is with the hard-coded repositories: > 1. it's hard to run unit tests in an environment where only one internal > maven repository is available (and central/datanucleus is not) > 2. it's impossible to run unit tests against custom built hive/hadoop > artifacts (which are not available from the central repository) > VersionSuite has already a specific SPARK_VERSIONS_SUITE_IVY_PATH environment > variable to define a custom local repository as ivy cache. > I suggest to add an additional environment variable > (SPARK_VERSIONS_SUITE_IVY_REPOSITORIES to the HiveClientBuilder.scala), to > make it possible adding new remote repositories for testing the different > hive versions. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages
[ https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19831:

    Assignee: Apache Spark

> Sending the heartbeat master from worker maybe blocked by other rpc messages
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-19831
>                 URL: https://issues.apache.org/jira/browse/SPARK-19831
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: hustfxj
>            Assignee: Apache Spark
>            Priority: Minor
>
> Cleaning up an application may take a long time on the worker, and it blocks the worker from sending heartbeats to the master because the worker extends *ThreadSafeRpcEndpoint*. If the heartbeat from a worker is blocked by the *ApplicationFinished* message, the master will think the worker is dead, and if the worker has a driver, the driver will be scheduled by the master again. So I think this is a bug in Spark. The problem could be solved in the following ways:
> 1. It would be better to do the application cleanup in a separate asynchronous thread like 'cleanupThreadExecutor', so that it does not block other rpc messages such as *SendHeartbeat*.
> 2. It would be better not to trigger the heartbeat to the master from the *receive* method, because any other rpc message may block *receive* and the worker may then fail to send the heartbeat in time. Instead, the heartbeat to the master should be sent from an asynchronous timer thread.

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
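[Editor's note] A self-contained Scala sketch of suggestion 2 above: drive the heartbeat from a dedicated scheduler thread instead of the endpoint's single-threaded message loop. The {{sendHeartbeat}} callback and the names in the usage comment are stand-ins for the Worker's actual internals, not the real implementation:

{code}
import java.util.concurrent.{Executors, TimeUnit}

object HeartbeatSketch {
  // One dedicated thread, independent of the RPC message loop, so a slow
  // handler (e.g. ApplicationFinished cleanup) cannot delay heartbeats.
  private val heartbeater = Executors.newSingleThreadScheduledExecutor()

  def start(intervalMs: Long)(sendHeartbeat: () => Unit): Unit = {
    heartbeater.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = sendHeartbeat()
    }, 0L, intervalMs, TimeUnit.MILLISECONDS)
  }
}

// Hypothetical usage inside the Worker:
// HeartbeatSketch.start(heartbeatMillis)(() => master.send(Heartbeat(workerId, self)))
{code}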
[jira] [Assigned] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages
[ https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19831: Assignee: (was: Apache Spark) > Sending the heartbeat master from worker maybe blocked by other rpc messages > -- > > Key: SPARK-19831 > URL: https://issues.apache.org/jira/browse/SPARK-19831 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: hustfxj >Priority: Minor > > Cleaning the application may cost much time at worker, then it will block > that the worker send heartbeats master because the worker is extend > *ThreadSafeRpcEndpoint*. If the heartbeat from a worker is blocked by the > message *ApplicationFinished*, master will think the worker is dead. If the > worker has a driver, the driver will be scheduled by master again. So I think > it is the bug on spark. It may solve this problem by the followed suggests: > 1. It had better put the cleaning the application in a single asynchronous > thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages > like *SendHeartbeat*; > 2. It had better not receive the heartbeat master by *receive* method. > Because any other rpc message may block the *receive* method. Then worker > won't receive the heartbeat message timely. So it had better send the > heartbeat master at an asynchronous timing thread . -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19831) Sending the heartbeat master from worker maybe blocked by other rpc messages
[ https://issues.apache.org/jira/browse/SPARK-19831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899334#comment-15899334 ] Apache Spark commented on SPARK-19831: -- User 'hustfxj' has created a pull request for this issue: https://github.com/apache/spark/pull/17189 > Sending the heartbeat master from worker maybe blocked by other rpc messages > -- > > Key: SPARK-19831 > URL: https://issues.apache.org/jira/browse/SPARK-19831 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: hustfxj >Priority: Minor > > Cleaning the application may cost much time at worker, then it will block > that the worker send heartbeats master because the worker is extend > *ThreadSafeRpcEndpoint*. If the heartbeat from a worker is blocked by the > message *ApplicationFinished*, master will think the worker is dead. If the > worker has a driver, the driver will be scheduled by master again. So I think > it is the bug on spark. It may solve this problem by the followed suggests: > 1. It had better put the cleaning the application in a single asynchronous > thread like 'cleanupThreadExecutor'. Thus it won't block other rpc messages > like *SendHeartbeat*; > 2. It had better not receive the heartbeat master by *receive* method. > Because any other rpc message may block the *receive* method. Then worker > won't receive the heartbeat message timely. So it had better send the > heartbeat master at an asynchronous timing thread . -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19478) JDBC Sink
[ https://issues.apache.org/jira/browse/SPARK-19478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19478:

    Assignee: (was: Apache Spark)

> JDBC Sink
> ---------
>
>                 Key: SPARK-19478
>                 URL: https://issues.apache.org/jira/browse/SPARK-19478
>             Project: Spark
>          Issue Type: New Feature
>          Components: Structured Streaming
>    Affects Versions: 2.0.0
>            Reporter: Michael Armbrust
>
> A sink that transactionally commits data into a database using JDBC.

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19478) JDBC Sink
[ https://issues.apache.org/jira/browse/SPARK-19478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19478: Assignee: Apache Spark > JDBC Sink > - > > Key: SPARK-19478 > URL: https://issues.apache.org/jira/browse/SPARK-19478 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.0.0 >Reporter: Michael Armbrust >Assignee: Apache Spark > > A sink that transactionally commits data into a database use JDBC. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19478) JDBC Sink
[ https://issues.apache.org/jira/browse/SPARK-19478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899458#comment-15899458 ] Apache Spark commented on SPARK-19478: -- User 'GaalDornick' has created a pull request for this issue: https://github.com/apache/spark/pull/17190 > JDBC Sink > - > > Key: SPARK-19478 > URL: https://issues.apache.org/jira/browse/SPARK-19478 > Project: Spark > Issue Type: New Feature > Components: Structured Streaming >Affects Versions: 2.0.0 >Reporter: Michael Armbrust > > A sink that transactionally commits data into a database use JDBC. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
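[Editor's note] Until a built-in sink lands, one common workaround is a {{ForeachWriter}} that writes each row over JDBC. A rough, non-transactional Scala sketch (so it does not satisfy the exactly-once goal of this ticket); the URL, table name and two-column INSERT are placeholders:

{code}
import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.spark.sql.{ForeachWriter, Row}

class JdbcSinkWriter(url: String, table: String) extends ForeachWriter[Row] {
  private var conn: Connection = _
  private var stmt: PreparedStatement = _

  override def open(partitionId: Long, version: Long): Boolean = {
    conn = DriverManager.getConnection(url)
    stmt = conn.prepareStatement(s"INSERT INTO $table VALUES (?, ?)")
    true  // process this partition for this epoch
  }

  override def process(row: Row): Unit = {
    stmt.setObject(1, row.get(0))
    stmt.setObject(2, row.get(1))
    stmt.executeUpdate()
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (stmt != null) stmt.close()
    if (conn != null) conn.close()
  }
}

// Hypothetical usage:
// streamingDF.writeStream.foreach(new JdbcSinkWriter(jdbcUrl, "events")).start()
{code}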
[jira] [Created] (SPARK-19849) Support ArrayType in to_json function/expression
Hyukjin Kwon created SPARK-19849:
------------------------------------

             Summary: Support ArrayType in to_json function/expression
                 Key: SPARK-19849
                 URL: https://issues.apache.org/jira/browse/SPARK-19849
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 2.2.0
            Reporter: Hyukjin Kwon

After SPARK-19595, we could

{code}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
Seq(("""[{"a": 1}, {"a": 2}]""")).toDF("array").select(from_json(col("array"), schema)).show()
{code}

Maybe, it'd be better if we provide a way to read it back via {{to_json}} as below:

{code}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
val df = Seq("""[{"a":1}, {"a":2}]""").toDF("json").select(from_json($"json", schema).as("array"))
df.show()

// Read back.
df.select(to_json($"array").as("json")).show()
{code}

{code}
+----------+
|     array|
+----------+
|[[1], [2]]|
+----------+

+-----------------+
|             json|
+-----------------+
|[{"a":1},{"a":2}]|
+-----------------+
{code}

Currently, it throws an exception as below:

{code}
org.apache.spark.sql.AnalysisException: cannot resolve 'structtojson(`array`)' due to data type mismatch: structtojson requires that the expression is a struct expression.;;
'Project [structtojson(array#30, Some(Asia/Seoul)) AS structtojson(array)#45]
+- Project [jsontostruct(ArrayType(StructType(StructField(a,IntegerType,true)),true), array#27, Some(Asia/Seoul)) AS array#30]
   +- Project [value#25 AS array#27]
      +- LocalRelation [value#25]

  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:80)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:72)
{code}

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19850) Support aliased expressions in function parameters
Herman van Hovell created SPARK-19850:
--------------------------------------

             Summary: Support aliased expressions in function parameters
                 Key: SPARK-19850
                 URL: https://issues.apache.org/jira/browse/SPARK-19850
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.1.0
            Reporter: Herman van Hovell
            Assignee: Herman van Hovell

The SQL parser currently does not allow a user to pass an aliased expression as a function parameter. This can be useful if we want to create a struct. For example, {{select struct(a + 1 as c, b + 4 as d) from tbl_x}} would create a struct with columns {{c}} and {{d}} instead of {{col1}} and {{col2}}.

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
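[Editor's note] For context, the DataFrame API can already produce this result, since {{struct}} picks up Column aliases as field names. A minimal sketch, assuming an active SparkSession named {{spark}}; the sample data is made up:

{code}
import org.apache.spark.sql.functions.struct
import spark.implicits._  // assumes an active SparkSession named `spark`

val tbl_x = Seq((1, 2), (3, 4)).toDF("a", "b")

// Aliased expressions inside struct() become the field names,
// yielding struct<c:int, d:int> rather than col1/col2.
val withStruct = tbl_x.select(struct(($"a" + 1).as("c"), ($"b" + 4).as("d")).as("s"))
withStruct.printSchema()
{code}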
[jira] [Updated] (SPARK-19850) Support aliased expressions in function parameters
[ https://issues.apache.org/jira/browse/SPARK-19850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hovell updated SPARK-19850: -- Issue Type: Improvement (was: Bug) > Support aliased expressions in function parameters > -- > > Key: SPARK-19850 > URL: https://issues.apache.org/jira/browse/SPARK-19850 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Herman van Hovell >Assignee: Herman van Hovell > > The SQL parser currently does not allow a user to pass an aliased expression > as function parameter. This can be useful if we want to create a struct. For > example {{select struct(a + 1 as c, b + 4 as d) from tbl_x}} would create a > struct with columns {{c}} and {{d}}, instead {{col1}} and {{col2}}. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14471) The alias created in SELECT could be used in GROUP BY and followed expressions
[ https://issues.apache.org/jira/browse/SPARK-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899584#comment-15899584 ] Apache Spark commented on SPARK-14471: -- User 'maropu' has created a pull request for this issue: https://github.com/apache/spark/pull/17191 > The alias created in SELECT could be used in GROUP BY and followed expressions > -- > > Key: SPARK-14471 > URL: https://issues.apache.org/jira/browse/SPARK-14471 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > > This query should be able to run: > {code} > select a a1, a1 + 1 as b, count(1) from t group by a1 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14471) The alias created in SELECT could be used in GROUP BY and followed expressions
[ https://issues.apache.org/jira/browse/SPARK-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14471: Assignee: (was: Apache Spark) > The alias created in SELECT could be used in GROUP BY and followed expressions > -- > > Key: SPARK-14471 > URL: https://issues.apache.org/jira/browse/SPARK-14471 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu > > This query should be able to run: > {code} > select a a1, a1 + 1 as b, count(1) from t group by a1 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-14471) The alias created in SELECT could be used in GROUP BY and followed expressions
[ https://issues.apache.org/jira/browse/SPARK-14471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-14471: Assignee: Apache Spark > The alias created in SELECT could be used in GROUP BY and followed expressions > -- > > Key: SPARK-14471 > URL: https://issues.apache.org/jira/browse/SPARK-14471 > Project: Spark > Issue Type: New Feature > Components: SQL >Reporter: Davies Liu >Assignee: Apache Spark > > This query should be able to run: > {code} > select a a1, a1 + 1 as b, count(1) from t group by a1 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15463) Support for creating a dataframe from CSV in Dataset[String]
[ https://issues.apache.org/jira/browse/SPARK-15463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899665#comment-15899665 ] Jayesh lalwani commented on SPARK-15463:
----------------------------------------

Does it make sense to have a to_csv and a from_csv function that is modeled after to_json and from_json?

The applications that we are supporting need inputs from a combination of sources and formats, and write to a combination of sinks and formats. For example, we might need
a) Files with CSV content
b) Files with JSON content
c) Kafka with CSV content
d) Kafka with JSON content
e) Parquet

Also, if the input has a nested structure (JSON/Parquet), sometimes we prefer keeping the data in a StructType object, and sometimes we prefer to flatten the StructType object into a dataframe. For example, if we are getting data from Kafka as JSON, massaging it, and writing JSON back to Kafka, we would prefer to be able to transform a StructType object and not have to flatten it into a dataframe. Another example is that we get data from JSON that needs to be stored in an RDBMS database, which requires us to flatten the data into a dataframe before storing it into the table.

So, this is what I was thinking. We should have the following functions:
1) from_json - Convert a Dataframe of String to a DataFrame of StructType
2) to_json - Convert a Dataframe of StructType to a Dataframe of String
3) from_csv - Convert a Dataframe of String to a DataFrame of StructType
4) to_csv - Convert a DataFrame of StructType to a DataFrame of String
5) flatten - Convert a DataFrame of StructType into a DataFrame that has the same fields as the StructType

Essentially, the request in the Change Request can be done by calling *flatten(from_csv())*.

> Support for creating a dataframe from CSV in Dataset[String]
> -------------------------------------------------------------
>
>                 Key: SPARK-15463
>                 URL: https://issues.apache.org/jira/browse/SPARK-15463
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: PJ Fanning
>
> I currently use Databricks' spark-csv lib but some features don't work with Apache Spark 2.0.0-SNAPSHOT. I understand that with the addition of CSV support into spark-sql directly, spark-csv won't be modified.
> I currently read some CSV data that has been pre-processed and is in RDD[String] format.
> There is sqlContext.read.json(rdd: RDD[String]) but other formats don't appear to support the creation of DataFrames based on loading from RDD[String].

-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
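[Editor's note] As a side note on item 5 above, the "flatten" step can already be approximated once the data sits in a struct column, via star expansion. A small Scala sketch assuming an active SparkSession named {{spark}}; the JSON sample and column names are made up:

{code}
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import spark.implicits._  // assumes an active SparkSession named `spark`

val raw = Seq("""{"a":1,"b":"x"}""", """{"a":2,"b":"y"}""").toDF("json")
val schema = StructType(StructField("a", IntegerType) :: StructField("b", StringType) :: Nil)

// Parse each JSON string into a single struct column...
val parsed = raw.select(from_json($"json", schema).as("parsed"))

// ...then "flatten" it: star expansion promotes the struct's fields
// to top-level columns `a` and `b`.
val flat = parsed.select($"parsed.*")
flat.printSchema()
{code}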
[jira] [Assigned] (SPARK-19849) Support ArrayType in to_json function/expression
[ https://issues.apache.org/jira/browse/SPARK-19849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19849: Assignee: (was: Apache Spark) > Support ArrayType in to_json function/expression > > > Key: SPARK-19849 > URL: https://issues.apache.org/jira/browse/SPARK-19849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon > > After SPARK-19595, we could > {code} > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.types._ > val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil)) > Seq(("""[{"a": 1}, {"a": > 2}]""")).toDF("array").select(from_json(col("array"), schema)).show() > {code} > Maybe, it'd be better if we provide a way to read it back via {{to_json}} as > below: > {code} > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.types._ > val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil)) > val df = Seq("""[{"a":1}, {"a":2}]""").toDF("json").select(from_json($"json", > schema).as("array")) > df.show() > // Read back. > df.select(to_json($"array").as("json")).show() > {code} > {code} > +--+ > | array| > +--+ > |[[1], [2]]| > +--+ > +-+ > | json| > +-+ > |[{"a":1},{"a":2}]| > +-+ > {code} > Currently, it throws an exception as below: > {code} > org.apache.spark.sql.AnalysisException: cannot resolve > 'structtojson(`array`)' due to data type mismatch: structtojson requires that > the expression is a struct expression.;; > 'Project [structtojson(array#30, Some(Asia/Seoul)) AS structtojson(array)#45] > +- Project > [jsontostruct(ArrayType(StructType(StructField(a,IntegerType,true)),true), > array#27, Some(Asia/Seoul)) AS array#30] >+- Project [value#25 AS array#27] > +- LocalRelation [value#25] > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:72) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19849) Support ArrayType in to_json function/expression
[ https://issues.apache.org/jira/browse/SPARK-19849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899705#comment-15899705 ] Apache Spark commented on SPARK-19849: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/17192 > Support ArrayType in to_json function/expression > > > Key: SPARK-19849 > URL: https://issues.apache.org/jira/browse/SPARK-19849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon > > After SPARK-19595, we could > {code} > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.types._ > val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil)) > Seq(("""[{"a": 1}, {"a": > 2}]""")).toDF("array").select(from_json(col("array"), schema)).show() > {code} > Maybe, it'd be better if we provide a way to read it back via {{to_json}} as > below: > {code} > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.types._ > val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil)) > val df = Seq("""[{"a":1}, {"a":2}]""").toDF("json").select(from_json($"json", > schema).as("array")) > df.show() > // Read back. > df.select(to_json($"array").as("json")).show() > {code} > {code} > +--+ > | array| > +--+ > |[[1], [2]]| > +--+ > +-+ > | json| > +-+ > |[{"a":1},{"a":2}]| > +-+ > {code} > Currently, it throws an exception as below: > {code} > org.apache.spark.sql.AnalysisException: cannot resolve > 'structtojson(`array`)' due to data type mismatch: structtojson requires that > the expression is a struct expression.;; > 'Project [structtojson(array#30, Some(Asia/Seoul)) AS structtojson(array)#45] > +- Project > [jsontostruct(ArrayType(StructType(StructField(a,IntegerType,true)),true), > array#27, Some(Asia/Seoul)) AS array#30] >+- Project [value#25 AS array#27] > +- LocalRelation [value#25] > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:72) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19849) Support ArrayType in to_json function/expression
[ https://issues.apache.org/jira/browse/SPARK-19849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19849: Assignee: Apache Spark > Support ArrayType in to_json function/expression > > > Key: SPARK-19849 > URL: https://issues.apache.org/jira/browse/SPARK-19849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark > > After SPARK-19595, we could > {code} > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.types._ > val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil)) > Seq(("""[{"a": 1}, {"a": > 2}]""")).toDF("array").select(from_json(col("array"), schema)).show() > {code} > Maybe, it'd be better if we provide a way to read it back via {{to_json}} as > below: > {code} > import org.apache.spark.sql.functions._ > import org.apache.spark.sql.types._ > val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil)) > val df = Seq("""[{"a":1}, {"a":2}]""").toDF("json").select(from_json($"json", > schema).as("array")) > df.show() > // Read back. > df.select(to_json($"array").as("json")).show() > {code} > {code} > +--+ > | array| > +--+ > |[[1], [2]]| > +--+ > +-+ > | json| > +-+ > |[{"a":1},{"a":2}]| > +-+ > {code} > Currently, it throws an exception as below: > {code} > org.apache.spark.sql.AnalysisException: cannot resolve > 'structtojson(`array`)' due to data type mismatch: structtojson requires that > the expression is a struct expression.;; > 'Project [structtojson(array#30, Some(Asia/Seoul)) AS structtojson(array)#45] > +- Project > [jsontostruct(ArrayType(StructType(StructField(a,IntegerType,true)),true), > array#27, Some(Asia/Seoul)) AS array#30] >+- Project [value#25 AS array#27] > +- LocalRelation [value#25] > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:80) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:72) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19637) add to_json APIs to SQL
[ https://issues.apache.org/jira/browse/SPARK-19637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-19637. - Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.2.0 > add to_json APIs to SQL > --- > > Key: SPARK-19637 > URL: https://issues.apache.org/jira/browse/SPARK-19637 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.1.0 >Reporter: Burak Yavuz >Assignee: Takeshi Yamamuro > Fix For: 2.2.0 > > > The method "to_json" is a useful method in turning a struct into a json > string. It currently doesn't work in SQL, but adding it should be trivial. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19840) Disallow creating permanent functions with invalid class names
[ https://issues.apache.org/jira/browse/SPARK-19840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-19840: -- Description: Currently, Spark raises exceptions on creating invalid **temporary** functions, but doesn't for **permanent** functions. This issue aims to disallow creating permanent functions with invalid class names. *BEFORE* {code} scala> sql("CREATE TEMPORARY FUNCTION function_with_invalid_classname AS 'org.invalid'").show java.lang.ClassNotFoundException: org.invalid at ... scala> sql("CREATE FUNCTION function_with_invalid_classname AS 'org.invalid'").show ++ || ++ ++ scala> sql("show functions like 'function_*'").show(false) +---+ |function | +---+ |default.function_with_invalid_classname| +---+ scala> sql("select function_with_invalid_classname()").show org.apache.spark.sql.AnalysisException: Undefined function: 'function_with_invalid_classname'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 7 {code} *AFTER* {code} scala> sql("CREATE FUNCTION function_with_invalid_classname AS 'org.invalid'").show java.lang.ClassNotFoundException: org.invalid {code} was: Currently, Spark raises exceptions on creating invalid **temporary** functions, but doesn't for **permanent** functions. This issue aims to disallow creating permanent functions with invalid class names. **BEFORE** {code} scala> sql("CREATE TEMPORARY FUNCTION function_with_invalid_classname AS 'org.invalid'").show java.lang.ClassNotFoundException: org.invalid at ... scala> sql("CREATE FUNCTION function_with_invalid_classname AS 'org.invalid'").show ++ || ++ ++ {code} **AFTER** {code} scala> sql("CREATE FUNCTION function_with_invalid_classname AS 'org.invalid'").show java.lang.ClassNotFoundException: org.invalid {code} > Disallow creating permanent functions with invalid class names > -- > > Key: SPARK-19840 > URL: https://issues.apache.org/jira/browse/SPARK-19840 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Dongjoon Hyun > > Currently, Spark raises exceptions on creating invalid **temporary** > functions, but doesn't for **permanent** functions. This issue aims to > disallow creating permanent functions with invalid class names. > *BEFORE* > {code} > scala> sql("CREATE TEMPORARY FUNCTION function_with_invalid_classname AS > 'org.invalid'").show > java.lang.ClassNotFoundException: org.invalid at > ... > scala> sql("CREATE FUNCTION function_with_invalid_classname AS > 'org.invalid'").show > ++ > || > ++ > ++ > scala> sql("show functions like 'function_*'").show(false) > +---+ > |function | > +---+ > |default.function_with_invalid_classname| > +---+ > scala> sql("select function_with_invalid_classname()").show > org.apache.spark.sql.AnalysisException: Undefined function: > 'function_with_invalid_classname'. This function is neither a registered > temporary function nor a permanent function registered in the database > 'default'.; line 1 pos 7 > {code} > *AFTER* > {code} > scala> sql("CREATE FUNCTION function_with_invalid_classname AS > 'org.invalid'").show > java.lang.ClassNotFoundException: org.invalid > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19765) UNCACHE TABLE should also un-cache all cached plans that refer to this table
[ https://issues.apache.org/jira/browse/SPARK-19765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-19765: Labels: release_notes (was: ) > UNCACHE TABLE should also un-cache all cached plans that refer to this table > > > Key: SPARK-19765 > URL: https://issues.apache.org/jira/browse/SPARK-19765 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Labels: release_notes > > DropTableCommand, TruncateTableCommand, AlterTableRenameCommand, > UncacheTableCommand, RefreshTable and InsertIntoHiveTable will un-cache all > the cached plans that refer to this table -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19765) UNCACHE TABLE should also un-cache all cached plans that refer to this table
[ https://issues.apache.org/jira/browse/SPARK-19765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-19765: Description: DropTableCommand, TruncateTableCommand, AlterTableRenameCommand, UncacheTableCommand, RefreshTable and InsertIntoHiveTable will un-cache all the cached plans that refer to this table > UNCACHE TABLE should also un-cache all cached plans that refer to this table > > > Key: SPARK-19765 > URL: https://issues.apache.org/jira/browse/SPARK-19765 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Labels: release_notes > > DropTableCommand, TruncateTableCommand, AlterTableRenameCommand, > UncacheTableCommand, RefreshTable and InsertIntoHiveTable will un-cache all > the cached plans that refer to this table -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19765) UNCACHE TABLE should also un-cache all cached plans that refer to this table
[ https://issues.apache.org/jira/browse/SPARK-19765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-19765. - Resolution: Fixed Fix Version/s: 2.2.0 > UNCACHE TABLE should also un-cache all cached plans that refer to this table > > > Key: SPARK-19765 > URL: https://issues.apache.org/jira/browse/SPARK-19765 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Labels: release_notes > Fix For: 2.2.0 > > > DropTableCommand, TruncateTableCommand, AlterTableRenameCommand, > UncacheTableCommand, RefreshTable and InsertIntoHiveTable will un-cache all > the cached plans that refer to this table -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-18549) Failed to Uncache a View that References a Dropped Table.
[ https://issues.apache.org/jira/browse/SPARK-18549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-18549. - Resolution: Fixed Assignee: Wenchen Fan Fix Version/s: 2.2.0 > Failed to Uncache a View that References a Dropped Table. > - > > Key: SPARK-18549 > URL: https://issues.apache.org/jira/browse/SPARK-18549 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2 >Reporter: Xiao Li >Assignee: Wenchen Fan >Priority: Critical > Fix For: 2.2.0 > > > {code} > spark.range(1, 10).toDF("id1").write.format("json").saveAsTable("jt1") > spark.range(1, 10).toDF("id2").write.format("json").saveAsTable("jt2") > sql("CREATE VIEW testView AS SELECT * FROM jt1 JOIN jt2 ON id1 == id2") > // Cache is empty at the beginning > assert(spark.sharedState.cacheManager.isEmpty) > sql("CACHE TABLE testView") > assert(spark.catalog.isCached("testView")) > // Cache is not empty > assert(!spark.sharedState.cacheManager.isEmpty) > {code} > {code} > // drop a table referenced by a cached view > sql("DROP TABLE jt1") > -- So far everything is fine > // Failed to unache the view > val e = intercept[AnalysisException] { > sql("UNCACHE TABLE testView") > }.getMessage > assert(e.contains("Table or view not found: `default`.`jt1`")) > // We are unable to drop it from the cache > assert(!spark.sharedState.cacheManager.isEmpty) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899964#comment-15899964 ] Ari Gesher commented on SPARK-19764: We narrowed this down to driver OOM that wasn't being properly propagated into our Jupyter Notebook. > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, > SPARK-19764.tgz > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ari Gesher resolved SPARK-19764. Resolution: Not A Bug > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, > SPARK-19764.tgz > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates
Michael Styles created SPARK-19851: -- Summary: Add support for EVERY and ANY (SOME) aggregates Key: SPARK-19851 URL: https://issues.apache.org/jira/browse/SPARK-19851 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core, SQL Affects Versions: 2.1.0 Reporter: Michael Styles -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19348) pyspark.ml.Pipeline gets corrupted under multi threaded use
[ https://issues.apache.org/jira/browse/SPARK-19348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899979#comment-15899979 ] Apache Spark commented on SPARK-19348: -- User 'BryanCutler' has created a pull request for this issue: https://github.com/apache/spark/pull/17193 > pyspark.ml.Pipeline gets corrupted under multi threaded use > --- > > Key: SPARK-19348 > URL: https://issues.apache.org/jira/browse/SPARK-19348 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.6.0, 2.0.0, 2.1.0, 2.2.0 >Reporter: Vinayak Joshi >Assignee: Bryan Cutler > Fix For: 2.2.0 > > Attachments: pyspark_pipeline_threads.py > > > When pyspark.ml.Pipeline objects are constructed concurrently in separate > python threads, it is observed that the stages used to construct a pipeline > object get corrupted i.e the stages supplied to a Pipeline object in one > thread appear inside a different Pipeline object constructed in a different > thread. > Things work fine if construction of pyspark.ml.Pipeline objects is > serialized, so this looks like a thread safety problem with > pyspark.ml.Pipeline object construction. > Confirmed that the problem exists with Spark 1.6.x as well as 2.x. > While the corruption of the Pipeline stages is easily caught, we need to know > if performing other pipeline operations, such as pyspark.ml.pipeline.fit( ) > are also affected by the underlying cause of this problem. That is, whether > other pipeline operations like pyspark.ml.pipeline.fit( ) may be performed > in separate threads (on distinct pipeline objects) concurrently without any > cross contamination between them. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899980#comment-15899980 ] Shixiong Zhu commented on SPARK-19764: -- [~agesher] Do you have the OOM stack trace? So that we can fix it. > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, > SPARK-19764.tgz > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates
[ https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899988#comment-15899988 ] Michael Styles commented on SPARK-19851: https://github.com/apache/spark/pull/17194 > Add support for EVERY and ANY (SOME) aggregates > --- > > Key: SPARK-19851 > URL: https://issues.apache.org/jira/browse/SPARK-19851 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Michael Styles > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17498) StringIndexer.setHandleInvalid should have another option 'new'
[ https://issues.apache.org/jira/browse/SPARK-17498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-17498. --- Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16883 [https://github.com/apache/spark/pull/16883] > StringIndexer.setHandleInvalid should have another option 'new' > --- > > Key: SPARK-17498 > URL: https://issues.apache.org/jira/browse/SPARK-17498 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Miroslav Balaz >Assignee: Vincent >Priority: Minor > Fix For: 2.2.0 > > > That will map unseen label to maximum known label +1, IndexToString would map > that back to "" or NA if there is something like that in spark, -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
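For illustration only, a minimal sketch of how the proposed option might be used from the DataFrame-based API. The option name "new" is the proposal in this ticket (the shipped name could differ), and the column names are made up for the example.
{code}
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .setHandleInvalid("new")  // proposed: map labels unseen during fit to (max known index) + 1
{code}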
[jira] [Updated] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates
[ https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Styles updated SPARK-19851: --- Description: Add support for EVERY and ANY (SOME) aggregates. - EVERY returns true if all input values are true. - ANY returns true if at least one input value is true. - SOME is equivalent to ANY. Both aggregates are part of the SQL standard. > Add support for EVERY and ANY (SOME) aggregates > --- > > Key: SPARK-19851 > URL: https://issues.apache.org/jira/browse/SPARK-19851 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Michael Styles > > Add support for EVERY and ANY (SOME) aggregates. > - EVERY returns true if all input values are true. > - ANY returns true if at least one input value is true. > - SOME is equivalent to ANY. > Both aggregates are part of the SQL standard. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
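Until such aggregates exist, the described semantics can be emulated today; a minimal sketch, assuming a SparkSession named {{spark}} and a boolean column {{b}} (both names are illustrative):
{code}
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq((1, true), (1, true), (2, true), (2, false)).toDF("k", "b")
df.groupBy($"k").agg(
  (min($"b".cast("int")) === 1).as("every_b"),  // EVERY: true only if all values are true
  (max($"b".cast("int")) === 1).as("any_b")     // ANY / SOME: true if at least one value is true
).show()
{code}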
[jira] [Created] (SPARK-19852) StringIndexer.setHandleInvalid should have another option 'new': Python API and docs
Joseph K. Bradley created SPARK-19852: - Summary: StringIndexer.setHandleInvalid should have another option 'new': Python API and docs Key: SPARK-19852 URL: https://issues.apache.org/jira/browse/SPARK-19852 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 2.2.0 Reporter: Joseph K. Bradley Priority: Minor Update Python API for StringIndexer so setHandleInvalid doc is correct. This will probably require: * putting HandleInvalid within StringIndexer to update its built-in doc (See Bucketizer for an example.) * updating API docs and maybe the guide -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1599#comment-1599 ] Ari Gesher commented on SPARK-19764: We were collecting more data than we had heap for. Still useful? > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, > SPARK-19764.tgz > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19516) update public doc to use SparkSession instead of SparkContext
[ https://issues.apache.org/jira/browse/SPARK-19516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-19516. - Resolution: Fixed Fix Version/s: 2.2.0 Issue resolved by pull request 16856 [https://github.com/apache/spark/pull/16856] > update public doc to use SparkSession instead of SparkContext > - > > Key: SPARK-19516 > URL: https://issues.apache.org/jira/browse/SPARK-19516 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 2.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.2.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout resolved SPARK-19803. Resolution: Fixed Assignee: Genmao Yu Fix Version/s: 2.2.0 Thanks for fixing this [~uncleGen] and for reporting it [~sitalke...@gmail.com] > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Sital Kedia >Assignee: Genmao Yu > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite has made the > jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19803: --- Component/s: Tests > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Genmao Yu > Labels: flaky-test > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite has made the > jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19803: --- Affects Version/s: (was: 2.3.0) 2.2.0 > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Genmao Yu > Labels: flaky-test > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite has made the > jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19803) Flaky BlockManagerProactiveReplicationSuite tests
[ https://issues.apache.org/jira/browse/SPARK-19803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kay Ousterhout updated SPARK-19803: --- Labels: flaky-test (was: ) > Flaky BlockManagerProactiveReplicationSuite tests > - > > Key: SPARK-19803 > URL: https://issues.apache.org/jira/browse/SPARK-19803 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 2.2.0 >Reporter: Sital Kedia >Assignee: Genmao Yu > Labels: flaky-test > Fix For: 2.2.0 > > > The tests added for BlockManagerProactiveReplicationSuite has made the > jenkins build flaky. Please refer to the build for more details - > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73640/testReport/ -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19851) Add support for EVERY and ANY (SOME) aggregates
[ https://issues.apache.org/jira/browse/SPARK-19851?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19851: - Component/s: (was: Spark Core) > Add support for EVERY and ANY (SOME) aggregates > --- > > Key: SPARK-19851 > URL: https://issues.apache.org/jira/browse/SPARK-19851 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Michael Styles > > Add support for EVERY and ANY (SOME) aggregates. > - EVERY returns true if all input values are true. > - ANY returns true if at least one input value is true. > - SOME is equivalent to ANY. > Both aggregates are part of the SQL standard. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900127#comment-15900127 ] Shixiong Zhu commented on SPARK-19764: -- So you don't set an UncaughtExceptionHandler and this OOM happened on the driver side? If so, then it's not a Spark bug. > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, > SPARK-19764.tgz > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
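For context on the question above, a minimal driver-side sketch (an assumption about the user's setup, not Spark code): installing a JVM-wide default handler so an OOM thrown on a background thread is at least logged rather than lost.
{code}
Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
  override def uncaughtException(t: Thread, e: Throwable): Unit = {
    // Log and surface the failure instead of letting the thread die silently.
    System.err.println(s"Uncaught exception on thread ${t.getName}: $e")
  }
})
{code}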
[jira] [Commented] (SPARK-16207) order guarantees for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900152#comment-15900152 ] Chris Rogers commented on SPARK-16207: -- The lack of documentation on this is immensely confusing. > order guarantees for DataFrames > --- > > Key: SPARK-16207 > URL: https://issues.apache.org/jira/browse/SPARK-16207 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Max Moroz >Priority: Minor > > There's no clear explanation in the documentation about what guarantees are > available for the preservation of order in DataFrames. Different blogs, SO > answers, and posts on course websites suggest different things. It would be > good to provide clarity on this. > Examples of questions on which I could not find clarification: > 1) Does groupby() preserve order? > 2) Does take() preserve order? > 3) Is DataFrame guaranteed to have the same order of lines as the text file > it was read from? (Or as the json file, etc.) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16207) order guarantees for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900162#comment-15900162 ] Sean Owen commented on SPARK-16207: --- [~rcrogers] where would you document this? we could add a document in a place that would have helped you. However as I say, lots of things don't preserve order, so I don't know if it's sensible to write that everywhere. > order guarantees for DataFrames > --- > > Key: SPARK-16207 > URL: https://issues.apache.org/jira/browse/SPARK-16207 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Max Moroz >Priority: Minor > > There's no clear explanation in the documentation about what guarantees are > available for the preservation of order in DataFrames. Different blogs, SO > answers, and posts on course websites suggest different things. It would be > good to provide clarity on this. > Examples of questions on which I could not find clarification: > 1) Does groupby() preserve order? > 2) Does take() preserve order? > 3) Is DataFrame guaranteed to have the same order of lines as the text file > it was read from? (Or as the json file, etc.) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19561) Pyspark Dataframes don't allow timestamps near epoch
[ https://issues.apache.org/jira/browse/SPARK-19561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu resolved SPARK-19561. Resolution: Fixed Assignee: Jason White Fix Version/s: 2.2.0 2.1.1 > Pyspark Dataframes don't allow timestamps near epoch > > > Key: SPARK-19561 > URL: https://issues.apache.org/jira/browse/SPARK-19561 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Jason White >Assignee: Jason White > Fix For: 2.1.1, 2.2.0 > > > Pyspark does not allow timestamps at or near the epoch to be created in a > DataFrame. Related issue: https://issues.apache.org/jira/browse/SPARK-19299 > TimestampType.toInternal converts a datetime object to a number representing > microseconds since the epoch. For all times more than 2148 seconds before or > after 1970-01-01T00:00:00+, this number is greater than 2^31 and Py4J > automatically serializes it as a long. > However, for times within this range (~35 minutes before or after the epoch), > Py4J serializes it as an int. When creating the object on the Scala side, > ints are not recognized and the value goes to null. This leads to null values > in non-nullable fields, and corrupted Parquet files. > The solution is trivial - force TimestampType.toInternal to always return a > long. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
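A standalone sketch of the failure mode described above (illustrative only, not Spark's actual converter code): a value small enough to fit in 32 bits arrives from Py4J as a boxed Integer, so code matching only on Long misses it and produces null.
{code}
def asTimestampMicros(v: Any): Any = v match {
  case l: java.lang.Long => l
  case _                 => null  // a java.lang.Integer sent by Py4J for small values lands here
}

println(asTimestampMicros(java.lang.Long.valueOf(4000000000L)))   // 4000000000 (about 67 minutes after epoch)
println(asTimestampMicros(java.lang.Integer.valueOf(120000000)))  // null (2 minutes after epoch)
{code}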
[jira] [Commented] (SPARK-19767) API Doc pages for Streaming with Kafka 0.10 not current
[ https://issues.apache.org/jira/browse/SPARK-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900184#comment-15900184 ] Nick Afshartous commented on SPARK-19767: - Yes, I completed the steps in the Prerequisites section of https://github.com/apache/spark/blob/master/docs/README.md and got the same error about unknown tag {{include_example}} on two different computers (Linux and OSX). Looks like {{include_example}} is local in {{./docs/_plugins/include_example.rb}}, so maybe this is some kind of path issue where its not finding the local file ? > API Doc pages for Streaming with Kafka 0.10 not current > --- > > Key: SPARK-19767 > URL: https://issues.apache.org/jira/browse/SPARK-19767 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Nick Afshartous >Priority: Minor > > The API docs linked from the Spark Kafka 0.10 Integration page are not > current. For instance, on the page >https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > the code examples show the new API (i.e. class ConsumerStrategies). However, > following the links > API Docs --> (Scala | Java) > lead to API pages that do not have class ConsumerStrategies) . The API doc > package names also have {code}streaming.kafka{code} as opposed to > {code}streaming.kafka10{code} > as in the code examples on streaming-kafka-0-10-integration.html. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19767) API Doc pages for Streaming with Kafka 0.10 not current
[ https://issues.apache.org/jira/browse/SPARK-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900207#comment-15900207 ] Sean Owen commented on SPARK-19767: --- Oh, are you not running from the {{docs/}} directory? > API Doc pages for Streaming with Kafka 0.10 not current > --- > > Key: SPARK-19767 > URL: https://issues.apache.org/jira/browse/SPARK-19767 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Nick Afshartous >Priority: Minor > > The API docs linked from the Spark Kafka 0.10 Integration page are not > current. For instance, on the page >https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > the code examples show the new API (i.e. class ConsumerStrategies). However, > following the links > API Docs --> (Scala | Java) > lead to API pages that do not have class ConsumerStrategies) . The API doc > package names also have {code}streaming.kafka{code} as opposed to > {code}streaming.kafka10{code} > as in the code examples on streaming-kafka-0-10-integration.html. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19702) Increase refuse_seconds timeout in the Mesos Spark Dispatcher
[ https://issues.apache.org/jira/browse/SPARK-19702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen reassigned SPARK-19702: - Assignee: Michael Gummelt Priority: Minor (was: Major) Fix Version/s: 2.2.0 Resolved by https://github.com/apache/spark/pull/17031 > Increase refuse_seconds timeout in the Mesos Spark Dispatcher > -- > > Key: SPARK-19702 > URL: https://issues.apache.org/jira/browse/SPARK-19702 > Project: Spark > Issue Type: New Feature > Components: Mesos >Affects Versions: 2.1.0 >Reporter: Michael Gummelt >Assignee: Michael Gummelt >Priority: Minor > Fix For: 2.2.0 > > > Due to the problem described here: > https://issues.apache.org/jira/browse/MESOS-6112, Running > 5 Mesos > frameworks concurrently can result in starvation. For example, running 10 > dispatchers could result in 5 of them getting all the offers, even if they > have no jobs to launch. We must increase the refuse_seconds > timeout to solve this problem. Another option would have been to implement > suppress/revive, but that can cause starvation due to the unreliability of > Mesos RPC calls. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19767) API Doc pages for Streaming with Kafka 0.10 not current
[ https://issues.apache.org/jira/browse/SPARK-19767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900216#comment-15900216 ] Nick Afshartous commented on SPARK-19767: - Missed that one, thanks. > API Doc pages for Streaming with Kafka 0.10 not current > --- > > Key: SPARK-19767 > URL: https://issues.apache.org/jira/browse/SPARK-19767 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Nick Afshartous >Priority: Minor > > The API docs linked from the Spark Kafka 0.10 Integration page are not > current. For instance, on the page >https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html > the code examples show the new API (i.e. class ConsumerStrategies). However, > following the links > API Docs --> (Scala | Java) > lead to API pages that do not have class ConsumerStrategies) . The API doc > package names also have {code}streaming.kafka{code} as opposed to > {code}streaming.kafka10{code} > as in the code examples on streaming-kafka-0-10-integration.html. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16207) order guarantees for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900220#comment-15900220 ] Chris Rogers commented on SPARK-16207: -- [~srowen] since there is no documentation yet, I don't know whether a clear, coherent generalization can be made. I would be happy with "most of the methods DO NOT preserve order, with these specific exceptions", or "most of the methods DO preserve order, with these specific exceptions". Failing a generalization, I'd also be happy with method-by-method documentation of ordering semantics, which seems like a very minimal amount of copy-pasting ("Preserves ordering: yes", "Preserves ordering: no"). Maybe that's a good place to start, since there seems to be some confusion about what the generalization would be. > order guarantees for DataFrames > --- > > Key: SPARK-16207 > URL: https://issues.apache.org/jira/browse/SPARK-16207 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Max Moroz >Priority: Minor > > There's no clear explanation in the documentation about what guarantees are > available for the preservation of order in DataFrames. Different blogs, SO > answers, and posts on course websites suggest different things. It would be > good to provide clarity on this. > Examples of questions on which I could not find clarification: > 1) Does groupby() preserve order? > 2) Does take() preserve order? > 3) Is DataFrame guaranteed to have the same order of lines as the text file > it was read from? (Or as the json file, etc.) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16207) order guarantees for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-16207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900220#comment-15900220 ] Chris Rogers edited comment on SPARK-16207 at 3/7/17 9:52 PM: -- [~srowen] since there is no documentation yet, I don't know whether a clear, coherent generalization can be made. I would be happy with "most of the methods DO NOT preserve order, with these specific exceptions", or "most of the methods DO preserve order, with these specific exceptions". Failing a generalization, I'd also be happy with method-by-method documentation of ordering semantics, which seems like a very minimal amount of copy-pasting ("Preserves ordering: yes", "Preserves ordering: no"). Maybe that's a good place to start, since there seems to be some confusion about what the generalization would be. I'm new to Scala so not sure if this is practical, but maybe the appropriate methods could be moved to an `RDDPreservesOrdering` class with an implicit conversion, akin to `PairRDDFunctions`? was (Author: rcrogers): [~srowen] since there is no documentation yet, I don't know whether a clear, coherent generalization can be made. I would be happy with "most of the methods DO NOT preserve order, with these specific exceptions", or "most of the methods DO preserve order, with these specific exceptions". Failing a generalization, I'd also be happy with method-by-method documentation of ordering semantics, which seems like a very minimal amount of copy-pasting ("Preserves ordering: yes", "Preserves ordering: no"). Maybe that's a good place to start, since there seems to be some confusion about what the generalization would be. > order guarantees for DataFrames > --- > > Key: SPARK-16207 > URL: https://issues.apache.org/jira/browse/SPARK-16207 > Project: Spark > Issue Type: Documentation > Components: Spark Core >Affects Versions: 1.6.1 >Reporter: Max Moroz >Priority: Minor > > There's no clear explanation in the documentation about what guarantees are > available for the preservation of order in DataFrames. Different blogs, SO > answers, and posts on course websites suggest different things. It would be > good to provide clarity on this. > Examples of questions on which I could not find clarification: > 1) Does groupby() preserve order? > 2) Does take() preserve order? > 3) Is DataFrame guaranteed to have the same order of lines as the text file > it was read from? (Or as the json file, etc.) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
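For what it's worth, a minimal sketch of the practical advice that usually accompanies this question (assumes a SparkSession named {{spark}}; illustrative, not a statement of documented guarantees): when downstream order matters, impose it explicitly rather than relying on a transformation to preserve it.
{code}
import spark.implicits._

val df = spark.range(0, 1000).toDF("id")
val counts = df.groupBy(($"id" % 10).as("bucket")).count()
counts.orderBy($"bucket").show()  // ordering of the output is guaranteed only because of orderBy
{code}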
[jira] [Created] (SPARK-19853) Uppercase Kafka topics fail when startingOffsets are SpecificOffsets
Chris Bowden created SPARK-19853: Summary: Uppercase Kafka topics fail when startingOffsets are SpecificOffsets Key: SPARK-19853 URL: https://issues.apache.org/jira/browse/SPARK-19853 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Chris Bowden Priority: Trivial When using the KafkaSource with Structured Streaming, consumer assignments are not what the user expects if startingOffsets is set to an explicit set of topics/partitions in JSON where the topic(s) happen to have uppercase characters. When StartingOffsets is constructed, the original string value from options is transformed toLowerCase to make matching on "earliest" and "latest" case insensitive. However, the toLowerCase json is passed to SpecificOffsets for the terminal condition, so topic names may not be what the user intended by the time assignments are made with the underlying KafkaConsumer. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19853) Uppercase Kafka topics fail when startingOffsets are SpecificOffsets
[ https://issues.apache.org/jira/browse/SPARK-19853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bowden updated SPARK-19853: - Description: When using the KafkaSource with Structured Streaming, consumer assignments are not what the user expects if startingOffsets is set to an explicit set of topics/partitions in JSON where the topic(s) happen to have uppercase characters. When StartingOffsets is constructed, the original string value from options is transformed toLowerCase to make matching on "earliest" and "latest" case insensitive. However, the toLowerCase json is passed to SpecificOffsets for the terminal condition, so topic names may not be what the user intended by the time assignments are made with the underlying KafkaConsumer. >From KafkaSourceProvider: {code} val startingOffsets = caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase) match { case Some("latest") => LatestOffsets case Some("earliest") => EarliestOffsets case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json)) case None => LatestOffsets } {code} was:When using the KafkaSource with Structured Streaming, consumer assignments are not what the user expects if startingOffsets is set to an explicit set of topics/partitions in JSON where the topic(s) happen to have uppercase characters. When StartingOffsets is constructed, the original string value from options is transformed toLowerCase to make matching on "earliest" and "latest" case insensitive. However, the toLowerCase json is passed to SpecificOffsets for the terminal condition, so topic names may not be what the user intended by the time assignments are made with the underlying KafkaConsumer. > Uppercase Kafka topics fail when startingOffsets are SpecificOffsets > > > Key: SPARK-19853 > URL: https://issues.apache.org/jira/browse/SPARK-19853 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Chris Bowden >Priority: Trivial > > When using the KafkaSource with Structured Streaming, consumer assignments > are not what the user expects if startingOffsets is set to an explicit set of > topics/partitions in JSON where the topic(s) happen to have uppercase > characters. When StartingOffsets is constructed, the original string value > from options is transformed toLowerCase to make matching on "earliest" and > "latest" case insensitive. However, the toLowerCase json is passed to > SpecificOffsets for the terminal condition, so topic names may not be what > the user intended by the time assignments are made with the underlying > KafkaConsumer. > From KafkaSourceProvider: > {code} > val startingOffsets = > caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase) > match { > case Some("latest") => LatestOffsets > case Some("earliest") => EarliestOffsets > case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json)) > case None => LatestOffsets > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
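One possible shape of a fix, sketched directly from the snippet above (not necessarily the patch that ends up merged): lower-case the value only when matching the "earliest"/"latest" keywords, and pass the original, case-preserving JSON to SpecificOffsets so topic names reach the KafkaConsumer untouched.
{code}
val startingOffsets =
  caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim) match {
    case Some(s) if s.toLowerCase == "latest"   => LatestOffsets
    case Some(s) if s.toLowerCase == "earliest" => EarliestOffsets
    case Some(json)                             => SpecificOffsets(JsonUtils.partitionOffsets(json))
    case None                                   => LatestOffsets
  }
{code}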
[jira] [Updated] (SPARK-19853) Uppercase Kafka topics fail when startingOffsets are SpecificOffsets
[ https://issues.apache.org/jira/browse/SPARK-19853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-19853: - Target Version/s: 2.2.0 > Uppercase Kafka topics fail when startingOffsets are SpecificOffsets > > > Key: SPARK-19853 > URL: https://issues.apache.org/jira/browse/SPARK-19853 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Chris Bowden >Priority: Trivial > > When using the KafkaSource with Structured Streaming, consumer assignments > are not what the user expects if startingOffsets is set to an explicit set of > topics/partitions in JSON where the topic(s) happen to have uppercase > characters. When StartingOffsets is constructed, the original string value > from options is transformed toLowerCase to make matching on "earliest" and > "latest" case insensitive. However, the toLowerCase json is passed to > SpecificOffsets for the terminal condition, so topic names may not be what > the user intended by the time assignments are made with the underlying > KafkaConsumer. > From KafkaSourceProvider: > {code} > val startingOffsets = > caseInsensitiveParams.get(STARTING_OFFSETS_OPTION_KEY).map(_.trim.toLowerCase) > match { > case Some("latest") => LatestOffsets > case Some("earliest") => EarliestOffsets > case Some(json) => SpecificOffsets(JsonUtils.partitionOffsets(json)) > case None => LatestOffsets > } > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19764) Executors hang with supposedly running task that are really finished.
[ https://issues.apache.org/jira/browse/SPARK-19764?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ari Gesher updated SPARK-19764: --- We're driving everything from Python. It may be a bug that we're not getting the error to propagate up to the notebook - generally, we see exceptions. When we ran the same job from the PySpark shell, we saw the stacktrace, so I'm inclined to point at something in the notebook stop that made it not propagate. We're happy to investigate if you think it's useful. > Executors hang with supposedly running task that are really finished. > - > > Key: SPARK-19764 > URL: https://issues.apache.org/jira/browse/SPARK-19764 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.0.2 > Environment: Ubuntu 16.04 LTS > OpenJDK Runtime Environment (build 1.8.0_121-8u121-b13-0ubuntu1.16.04.2-b13) > Spark 2.0.2 - Spark Cluster Manager >Reporter: Ari Gesher > Attachments: driver-log-stderr.log, executor-2.log, netty-6153.jpg, > SPARK-19764.tgz > > > We've come across a job that won't finish. Running on a six-node cluster, > each of the executors end up with 5-7 tasks that are never marked as > completed. > Here's an excerpt from the web UI: > ||Index ▴||ID||Attempt||Status||Locality Level||Executor ID / Host||Launch > Time||Duration||Scheduler Delay||Task Deserialization Time||GC Time||Result > Serialization Time||Getting Result Time||Peak Execution Memory||Shuffle Read > Size / Records||Errors|| > |105 | 1131 | 0 | SUCCESS |PROCESS_LOCAL |4 / 172.31.24.171 | > 2017/02/27 22:51:36 | 1.9 min | 9 ms | 4 ms | 0.7 s | 2 ms| 6 ms| > 384.1 MB| 90.3 MB / 572 | | > |106| 1168| 0| RUNNING |ANY| 2 / 172.31.16.112| 2017/02/27 > 22:53:25|6.5 h |0 ms| 0 ms| 1 s |0 ms| 0 ms| |384.1 MB > |98.7 MB / 624 | | > However, the Executor reports the task as finished: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > As does the driver log: > {noformat} > 17/02/27 22:53:25 INFO Executor: Running task 106.0 in stage 5.0 (TID 1168) > 17/02/27 22:55:29 INFO Executor: Finished task 106.0 in stage 5.0 (TID 1168). > 2633558 bytes result sent via BlockManager) > {noformat} > Full log from this executor and the {{stderr}} from > {{app-20170227223614-0001/2/stderr}} attached. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18138) More officially deprecate support for Python 2.6, Java 7, and Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-18138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-18138: Labels: releasenotes (was: ) > More officially deprecate support for Python 2.6, Java 7, and Scala 2.10 > > > Key: SPARK-18138 > URL: https://issues.apache.org/jira/browse/SPARK-18138 > Project: Spark > Issue Type: Task >Reporter: Reynold Xin >Assignee: Sean Owen >Priority: Blocker > Labels: releasenotes > Fix For: 2.1.0 > > > Plan: > - Mark it very explicit in Spark 2.1.0 that support for the aforementioned > environments are deprecated. > - Remove support it Spark 2.2.0 > Also see mailing list discussion: > http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-tp19553p19577.html -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19854) Refactor file partitioning strategy to make it easier to extend / unit test
Reynold Xin created SPARK-19854: --- Summary: Refactor file partitioning strategy to make it easier to extend / unit test Key: SPARK-19854 URL: https://issues.apache.org/jira/browse/SPARK-19854 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.1.0 Reporter: Reynold Xin Assignee: Reynold Xin The file partitioning strategy is currently hard-coded in FileSourceScanExec. This is not ideal for two reasons: 1. It is difficult to unit test the default strategy; testing it today requires almost end-to-end tests that create actual files on the file system. 2. It is difficult to experiment with different partitioning strategies without adding a lot of if branches. The goal of this story is to create an internal interface so that the strategy becomes pluggable, enabling both better testing and experimentation. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
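Purely as a sketch of the kind of internal, pluggable interface the ticket describes; every name below (FilePartitionStrategy, FileSlice, FileBucket, GreedyBySize) is an assumption for illustration, not Spark's actual API.
{code}
case class FileSlice(path: String, start: Long, length: Long)  // stand-in for one file split
case class FileBucket(slices: Seq[FileSlice])                  // stand-in for one scan partition

trait FilePartitionStrategy {
  // Decide how file slices are grouped into the partitions a scan will read.
  def partition(slices: Seq[FileSlice], maxPartitionBytes: Long): Seq[FileBucket]
}

// A trivial strategy that is easy to unit test without touching a file system:
// greedy packing of slices into buckets of at most maxPartitionBytes.
object GreedyBySize extends FilePartitionStrategy {
  override def partition(slices: Seq[FileSlice], maxPartitionBytes: Long): Seq[FileBucket] = {
    val buckets = scala.collection.mutable.ArrayBuffer.empty[Vector[FileSlice]]
    var current = Vector.empty[FileSlice]
    var bytes = 0L
    for (s <- slices) {
      if (current.nonEmpty && bytes + s.length > maxPartitionBytes) {
        buckets += current
        current = Vector.empty
        bytes = 0L
      }
      current = current :+ s
      bytes += s.length
    }
    if (current.nonEmpty) buckets += current
    buckets.map(FileBucket(_)).toSeq
  }
}
{code}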
[jira] [Created] (SPARK-19855) Create an internal FilePartitionStrategy interface
Reynold Xin created SPARK-19855: --- Summary: Create an internal FilePartitionStrategy interface Key: SPARK-19855 URL: https://issues.apache.org/jira/browse/SPARK-19855 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.1.0 Reporter: Reynold Xin Assignee: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19856) Turn partitioning related test cases in FileSourceStrategySuite into unit tests
Reynold Xin created SPARK-19856: --- Summary: Turn partitioning related test cases in FileSourceStrategySuite into unit tests Key: SPARK-19856 URL: https://issues.apache.org/jira/browse/SPARK-19856 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.1.0 Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19856) Turn partitioning related test cases in FileSourceStrategySuite from integration tests into unit tests
[ https://issues.apache.org/jira/browse/SPARK-19856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-19856: Summary: Turn partitioning related test cases in FileSourceStrategySuite from integration tests into unit tests (was: Turn partitioning related test cases in FileSourceStrategySuite into unit tests) > Turn partitioning related test cases in FileSourceStrategySuite from > integration tests into unit tests > -- > > Key: SPARK-19856 > URL: https://issues.apache.org/jira/browse/SPARK-19856 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19855) Create an internal FilePartitionStrategy interface
[ https://issues.apache.org/jira/browse/SPARK-19855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19855: Assignee: Apache Spark (was: Reynold Xin) > Create an internal FilePartitionStrategy interface > -- > > Key: SPARK-19855 > URL: https://issues.apache.org/jira/browse/SPARK-19855 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19855) Create an internal FilePartitionStrategy interface
[ https://issues.apache.org/jira/browse/SPARK-19855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900315#comment-15900315 ] Apache Spark commented on SPARK-19855: -- User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/17196 > Create an internal FilePartitionStrategy interface > -- > > Key: SPARK-19855 > URL: https://issues.apache.org/jira/browse/SPARK-19855 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19855) Create an internal FilePartitionStrategy interface
[ https://issues.apache.org/jira/browse/SPARK-19855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19855: Assignee: Reynold Xin (was: Apache Spark) > Create an internal FilePartitionStrategy interface > -- > > Key: SPARK-19855 > URL: https://issues.apache.org/jira/browse/SPARK-19855 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19857) CredentialUpdater calculates the wrong time for next update
Marcelo Vanzin created SPARK-19857: -- Summary: CredentialUpdater calculates the wrong time for next update Key: SPARK-19857 URL: https://issues.apache.org/jira/browse/SPARK-19857 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 2.1.0 Reporter: Marcelo Vanzin This is the code: {code} val remainingTime = getTimeOfNextUpdateFromFileName(credentialsStatus.getPath) - System.currentTimeMillis() {code} If you spot the problem, you get a virtual cookie. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
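A guess at the problem being hinted at (an assumption, not a statement of the eventual fix): with the subtraction on its own line, Scala's semicolon inference ends the assignment after the first expression and treats {{- System.currentTimeMillis()}} as a separate, discarded statement, so the value is an absolute timestamp rather than a remaining delta. A standalone toy reproduction:
{code}
def demo(): Unit = {
  val nextUpdateAt = System.currentTimeMillis() + 60000L  // stand-in for getTimeOfNextUpdateFromFileName(...)

  val broken = nextUpdateAt
  - System.currentTimeMillis()     // parsed as its own statement and discarded

  val fixed = (nextUpdateAt
    - System.currentTimeMillis())  // parentheses keep it one expression

  println(broken)  // a huge epoch-millis value, not a delay
  println(fixed)   // roughly 60000
}
{code}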
[jira] [Assigned] (SPARK-19857) CredentialUpdater calculates the wrong time for next update
[ https://issues.apache.org/jira/browse/SPARK-19857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19857: Assignee: Apache Spark > CredentialUpdater calculates the wrong time for next update > --- > > Key: SPARK-19857 > URL: https://issues.apache.org/jira/browse/SPARK-19857 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Assignee: Apache Spark > > This is the code: > {code} > val remainingTime = > getTimeOfNextUpdateFromFileName(credentialsStatus.getPath) > - System.currentTimeMillis() > {code} > If you spot the problem, you get a virtual cookie. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19857) CredentialUpdater calculates the wrong time for next update
[ https://issues.apache.org/jira/browse/SPARK-19857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19857: Assignee: (was: Apache Spark) > CredentialUpdater calculates the wrong time for next update > --- > > Key: SPARK-19857 > URL: https://issues.apache.org/jira/browse/SPARK-19857 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin > > This is the code: > {code} > val remainingTime = > getTimeOfNextUpdateFromFileName(credentialsStatus.getPath) > - System.currentTimeMillis() > {code} > If you spot the problem, you get a virtual cookie. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19857) CredentialUpdater calculates the wrong time for next update
[ https://issues.apache.org/jira/browse/SPARK-19857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900363#comment-15900363 ] Apache Spark commented on SPARK-19857: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/17198 > CredentialUpdater calculates the wrong time for next update > --- > > Key: SPARK-19857 > URL: https://issues.apache.org/jira/browse/SPARK-19857 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin > > This is the code: > {code} > val remainingTime = > getTimeOfNextUpdateFromFileName(credentialsStatus.getPath) > - System.currentTimeMillis() > {code} > If you spot the problem, you get a virtual cookie. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases
Shixiong Zhu created SPARK-19858: Summary: Add output mode to flatMapGroupsWithState and disallow invalid cases Key: SPARK-19858 URL: https://issues.apache.org/jira/browse/SPARK-19858 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 2.2.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases
[ https://issues.apache.org/jira/browse/SPARK-19858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19858: Assignee: Apache Spark (was: Shixiong Zhu) > Add output mode to flatMapGroupsWithState and disallow invalid cases > > > Key: SPARK-19858 > URL: https://issues.apache.org/jira/browse/SPARK-19858 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases
[ https://issues.apache.org/jira/browse/SPARK-19858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900377#comment-15900377 ] Apache Spark commented on SPARK-19858: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/17197 > Add output mode to flatMapGroupsWithState and disallow invalid cases > > > Key: SPARK-19858 > URL: https://issues.apache.org/jira/browse/SPARK-19858 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19858) Add output mode to flatMapGroupsWithState and disallow invalid cases
[ https://issues.apache.org/jira/browse/SPARK-19858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19858: Assignee: Shixiong Zhu (was: Apache Spark) > Add output mode to flatMapGroupsWithState and disallow invalid cases > > > Key: SPARK-19858 > URL: https://issues.apache.org/jira/browse/SPARK-19858 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19859) The new watermark should override the old one
Shixiong Zhu created SPARK-19859: Summary: The new watermark should override the old one Key: SPARK-19859 URL: https://issues.apache.org/jira/browse/SPARK-19859 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0 Reporter: Shixiong Zhu Assignee: Shixiong Zhu The new watermark should override the old one. Otherwise, we just pick up the first column that has a watermark, which may be unexpected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
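A minimal sketch of the intended behavior (assumes a streaming Dataset named {{events}} with an {{eventTime}} column; both names are illustrative): the watermark set later should be the one in effect.
{code}
val withWm = events
  .withWatermark("eventTime", "10 minutes")
  .withWatermark("eventTime", "1 hour")  // should override the 10-minute watermark above
{code}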
[jira] [Assigned] (SPARK-19859) The new watermark should override the old one
[ https://issues.apache.org/jira/browse/SPARK-19859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19859: Assignee: Apache Spark (was: Shixiong Zhu) > The new watermark should override the old one > - > > Key: SPARK-19859 > URL: https://issues.apache.org/jira/browse/SPARK-19859 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Apache Spark > > The new watermark should override the old one. Otherwise, we just pick up the > first column which has a watermark, it may be unexpected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19859) The new watermark should override the old one
[ https://issues.apache.org/jira/browse/SPARK-19859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900410#comment-15900410 ] Apache Spark commented on SPARK-19859: -- User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/17199 > The new watermark should override the old one > - > > Key: SPARK-19859 > URL: https://issues.apache.org/jira/browse/SPARK-19859 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > The new watermark should override the old one. Otherwise, we just pick up the > first column which has a watermark, it may be unexpected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19859) The new watermark should override the old one
[ https://issues.apache.org/jira/browse/SPARK-19859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19859: Assignee: Shixiong Zhu (was: Apache Spark) > The new watermark should override the old one > - > > Key: SPARK-19859 > URL: https://issues.apache.org/jira/browse/SPARK-19859 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu > > The new watermark should override the old one. Otherwise, we just pick up the > first column which has a watermark, it may be unexpected. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-19857) CredentialUpdater calculates the wrong time for next update
[ https://issues.apache.org/jira/browse/SPARK-19857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-19857. Resolution: Fixed Assignee: Marcelo Vanzin Fix Version/s: 2.2.0 2.1.1 > CredentialUpdater calculates the wrong time for next update > --- > > Key: SPARK-19857 > URL: https://issues.apache.org/jira/browse/SPARK-19857 > Project: Spark > Issue Type: Bug > Components: YARN >Affects Versions: 2.1.0 >Reporter: Marcelo Vanzin >Assignee: Marcelo Vanzin > Fix For: 2.1.1, 2.2.0 > > > This is the code: > {code} > val remainingTime = > getTimeOfNextUpdateFromFileName(credentialsStatus.getPath) > - System.currentTimeMillis() > {code} > If you spot the problem, you get a virtual cookie. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
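A guess at the virtual cookie, based only on the snippet above (the original code breaks the line before the minus sign): Scala's semicolon inference ends the first statement at the end of the line, so the leading "-" on the next line becomes a separate unary-minus expression whose value is discarded, and remainingTime ends up holding an absolute timestamp instead of a duration. A self-contained illustration of the pitfall and the obvious repair:
{code}
object LeadingMinusPitfall {
  // Stand-in for the real helper: "one minute from now" as epoch milliseconds.
  def timeOfNextUpdate: Long = System.currentTimeMillis() + 60000L

  def main(args: Array[String]): Unit = {
    // The leading "-" on the second line parses as its own unary-minus
    // statement, so the val is assigned the absolute time, not a delta.
    val remainingTimeBuggy = timeOfNextUpdate
    - System.currentTimeMillis()

    // Keeping the operator on the same line (or parenthesising the whole
    // expression) yields the intended duration.
    val remainingTimeFixed = timeOfNextUpdate - System.currentTimeMillis()

    println(s"buggy: $remainingTimeBuggy ms, fixed: $remainingTimeFixed ms")
  }
}
{code}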
[jira] [Commented] (SPARK-19852) StringIndexer.setHandleInvalid should have another option 'new': Python API and docs
[ https://issues.apache.org/jira/browse/SPARK-19852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900463#comment-15900463 ] Vincent commented on SPARK-19852: - I can work on this issue, since it is related to SPARK-17498 > StringIndexer.setHandleInvalid should have another option 'new': Python API > and docs > > > Key: SPARK-19852 > URL: https://issues.apache.org/jira/browse/SPARK-19852 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Update Python API for StringIndexer so setHandleInvalid doc is correct. This > will probably require: > * putting HandleInvalid within StringIndexer to update its built-in doc (See > Bucketizer for an example.) > * updating API docs and maybe the guide -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
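The ticket is about the Python API and docs, but for context, this is the parameter in question as it behaves today in the Scala API (data and column names are made up; the name of the proposed third option is not settled here):
{code}
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.sql.SparkSession

object HandleInvalidSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("handle-invalid-sketch").getOrCreate()
    import spark.implicits._

    val train = Seq("a", "b", "c").toDF("category")
    val test  = Seq("a", "d").toDF("category") // "d" was never seen while fitting

    val model = new StringIndexer()
      .setInputCol("category")
      .setOutputCol("categoryIndex")
      .setHandleInvalid("skip") // today: "error" (default) or "skip"
      .fit(train)

    // With "skip" the unseen label is dropped; the proposed third option would
    // instead keep such rows under a dedicated index.
    model.transform(test).show()
  }
}
{code}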
[jira] [Commented] (SPARK-19561) Pyspark Dataframes don't allow timestamps near epoch
[ https://issues.apache.org/jira/browse/SPARK-19561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900524#comment-15900524 ] Apache Spark commented on SPARK-19561: -- User 'JasonMWhite' has created a pull request for this issue: https://github.com/apache/spark/pull/17200 > Pyspark Dataframes don't allow timestamps near epoch > > > Key: SPARK-19561 > URL: https://issues.apache.org/jira/browse/SPARK-19561 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.0.1, 2.1.0 >Reporter: Jason White >Assignee: Jason White > Fix For: 2.1.1, 2.2.0 > > > Pyspark does not allow timestamps at or near the epoch to be created in a > DataFrame. Related issue: https://issues.apache.org/jira/browse/SPARK-19299 > TimestampType.toInternal converts a datetime object to a number representing > microseconds since the epoch. For all times more than 2148 seconds before or > after 1970-01-01T00:00:00+, this number is greater than 2^31 and Py4J > automatically serializes it as a long. > However, for times within this range (~35 minutes before or after the epoch), > Py4J serializes it as an int. When creating the object on the Scala side, > ints are not recognized and the value goes to null. This leads to null values > in non-nullable fields, and corrupted Parquet files. > The solution is trivial - force TimestampType.toInternal to always return a > long. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
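The "~35 minutes" window in the description follows directly from where a microsecond count stops fitting in a signed 32-bit int; a quick arithmetic check (independent of the PySpark fix itself):
{code}
object EpochWindowArithmetic {
  def main(args: Array[String]): Unit = {
    // Timestamps are serialized as microseconds since the epoch; values that fit
    // in a signed 32-bit int are sent by Py4J as an int, larger ones as a long.
    val boundaryMicros = Int.MaxValue.toLong + 1 // 2^31 = 2147483648 microseconds
    val seconds = boundaryMicros / 1e6           // ~2147.5 s, the "2148 seconds" above
    val minutes = seconds / 60                   // ~35.8 min
    println(f"int-sized range ends $seconds%.0f s (about $minutes%.1f min) either side of the epoch")
  }
}
{code}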
[jira] [Commented] (SPARK-16333) Excessive Spark history event/json data size (5GB each)
[ https://issues.apache.org/jira/browse/SPARK-16333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900533#comment-15900533 ] Jim Kleckner commented on SPARK-16333: -- I ended up here when looking into why an upgrade of our streaming computation to 2.1.0 was pegging the network at a gigabit/second. Setting spark.eventLog.enabled to false confirmed that this logging from slave port 50010 was the culprit. How can anyone with seriously large numbers of tasks use the Spark history server with this amount of load? > Excessive Spark history event/json data size (5GB each) > --- > > Key: SPARK-16333 > URL: https://issues.apache.org/jira/browse/SPARK-16333 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.0.0 > Environment: this is seen on both x86 (Intel(R) Xeon(R), E5-2699 ) > and ppc platform (Habanero, Model: 8348-21C), Red Hat Enterprise Linux Server > release 7.2 (Maipo)., Spark2.0.0-preview (May-24, 2016 build) >Reporter: Peter Liu > Labels: performance, spark2.0.0 > > With Spark2.0.0-preview (May-24 build), the history event data (the json > file), that is generated for each Spark application run (see below), can be > as big as 5GB (instead of 14 MB for exactly the same application run and the > same input data of 1TB under Spark1.6.1) > -rwxrwx--- 1 root root 5.3G Jun 30 09:39 app-20160630091959- > -rwxrwx--- 1 root root 5.3G Jun 30 09:56 app-20160630094213- > -rwxrwx--- 1 root root 5.3G Jun 30 10:13 app-20160630095856- > -rwxrwx--- 1 root root 5.3G Jun 30 10:30 app-20160630101556- > The test is done with Sparkbench V2, SQL RDD (see github: > https://github.com/SparkTC/spark-bench) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
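For anyone isolating the same overhead, the switch referred to above is the standard spark.eventLog.enabled property; a minimal way to turn it off for a single application (the directory value is shown only as an example):
{code}
import org.apache.spark.sql.SparkSession

object EventLogOffSketch {
  def main(args: Array[String]): Unit = {
    // Disable the event log for this application only, to rule it out as the
    // source of the network load described in the comment.
    val spark = SparkSession.builder()
      .appName("event-log-off")
      .config("spark.eventLog.enabled", "false")
      // .config("spark.eventLog.dir", "hdfs:///spark-history") // only relevant when enabled
      .getOrCreate()

    spark.range(10).count() // placeholder workload
    spark.stop()
  }
}
{code}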
[jira] [Assigned] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-18055: Assignee: Michael Armbrust > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu >Assignee: Michael Armbrust > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19810) Remove support for Scala 2.10
[ https://issues.apache.org/jira/browse/SPARK-19810?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900548#comment-15900548 ] Min Shen commented on SPARK-19810: -- [~srowen], Want to get an idea regarding the timeline for removing Scala 2.10. We have heavy usage of Spark at LinkedIn, and we are right now still deploying Spark built with Scala 2.10 due to various dependencies on other systems we have which still rely on Scala 2.10. While we also have plans to upgrade our various internal systems to start using Scala 2.11, it will take a while for that to happen. In the mean time, if support for Scala 2.10 is removed in Spark 2.2, this is going to potentially block us from upgrading to Spark 2.2+ while we haven't fully moved off Scala 2.10 yet. Want to raise this concern here and also to understand the timeline for removing Scala 2.10 in Spark. > Remove support for Scala 2.10 > - > > Key: SPARK-19810 > URL: https://issues.apache.org/jira/browse/SPARK-19810 > Project: Spark > Issue Type: Task > Components: ML, Spark Core, SQL >Affects Versions: 2.1.0 >Reporter: Sean Owen >Assignee: Sean Owen >Priority: Critical > > This tracks the removal of Scala 2.10 support, as discussed in > http://apache-spark-developers-list.1001551.n3.nabble.com/Straw-poll-dropping-support-for-things-like-Scala-2-10-td19553.html > and other lists. > The primary motivations are to simplify the code and build, and to enable > Scala 2.12 support later. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-18055: - Target Version/s: 2.2.0 > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu >Assignee: Michael Armbrust > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18055: Assignee: Apache Spark (was: Michael Armbrust) > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu >Assignee: Apache Spark > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-18055: Assignee: Michael Armbrust (was: Apache Spark) > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu >Assignee: Michael Armbrust > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18055) Dataset.flatMap can't work with types from customized jar
[ https://issues.apache.org/jira/browse/SPARK-18055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900568#comment-15900568 ] Apache Spark commented on SPARK-18055: -- User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/17201 > Dataset.flatMap can't work with types from customized jar > - > > Key: SPARK-18055 > URL: https://issues.apache.org/jira/browse/SPARK-18055 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 >Reporter: Davies Liu >Assignee: Michael Armbrust > Attachments: test-jar_2.11-1.0.jar > > > Try to apply flatMap() on Dataset column which of of type > com.A.B > Here's a schema of a dataset: > {code} > root > |-- id: string (nullable = true) > |-- outputs: array (nullable = true) > ||-- element: string > {code} > flatMap works on RDD > {code} > ds.rdd.flatMap(_.outputs) > {code} > flatMap doesnt work on dataset and gives the following error > {code} > ds.flatMap(_.outputs) > {code} > The exception: > {code} > scala.ScalaReflectionException: class com.A.B in JavaMirror … not found > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123) > at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22) > at > line189424fbb8cd47b3b62dc41e417841c159.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$typecreator3$1.apply(:51) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.SQLImplicits$$typecreator9$1.apply(SQLImplicits.scala:125) > at > scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232) > at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49) > at > org.apache.spark.sql.SQLImplicits.newProductSeqEncoder(SQLImplicits.scala:125) > {code} > Spoke to Michael Armbrust and he confirmed it as a Dataset bug. > There is a workaround using explode() > {code} > ds.select(explode(col("outputs"))) > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-19860) DataFrame join get conflict error if two frames has a same name column.
wuchang created SPARK-19860: --- Summary: DataFrame join get conflict error if two frames has a same name column. Key: SPARK-19860 URL: https://issues.apache.org/jira/browse/SPARK-19860 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.1.0 Reporter: wuchang >>> print df1.collect() [Row(fdate=u'20170223', in_amount1=7758588), Row(fdate=u'20170302', in_amount1=7656414), Row(fdate=u'20170207', in_amount1=7836305), Row(fdate=u'20170208', in_amount1=14887432), Row(fdate=u'20170224', in_amount1=16506043), Row(fdate=u'20170201', in_amount1=7339381), Row(fdate=u'20170221', in_amount1=7490447), Row(fdate=u'20170303', in_amount1=11142114), Row(fdate=u'20170202', in_amount1=7882746), Row(fdate=u'20170306', in_amount1=12977822), Row(fdate=u'20170227', in_amount1=15480688), Row(fdate=u'20170206', in_amount1=11370812), Row(fdate=u'20170217', in_amount1=8208985), Row(fdate=u'20170203', in_amount1=8175477), Row(fdate=u'20170222', in_amount1=11032303), Row(fdate=u'20170216', in_amount1=11986702), Row(fdate=u'20170209', in_amount1=9082380), Row(fdate=u'20170214', in_amount1=8142569), Row(fdate=u'20170307', in_amount1=11092829), Row(fdate=u'20170213', in_amount1=12341887), Row(fdate=u'20170228', in_amount1=13966203), Row(fdate=u'20170220', in_amount1=9397558), Row(fdate=u'20170210', in_amount1=8205431), Row(fdate=u'20170215', in_amount1=7070829), Row(fdate=u'20170301', in_amount1=10159653)] >>> print df2.collect() [Row(fdate=u'20170223', in_amount2=7072120), Row(fdate=u'20170302', in_amount2=5548515), Row(fdate=u'20170207', in_amount2=5451110), Row(fdate=u'20170208', in_amount2=4483131), Row(fdate=u'20170224', in_amount2=9674888), Row(fdate=u'20170201', in_amount2=3227502), Row(fdate=u'20170221', in_amount2=5084800), Row(fdate=u'20170303', in_amount2=20577801), Row(fdate=u'20170202', in_amount2=4024218), Row(fdate=u'20170306', in_amount2=8581773), Row(fdate=u'20170227', in_amount2=5748035), Row(fdate=u'20170206', in_amount2=7330154), Row(fdate=u'20170217', in_amount2=6838105), Row(fdate=u'20170203', in_amount2=9390262), Row(fdate=u'20170222', in_amount2=3800662), Row(fdate=u'20170216', in_amount2=4338891), Row(fdate=u'20170209', in_amount2=4024611), Row(fdate=u'20170214', in_amount2=4030389), Row(fdate=u'20170307', in_amount2=5504936), Row(fdate=u'20170213', in_amount2=7142428), Row(fdate=u'20170228', in_amount2=8618951), Row(fdate=u'20170220', in_amount2=8172290), Row(fdate=u'20170210', in_amount2=8411312), Row(fdate=u'20170215', in_amount2=5302422), Row(fdate=u'20170301', in_amount2=9475418)] >>> ht_net_in_df = df1.join(df2,df1.fdate == df2.fdate,'inner') 2017-03-08 10:27:34,357 WARN [Thread-2] sql.Column: Constructing trivially true equals predicate, 'fdate#42 = fdate#42'. Perhaps you need to use aliases. 
Traceback (most recent call last): File "", line 1, in File "/home/spark/python/pyspark/sql/dataframe.py", line 652, in join jdf = self._jdf.join(other._jdf, on._jc, how) File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__ File "/home/spark/python/pyspark/sql/utils.py", line 69, in deco raise AnalysisException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.AnalysisException: u" Failure when resolving conflicting references in Join: 'Join Inner, (fdate#42 = fdate#42) :- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) as int) AS in_amount1#97] : +- Filter (inorout#44 = A) : +- Project [firm_id#40, partnerid#45, inorout#44, inoutmoney#47, fdate#42] :+- Filter (((partnerid#45 = pmec) && NOT (firm_id#40 = NULL)) && (NOT (firm_id#40 = -1) && (fdate#42 >= 20170201))) : +- SubqueryAlias history_transfer_v : +- Project [md5(cast(firmid#41 as binary)) AS FIRM_ID#40, fdate#42, ftime#43, inorout#44, partnerid#45, realdate#46, inoutmoney#47, bankwaterid#48, waterid#49, waterstate#50, source#51] : +- SubqueryAlias history_transfer :+- Relation[firmid#41,fdate#42,ftime#43,inorout#44,partnerid#45,realdate#46,inoutmoney#47,bankwaterid#48,waterid#49,waterstate#50,source#51] parquet +- Aggregate [fdate#42], [fdate#42, cast(sum(cast(inoutmoney#47 as double)) as int)
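The report above is against PySpark, but the same conflicting-references failure appears in the Scala API whenever both join inputs come from the same lineage. A sketch with made-up data of the two usual workarounds: joining on the column name (which also de-duplicates it) or aliasing both sides.
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object SameColumnJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("same-column-join-sketch").getOrCreate()
    import spark.implicits._

    val base = Seq(("20170223", 10, "A"), ("20170223", 5, "B"))
      .toDF("fdate", "amount", "inorout")
    val df1 = base.filter($"inorout" === "A").groupBy("fdate").agg(sum("amount").as("in_amount1"))
    val df2 = base.filter($"inorout" === "B").groupBy("fdate").agg(sum("amount").as("in_amount2"))

    // Ambiguous, as in the report: both fdate references resolve to the same attribute.
    // df1.join(df2, df1("fdate") === df2("fdate"), "inner")

    // Workaround 1: join on the column name.
    val joined1 = df1.join(df2, Seq("fdate"), "inner")

    // Workaround 2: alias both sides and qualify the join columns.
    val a = df1.alias("a")
    val b = df2.alias("b")
    val joined2 = a.join(b, $"a.fdate" === $"b.fdate", "inner")

    joined1.show()
    joined2.show()
  }
}
{code}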
[jira] [Commented] (SPARK-18359) Let user specify locale in CSV parsing
[ https://issues.apache.org/jira/browse/SPARK-18359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900608#comment-15900608 ] Takeshi Yamamuro commented on SPARK-18359: -- Since JDK9 uses CLDR as its default locale data, it seems good to handle this explicitly: http://openjdk.java.net/jeps/252 > Let user specify locale in CSV parsing > -- > > Key: SPARK-18359 > URL: https://issues.apache.org/jira/browse/SPARK-18359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.0.0, 2.0.1 >Reporter: yannick Radji > > On the DataFrameReader object there is no CSV-specific option to set the decimal > delimiter to a comma instead of a dot, as is customary in France and elsewhere in Europe. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
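Until a locale (or decimal-separator) option exists, a workaround that is often used is to read the affected column as a string and normalise it; the file path, column name, and ';' field separator below are assumptions for the example, not part of any proposed API.
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.regexp_replace

object CommaDecimalWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("csv-locale-workaround").getOrCreate()
    import spark.implicits._

    val raw = spark.read
      .option("header", "true")
      .option("sep", ";")          // many European CSVs use ';' as the field separator
      .csv("/tmp/prices_fr.csv")   // hypothetical input file

    // "1 234,56" -> 1234.56 (plain-space thousands separators only; a sketch, not a full parser)
    val parsed = raw.withColumn(
      "price",
      regexp_replace(regexp_replace($"price", "\\s", ""), ",", ".").cast("double"))

    parsed.printSchema()
  }
}
{code}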
[jira] [Created] (SPARK-19861) watermark should not be a negative time.
Genmao Yu created SPARK-19861: - Summary: watermark should not be a negative time. Key: SPARK-19861 URL: https://issues.apache.org/jira/browse/SPARK-19861 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.1.0, 2.0.2 Reporter: Genmao Yu Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
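A standalone illustration of the validation this issue asks for, not the actual patch: reject a negative delay before it reaches the streaming planner. The tiny parser below only understands "<n> seconds/minutes" and exists purely to keep the example self-contained.
{code}
object NegativeWatermarkCheck {
  def delayToMillis(delay: String): Long = delay.trim.split("\\s+") match {
    case Array(n, "seconds") => n.toLong * 1000
    case Array(n, "minutes") => n.toLong * 60 * 1000
    case _ => throw new IllegalArgumentException(s"unsupported delay: $delay")
  }

  def validatedDelay(delay: String): Long = {
    val ms = delayToMillis(delay)
    require(ms >= 0, s"delay threshold ($delay) should not be negative")
    ms
  }

  def main(args: Array[String]): Unit = {
    println(validatedDelay("10 minutes"))  // prints 600000
    println(validatedDelay("-10 minutes")) // throws IllegalArgumentException
  }
}
{code}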
[jira] [Assigned] (SPARK-19861) watermark should not be a negative time.
[ https://issues.apache.org/jira/browse/SPARK-19861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19861: Assignee: Apache Spark > watermark should not be a negative time. > > > Key: SPARK-19861 > URL: https://issues.apache.org/jira/browse/SPARK-19861 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-19861) watermark should not be a negative time.
[ https://issues.apache.org/jira/browse/SPARK-19861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-19861: Assignee: (was: Apache Spark) > watermark should not be a negative time. > > > Key: SPARK-19861 > URL: https://issues.apache.org/jira/browse/SPARK-19861 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.0.2, 2.1.0 >Reporter: Genmao Yu >Priority: Minor > -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org