[jira] [Commented] (SPARK-5059) list of user's objects in Spark REPL
[ https://issues.apache.org/jira/browse/SPARK-5059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263483#comment-14263483 ] Sean Owen commented on SPARK-5059: -- It makes some sense, although the Spark shell is really just a clone of the Scala shell. Maybe it is best to contribute your patch there first. > list of user's objects in Spark REPL > > > Key: SPARK-5059 > URL: https://issues.apache.org/jira/browse/SPARK-5059 > Project: Spark > Issue Type: New Feature > Components: Spark Shell >Reporter: Tomas Hudik >Priority: Minor > Labels: spark-shell > > Often users do not remember all the objects they have created in the Spark REPL (shell). > It would be helpful to have a command that would list all such objects. E.g. > R uses *ls()* to list all objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263516#comment-14263516 ] Cheng Lian commented on SPARK-1529: --- Hi [~srowen], first of all, we are not trying to put shuffle and temp files in HDFS. At the time this ticket was created, the initial motivation was to support MapR, because MapR only exposes the local file system via MapR volumes and the HDFS {{FileSystem}} interface. However, later on this issue was worked around with NFS, and this ticket wasn't resolved because we lacked the capacity to work on it. [~rkannan82] Thanks for looking into this! Several months ago, I implemented a prototype by simply replacing Java NIO file system operations with the corresponding HDFS {{FileSystem}} versions. According to a prior benchmark done with {{spark-perf}}, this introduces a ~15% performance penalty for shuffling. Thus we had planned to write a specialized {{FileSystem}} implementation that simply wraps normal Java NIO operations to avoid the performance penalty as much as possible, and then replace all local file system access with this specialized {{FileSystem}} implementation. > Support setting spark.local.dirs to a hadoop FileSystem > > > Key: SPARK-1529 > URL: https://issues.apache.org/jira/browse/SPARK-1529 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Cheng Lian > > In some environments, like with MapR, local volumes are accessed through the > Hadoop filesystem interface. We should allow setting spark.local.dir to a > Hadoop filesystem location. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
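To illustrate the wrapping idea in the comment above (not an actual Spark patch): Hadoop's {{RawLocalFileSystem}} already delegates to plain java.io underneath the {{FileSystem}} API, so one way to route local I/O through {{FileSystem}} with minimal overhead is a thin subclass registered under its own scheme. The class name and the "localfs" scheme below are made up for illustration.
{code}
import java.net.URI
import org.apache.hadoop.fs.RawLocalFileSystem

// Illustrative sketch only: expose local-disk I/O through the Hadoop
// FileSystem API while delegating to plain java.io underneath, which is
// what RawLocalFileSystem already does.
class ShuffleLocalFileSystem extends RawLocalFileSystem {
  override def getUri(): URI = URI.create("localfs:///")
}
{code}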
[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263539#comment-14263539 ] Alexander Bezzubov commented on SPARK-4923: --- Just FYI, same for us here https://github.com/NFLabs/zeppelin/issues/260 > Maven build should keep publishing spark-repl > - > > Key: SPARK-4923 > URL: https://issues.apache.org/jira/browse/SPARK-4923 > Project: Spark > Issue Type: Bug > Components: Build, Spark Shell >Affects Versions: 1.2.0 >Reporter: Peng Cheng >Priority: Critical > Labels: shell > Attachments: > SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Spark-repl installation and deployment has been discontinued (see > SPARK-3452). But its in the dependency list of a few projects that extends > its initialization process. > Please remove the 'skip' setting in spark-repl and make it an 'official' API > to encourage more platform to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263578#comment-14263578 ] Sean Owen commented on SPARK-1529: -- Hm, how do these APIs preclude the direct use of java.io? Is this actually disabled in MapR? If there is a workaround what is the remaining motivation? > Support setting spark.local.dirs to a hadoop FileSystem > > > Key: SPARK-1529 > URL: https://issues.apache.org/jira/browse/SPARK-1529 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Cheng Lian > > In some environments, like with MapR, local volumes are accessed through the > Hadoop filesystem interface. We should allow setting spark.local.dir to a > Hadoop filesystem location. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263458#comment-14263458 ] Chip Senkbeil edited comment on SPARK-4923 at 1/3/15 5:35 PM: -- FYI, this is a blocker for us as well: https://github.com/ibm-et/spark-kernel Specific issue: https://github.com/ibm-et/spark-kernel/issues/12 was (Author: senkwich): FYI, this is a blocker for us as well: https://github.com/ibm-et/spark-kernel > Maven build should keep publishing spark-repl > - > > Key: SPARK-4923 > URL: https://issues.apache.org/jira/browse/SPARK-4923 > Project: Spark > Issue Type: Bug > Components: Build, Spark Shell >Affects Versions: 1.2.0 >Reporter: Peng Cheng >Priority: Critical > Labels: shell > Attachments: > SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Spark-repl installation and deployment has been discontinued (see > SPARK-3452). But its in the dependency list of a few projects that extends > its initialization process. > Please remove the 'skip' setting in spark-repl and make it an 'official' API > to encourage more platform to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263458#comment-14263458 ] Chip Senkbeil edited comment on SPARK-4923 at 1/3/15 6:01 PM: -- FYI, this is a blocker for us as well: https://github.com/ibm-et/spark-kernel Specific issue: https://github.com/ibm-et/spark-kernel/issues/12 We are using quite a few different public methods from SparkIMain (such as pulling values back out of the interpreter), not just interpreter and bind. was (Author: senkwich): FYI, this is a blocker for us as well: https://github.com/ibm-et/spark-kernel Specific issue: https://github.com/ibm-et/spark-kernel/issues/12 > Maven build should keep publishing spark-repl > - > > Key: SPARK-4923 > URL: https://issues.apache.org/jira/browse/SPARK-4923 > Project: Spark > Issue Type: Bug > Components: Build, Spark Shell >Affects Versions: 1.2.0 >Reporter: Peng Cheng >Priority: Critical > Labels: shell > Attachments: > SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Spark-repl installation and deployment has been discontinued (see > SPARK-3452). But its in the dependency list of a few projects that extends > its initialization process. > Please remove the 'skip' setting in spark-repl and make it an 'official' API > to encourage more platform to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263458#comment-14263458 ] Chip Senkbeil edited comment on SPARK-4923 at 1/3/15 6:02 PM: -- FYI, this is a blocker for us as well: https://github.com/ibm-et/spark-kernel Specific issue: https://github.com/ibm-et/spark-kernel/issues/12 We are using quite a few different public methods from SparkIMain (such as pulling values back out of the interpreter), not just interpret and bind. was (Author: senkwich): FYI, this is a blocker for us as well: https://github.com/ibm-et/spark-kernel Specific issue: https://github.com/ibm-et/spark-kernel/issues/12 We are using quite a few different public methods from SparkIMain (such as pulling values back out of the interpreter), not just interpreter and bind. > Maven build should keep publishing spark-repl > - > > Key: SPARK-4923 > URL: https://issues.apache.org/jira/browse/SPARK-4923 > Project: Spark > Issue Type: Bug > Components: Build, Spark Shell >Affects Versions: 1.2.0 >Reporter: Peng Cheng >Priority: Critical > Labels: shell > Attachments: > SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Spark-repl installation and deployment has been discontinued (see > SPARK-3452). But its in the dependency list of a few projects that extends > its initialization process. > Please remove the 'skip' setting in spark-repl and make it an 'official' API > to encourage more platform to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263458#comment-14263458 ] Chip Senkbeil edited comment on SPARK-4923 at 1/3/15 6:09 PM: -- FYI, this is a blocker for us as well: https://github.com/ibm-et/spark-kernel Specific issue: https://github.com/ibm-et/spark-kernel/issues/12 We are using quite a few different public methods from SparkIMain (such as valueOfTerm to pull out variables from the interpreter), not just interpret and bind. The API markings suggested by [~peng] would not be enough for us, [~pwendell]. was (Author: senkwich): FYI, this is a blocker for us as well: https://github.com/ibm-et/spark-kernel Specific issue: https://github.com/ibm-et/spark-kernel/issues/12 We are using quite a few different public methods from SparkIMain (such as pulling values back out of the interpreter), not just interpret and bind. > Maven build should keep publishing spark-repl > - > > Key: SPARK-4923 > URL: https://issues.apache.org/jira/browse/SPARK-4923 > Project: Spark > Issue Type: Bug > Components: Build, Spark Shell >Affects Versions: 1.2.0 >Reporter: Peng Cheng >Priority: Critical > Labels: shell > Attachments: > SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Spark-repl installation and deployment has been discontinued (see > SPARK-3452). But its in the dependency list of a few projects that extends > its initialization process. > Please remove the 'skip' setting in spark-repl and make it an 'official' API > to encourage more platform to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
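As a rough illustration of the kind of embedding described in the comment above (construction and configuration of the interpreter instance is omitted and assumed; interpret, bind and valueOfTerm are the methods the comment names):
{code}
import org.apache.spark.repl.SparkIMain

// Sketch only: run a snippet through an already-configured SparkIMain and
// pull the resulting value back out of the interpreter.
def roundTrip(intp: SparkIMain): Option[AnyRef] = {
  intp.interpret("val answer = 21 * 2")   // evaluate user code
  intp.valueOfTerm("answer")              // retrieve the bound value
}
{code}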
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263600#comment-14263600 ] Kannan Rajah commented on SPARK-1529: - [~lian cheng] Can you upload this prototype patch so that I can reuse it? What branch was it based off? When I start making new changes, I suppose I can do it against master branch, right? > Support setting spark.local.dirs to a hadoop FileSystem > > > Key: SPARK-1529 > URL: https://issues.apache.org/jira/browse/SPARK-1529 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Cheng Lian > > In some environments, like with MapR, local volumes are accessed through the > Hadoop filesystem interface. We should allow setting spark.local.dir to a > Hadoop filesystem location. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263603#comment-14263603 ] Kannan Rajah commented on SPARK-1529: - It cannot preclude the use of java.io completely. If there are java.io. APIs that are needed for some use case, then you cannot use the HDFS API. But that is the case normally. The NFS mount based workaround is not as efficient as accessing it through the HDFS interface. Hence the need. > Support setting spark.local.dirs to a hadoop FileSystem > > > Key: SPARK-1529 > URL: https://issues.apache.org/jira/browse/SPARK-1529 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Cheng Lian > > In some environments, like with MapR, local volumes are accessed through the > Hadoop filesystem interface. We should allow setting spark.local.dir to a > Hadoop filesystem location. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5052) com.google.common.base.Optional binary has a wrong method signatures
[ https://issues.apache.org/jira/browse/SPARK-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263617#comment-14263617 ] Elmer Garduno commented on SPARK-5052: -- The problem seems to be fixed when using spark-submit instead of spark-class. > com.google.common.base.Optional binary has a wrong method signatures > > > Key: SPARK-5052 > URL: https://issues.apache.org/jira/browse/SPARK-5052 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Elmer Garduno > > PR https://github.com/apache/spark/pull/1813 shaded Guava jar file and moved > Guava classes to package org.spark-project.guava when Spark is built by Maven. > When a user jar uses the actual com.google.common.base.Optional > transform(com.google.common.base.Function); method from Guava, a > java.lang.NoSuchMethodError: > com.google.common.base.Optional.transform(Lcom/google/common/base/Function;)Lcom/google/common/base/Optional; > is thrown. > The reason seems to be that the Optional class included on > spark-assembly-1.2.0-hadoop1.0.4.jar has an incorrect method signature that > includes the shaded class as an argument: > Expected: > javap -classpath > target/scala-2.10/googlegenomics-spark-examples-assembly-1.0.jar > com.google.common.base.Optional > public abstract > com.google.common.base.Optional > transform(com.google.common.base.Function); > Found: > javap -classpath lib/spark-assembly-1.2.0-hadoop1.0.4.jar > com.google.common.base.Optional > public abstract > com.google.common.base.Optional > transform(org.spark-project.guava.common.base.Function); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5052) com.google.common.base.Optional binary has a wrong method signatures
[ https://issues.apache.org/jira/browse/SPARK-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Elmer Garduno resolved SPARK-5052. -- Resolution: Not a Problem > com.google.common.base.Optional binary has a wrong method signatures > > > Key: SPARK-5052 > URL: https://issues.apache.org/jira/browse/SPARK-5052 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Elmer Garduno > > PR https://github.com/apache/spark/pull/1813 shaded Guava jar file and moved > Guava classes to package org.spark-project.guava when Spark is built by Maven. > When a user jar uses the actual com.google.common.base.Optional > transform(com.google.common.base.Function); method from Guava, a > java.lang.NoSuchMethodError: > com.google.common.base.Optional.transform(Lcom/google/common/base/Function;)Lcom/google/common/base/Optional; > is thrown. > The reason seems to be that the Optional class included on > spark-assembly-1.2.0-hadoop1.0.4.jar has an incorrect method signature that > includes the shaded class as an argument: > Expected: > javap -classpath > target/scala-2.10/googlegenomics-spark-examples-assembly-1.0.jar > com.google.common.base.Optional > public abstract > com.google.common.base.Optional > transform(com.google.common.base.Function); > Found: > javap -classpath lib/spark-assembly-1.2.0-hadoop1.0.4.jar > com.google.common.base.Optional > public abstract > com.google.common.base.Optional > transform(org.spark-project.guava.common.base.Function); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5059) list of user's objects in Spark REPL
[ https://issues.apache.org/jira/browse/SPARK-5059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263620#comment-14263620 ] Tomas Hudik commented on SPARK-5059: thanks Sean. Actually, after some googling I found that the Scala REPL (so the Spark shell can do it as well) can list all objects: {quote} scala> val b = "hello" b: String = hello scala> <Tab> {quote} (after pressing the Tab key you will get all possible objects (those defined by a user as well)). I'm closing this issue. > list of user's objects in Spark REPL > > > Key: SPARK-5059 > URL: https://issues.apache.org/jira/browse/SPARK-5059 > Project: Spark > Issue Type: New Feature > Components: Spark Shell >Reporter: Tomas Hudik >Priority: Minor > Labels: spark-shell > > Often users do not remember all the objects they have created in the Spark REPL (shell). > It would be helpful to have a command that would list all such objects. E.g. > R uses *ls()* to list all objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5059) list of user's objects in Spark REPL
[ https://issues.apache.org/jira/browse/SPARK-5059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomas Hudik closed SPARK-5059. -- Resolution: Not a Problem If you press the <Tab> key in the Spark shell (Scala REPL) you'll get a list of all objects. > list of user's objects in Spark REPL > > > Key: SPARK-5059 > URL: https://issues.apache.org/jira/browse/SPARK-5059 > Project: Spark > Issue Type: New Feature > Components: Spark Shell >Reporter: Tomas Hudik >Priority: Minor > Labels: spark-shell > > Often users do not remember all the objects they have created in the Spark REPL (shell). > It would be helpful to have a command that would list all such objects. E.g. > R uses *ls()* to list all objects. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-609) Add instructions for enabling Akka debug logging
[ https://issues.apache.org/jira/browse/SPARK-609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-609. -- Resolution: Incomplete I'm going to close out this issue for now, since I think it's no longer an issue in recent versions of Spark. Please comment / reopen if you think there's still something that needs fixing. > Add instructions for enabling Akka debug logging > > > Key: SPARK-609 > URL: https://issues.apache.org/jira/browse/SPARK-609 > Project: Spark > Issue Type: New Feature > Components: Documentation >Reporter: Josh Rosen >Priority: Minor > > How can I enable Akka debug logging in Spark? I tried setting > {{akka.loglevel = "DEBUG"}} in the configuration in {{AkkaUtils}}, and I also > tried setting properties in a {{log4j.conf}} file, but neither approach > worked. It might be helpful to have instructions for this in a "Spark > Internals Debugging" guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-988) Write PySpark profiling guide
[ https://issues.apache.org/jira/browse/SPARK-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-988. -- Resolution: Fixed We added distributed Python profiling support in 1.2 (see SPARK-3478). > Write PySpark profiling guide > - > > Key: SPARK-988 > URL: https://issues.apache.org/jira/browse/SPARK-988 > Project: Spark > Issue Type: New Feature > Components: Documentation, PySpark >Reporter: Josh Rosen > > Write a guide on profiling PySpark applications. I've done this in the past > by modifying the workers to make cProfile dumps, then using various tools to > collect and merge those dumps into an overall performance profile. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2084) Mention SPARK_JAR in env var section on configuration page
[ https://issues.apache.org/jira/browse/SPARK-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-2084. --- Resolution: Won't Fix I'm resolving this as "Won't Fix" since the SPARK_JAR environment variable was deprecated in favor of a SparkConf property (this was done as part of the patch for SPARK-1395). > Mention SPARK_JAR in env var section on configuration page > -- > > Key: SPARK-2084 > URL: https://issues.apache.org/jira/browse/SPARK-2084 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.0.0 >Reporter: Sandy Ryza > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-5023) In Web UI job history, the total job duration is incorrect (much smaller than the sum of its stages)
[ https://issues.apache.org/jira/browse/SPARK-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263664#comment-14263664 ] Josh Rosen edited comment on SPARK-5023 at 1/3/15 10:57 PM: This sounds superficially similar to SPARK-4836; were there any failed stages in your logs? was (Author: joshrosen): This sounds superficially similar to SPARK-4836; were there any failed stages in your los? > In Web UI job history, the total job duration is incorrect (much smaller than > the sum of its stages) > > > Key: SPARK-5023 > URL: https://issues.apache.org/jira/browse/SPARK-5023 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.1.1, 1.2.0 > Environment: Amazon EC2 AMI r3.2xlarge, cluster of 20 to 50 nodes, > running the ec2 provided scripts to create. >Reporter: Eran Medan > > I'm running a long process using Spark + Graph and things look good on the > 4040 job status UI, but when the job is done, when going to the history then > the job total duration is much, much smaller than the total of its stages. > The way I set logs up is this: > val homeDir = sys.props("user.home") > val logsPath = new File(homeDir,"sparkEventLogs") > val conf = new SparkConf().setAppName("...") > conf.set("spark.eventLog.enabled", "true").set("spark.eventLog.dir", > logsPath.getCanonicalPath) > for example job ID X - duration 0.2 s, but when I click the job and look at > its stages, the sum of their duration is more than 15 minutes! > (before the job was over, in the 4040 job status, the job duration was > correct, it is only incorrect when its done and going to the logs) > I hope I didn't configure something because I was very surprised no one > reported it yet (I searched, but perhaps I missed it) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5023) In Web UI job history, the total job duration is incorrect (much smaller than the sum of its stages)
[ https://issues.apache.org/jira/browse/SPARK-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263664#comment-14263664 ] Josh Rosen commented on SPARK-5023: --- This sounds superficially similar to SPARK-4836; were there any failed stages in your los? > In Web UI job history, the total job duration is incorrect (much smaller than > the sum of its stages) > > > Key: SPARK-5023 > URL: https://issues.apache.org/jira/browse/SPARK-5023 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.1.1, 1.2.0 > Environment: Amazon EC2 AMI r3.2xlarge, cluster of 20 to 50 nodes, > running the ec2 provided scripts to create. >Reporter: Eran Medan > > I'm running a long process using Spark + Graph and things look good on the > 4040 job status UI, but when the job is done, when going to the history then > the job total duration is much, much smaller than the total of its stages. > The way I set logs up is this: > val homeDir = sys.props("user.home") > val logsPath = new File(homeDir,"sparkEventLogs") > val conf = new SparkConf().setAppName("...") > conf.set("spark.eventLog.enabled", "true").set("spark.eventLog.dir", > logsPath.getCanonicalPath) > for example job ID X - duration 0.2 s, but when I click the job and look at > its stages, the sum of their duration is more than 15 minutes! > (before the job was over, in the 4040 job status, the job duration was > correct, it is only incorrect when its done and going to the logs) > I hope I didn't configure something because I was very surprised no one > reported it yet (I searched, but perhaps I missed it) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl
[ https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263681#comment-14263681 ] Peng Cheng commented on SPARK-4923: --- You are right; in fact, 'Dev's API' simply means the method is susceptible to changes without deprecation or notice, which the three main markings are least likely to undergo. Could you please edit the patch and add more markings? > Maven build should keep publishing spark-repl > - > > Key: SPARK-4923 > URL: https://issues.apache.org/jira/browse/SPARK-4923 > Project: Spark > Issue Type: Bug > Components: Build, Spark Shell >Affects Versions: 1.2.0 >Reporter: Peng Cheng >Priority: Critical > Labels: shell > Attachments: > SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch > > Original Estimate: 1h > Remaining Estimate: 1h > > Spark-repl installation and deployment has been discontinued (see > SPARK-3452). But its in the dependency list of a few projects that extends > its initialization process. > Please remove the 'skip' setting in spark-repl and make it an 'official' API > to encourage more platform to integrate with it. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5001) BlockRDD removed unreasonably in streaming
[ https://issues.apache.org/jira/browse/SPARK-5001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263694#comment-14263694 ] Saisai Shao commented on SPARK-5001: Hi [~hanhg], I don't think this is a problem in Spark Streaming. From my understanding, this exception is mostly caused by an unstable running status, e.g. the processing delay being larger than the batch interval can tolerate. IMO you should tune your application rather than modify Spark Streaming in this way. From my understanding, the patch you submitted is not a proper way to solve the problem; it would break the internal logic of Spark Streaming. > BlockRDD removed unreasonably in streaming > --- > > Key: SPARK-5001 > URL: https://issues.apache.org/jira/browse/SPARK-5001 > Project: Spark > Issue Type: Bug >Affects Versions: 1.0.2, 1.1.1, 1.2.0 >Reporter: hanhonggen > Attachments: > fix_bug_BlockRDD_removed_not_reasonablly_in_streaming.patch > > > I've counted messages using kafkainputstream of spark-1.1.1. The test app > failed when the latter batch job completed sooner than the previous. In the > source code, BlockRDDs older than (time-rememberDuration) will be removed in > cleanMetaData after one job completed. And the previous job will abort due to > block not found. The relevant logs are as follows: > 2014-12-25 > 14:07:12(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO > :Starting job streaming job 1419487632000 ms.0 from job set of time > 1419487632000 ms > 2014-12-25 > 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO > :Starting job streaming job 1419487635000 ms.0 from job set of time > 1419487635000 ms > 2014-12-25 > 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-15] INFO > :Finished job streaming job 1419487635000 ms.0 from job set of time > 1419487635000 ms > 2014-12-25 > 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-16] INFO > :Removing blocks of RDD BlockRDD[3028] at createStream at TestKafka.java:144 > of time 1419487635000 ms from DStream clearMetadata > java.lang.Exception: Could not compute split, block input-0-1419487631400 not > found for 3028 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
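In practice, the tuning suggested above mostly means choosing a batch interval that the per-batch processing time can stay under, so completed batches are cleaned up before their blocks are needed again. A minimal sketch (the application name and the 10-second interval are arbitrary assumptions, not recommendations):
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch: pick a batch interval the processing time stays below.
val conf = new SparkConf().setAppName("tuned-streaming-app")
val ssc = new StreamingContext(conf, Seconds(10))
// ... define the Kafka input stream and the counting logic here ...
ssc.start()
ssc.awaitTermination()
{code}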
[jira] [Reopened] (SPARK-5052) com.google.common.base.Optional binary has a wrong method signatures
[ https://issues.apache.org/jira/browse/SPARK-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Elmer Garduno reopened SPARK-5052: -- False alarm, still getting the same errors when running on the standalone cluster. > com.google.common.base.Optional binary has a wrong method signatures > > > Key: SPARK-5052 > URL: https://issues.apache.org/jira/browse/SPARK-5052 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Elmer Garduno > > PR https://github.com/apache/spark/pull/1813 shaded Guava jar file and moved > Guava classes to package org.spark-project.guava when Spark is built by Maven. > When a user jar uses the actual com.google.common.base.Optional > transform(com.google.common.base.Function); method from Guava, a > java.lang.NoSuchMethodError: > com.google.common.base.Optional.transform(Lcom/google/common/base/Function;)Lcom/google/common/base/Optional; > is thrown. > The reason seems to be that the Optional class included on > spark-assembly-1.2.0-hadoop1.0.4.jar has an incorrect method signature that > includes the shaded class as an argument: > Expected: > javap -classpath > target/scala-2.10/googlegenomics-spark-examples-assembly-1.0.jar > com.google.common.base.Optional > public abstract > com.google.common.base.Optional > transform(com.google.common.base.Function); > Found: > javap -classpath lib/spark-assembly-1.2.0-hadoop1.0.4.jar > com.google.common.base.Optional > public abstract > com.google.common.base.Optional > transform(org.spark-project.guava.common.base.Function); -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
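A workaround that is sometimes applied to this class of conflict is to relocate Guava inside the user's own assembly jar, so user code no longer resolves against the copy bundled in the Spark assembly. The sketch below assumes an sbt-assembly version with shading support; the target package name is arbitrary:
{code}
// build.sbt sketch (assumes sbt-assembly with shading support); the
// "shadedguava" package prefix is an arbitrary choice.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.google.common.**" -> "shadedguava.@1").inAll
)
{code}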
[jira] [Commented] (SPARK-4986) Graceful shutdown for Spark Streaming does not work in Standalone cluster mode
[ https://issues.apache.org/jira/browse/SPARK-4986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263697#comment-14263697 ] Apache Spark commented on SPARK-4986: - User 'cleaton' has created a pull request for this issue: https://github.com/apache/spark/pull/3868 > Graceful shutdown for Spark Streaming does not work in Standalone cluster mode > -- > > Key: SPARK-4986 > URL: https://issues.apache.org/jira/browse/SPARK-4986 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.2.0 >Reporter: Jesper Lundgren > > When using the graceful stop API of Spark Streaming in Spark Standalone > cluster the stop signal never reaches the receivers. I have tested this with > Spark 1.2 and Kafka receivers. > ReceiverTracker will send StopReceiver message to ReceiverSupervisorImpl. > In local mode ReceiverSupervisorImpl receives this message but in Standalone > cluster mode the message seems to be lost. > (I have modified the code to send my own string message as a stop signal from > ReceiverTracker to ReceiverSupervisorImpl and it works as a workaround.) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5064) GraphX rmatGraph hangs
Michael Malak created SPARK-5064: Summary: GraphX rmatGraph hangs Key: SPARK-5064 URL: https://issues.apache.org/jira/browse/SPARK-5064 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 1.2.0 Environment: CentOS 7 REPL (no HDFS). Also tried Cloudera 5.2.0 QuickStart standalone compiled Scala with spark-submit. Reporter: Michael Malak org.apache.spark.graphx.util.GraphGenerators.rmatGraph(sc, 4, 8) It just outputs "0 edges" and then locks up. A spark-user message reports similar behavior: http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3c1408617621830-12570.p...@n3.nabble.com%3E -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5001) BlockRDD removed unreasonably in streaming
[ https://issues.apache.org/jira/browse/SPARK-5001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263701#comment-14263701 ] hanhonggen commented on SPARK-5001: --- I think it's impossible to make sure that all jobs generated in Spark Streaming will finish in order. My patch may not be the proper way, but the current logic of Spark Streaming is too coarse. > BlockRDD removed unreasonably in streaming > --- > > Key: SPARK-5001 > URL: https://issues.apache.org/jira/browse/SPARK-5001 > Project: Spark > Issue Type: Bug >Affects Versions: 1.0.2, 1.1.1, 1.2.0 >Reporter: hanhonggen > Attachments: > fix_bug_BlockRDD_removed_not_reasonablly_in_streaming.patch > > > I've counted messages using kafkainputstream of spark-1.1.1. The test app > failed when the latter batch job completed sooner than the previous. In the > source code, BlockRDDs older than (time-rememberDuration) will be removed in > cleanMetaData after one job completed. And the previous job will abort due to > block not found. The relevant logs are as follows: > 2014-12-25 > 14:07:12(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO > :Starting job streaming job 1419487632000 ms.0 from job set of time > 1419487632000 ms > 2014-12-25 > 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO > :Starting job streaming job 1419487635000 ms.0 from job set of time > 1419487635000 ms > 2014-12-25 > 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-15] INFO > :Finished job streaming job 1419487635000 ms.0 from job set of time > 1419487635000 ms > 2014-12-25 > 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-16] INFO > :Removing blocks of RDD BlockRDD[3028] at createStream at TestKafka.java:144 > of time 1419487635000 ms from DStream clearMetadata > java.lang.Exception: Could not compute split, block input-0-1419487631400 not > found for 3028 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5001) BlockRDD removed unreasonably in streaming
[ https://issues.apache.org/jira/browse/SPARK-5001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263704#comment-14263704 ] Saisai Shao commented on SPARK-5001: Yeah, it's a restriction of Spark Streaming's design model; for some operators, like window operators, this restriction is meaningful. Also, there are some tuning guides that point out how to choose the batch interval to keep a Spark Streaming cluster in good status. I think it is easier to tune the app rather than change the logic of Spark Streaming, IIUC :). > BlockRDD removed unreasonably in streaming > --- > > Key: SPARK-5001 > URL: https://issues.apache.org/jira/browse/SPARK-5001 > Project: Spark > Issue Type: Bug >Affects Versions: 1.0.2, 1.1.1, 1.2.0 >Reporter: hanhonggen > Attachments: > fix_bug_BlockRDD_removed_not_reasonablly_in_streaming.patch > > > I've counted messages using kafkainputstream of spark-1.1.1. The test app > failed when the latter batch job completed sooner than the previous. In the > source code, BlockRDDs older than (time-rememberDuration) will be removed in > cleanMetaData after one job completed. And the previous job will abort due to > block not found. The relevant logs are as follows: > 2014-12-25 > 14:07:12(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO > :Starting job streaming job 1419487632000 ms.0 from job set of time > 1419487632000 ms > 2014-12-25 > 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO > :Starting job streaming job 1419487635000 ms.0 from job set of time > 1419487635000 ms > 2014-12-25 > 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-15] INFO > :Finished job streaming job 1419487635000 ms.0 from job set of time > 1419487635000 ms > 2014-12-25 > 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-16] INFO > :Removing blocks of RDD BlockRDD[3028] at createStream at TestKafka.java:144 > of time 1419487635000 ms from DStream clearMetadata > java.lang.Exception: Could not compute split, block input-0-1419487631400 not > found for 3028 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5065) BroadCast can still work after sc had been stopped.
SaintBacchus created SPARK-5065: --- Summary: BroadCast can still work after sc had been stopped. Key: SPARK-5065 URL: https://issues.apache.org/jira/browse/SPARK-5065 Project: Spark Issue Type: Bug Affects Versions: 1.2.1 Reporter: SaintBacchus Code as follow: {code:borderStyle=solid} val sc1 = new SparkContext val sc2 = new SparkContext sc1.broadcast(1) sc1.stop {code} It can work well, because sc1.broadcast will reuse the BlockManager in sc2. To fix it, throw a sparkException when broadCastManager had stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5065) BroadCast can still work after sc had been stopped.
[ https://issues.apache.org/jira/browse/SPARK-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SaintBacchus updated SPARK-5065: Description: Code as follow: {code:borderStyle=solid} val sc1 = new SparkContext val sc2 = new SparkContext sc1.stop sc1.broadcast(1) {code} It can work well, because sc1.broadcast will reuse the BlockManager in sc2. To fix it, throw a sparkException when broadCastManager had stopped. was: Code as follow: {code:borderStyle=solid} val sc1 = new SparkContext val sc2 = new SparkContext sc1.broadcast(1) sc1.stop {code} It can work well, because sc1.broadcast will reuse the BlockManager in sc2. To fix it, throw a sparkException when broadCastManager had stopped. > BroadCast can still work after sc had been stopped. > --- > > Key: SPARK-5065 > URL: https://issues.apache.org/jira/browse/SPARK-5065 > Project: Spark > Issue Type: Bug >Affects Versions: 1.2.1 >Reporter: SaintBacchus > > Code as follow: > {code:borderStyle=solid} > val sc1 = new SparkContext > val sc2 = new SparkContext > sc1.stop > sc1.broadcast(1) > {code} > It can work well, because sc1.broadcast will reuse the BlockManager in sc2. > To fix it, throw a sparkException when broadCastManager had stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
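A rough sketch of the guard proposed in the description above — rejecting new broadcasts once the manager has been stopped. The class shape and member names below are illustrative, not the actual Spark internals:
{code}
import org.apache.spark.SparkException

// Illustrative only: a stop flag checked before a new broadcast is created.
class BroadcastManagerSketch {
  @volatile private var stopped = false

  def stop(): Unit = { stopped = true }

  def newBroadcast[T](value: T): T = {
    if (stopped) {
      throw new SparkException("Cannot create a broadcast after the BroadcastManager has been stopped")
    }
    value // the real implementation would hand the value to the broadcast factory
  }
}
{code}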
[jira] [Commented] (SPARK-5065) BroadCast can still work after sc had been stopped.
[ https://issues.apache.org/jira/browse/SPARK-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263729#comment-14263729 ] Apache Spark commented on SPARK-5065: - User 'SaintBacchus' has created a pull request for this issue: https://github.com/apache/spark/pull/3885 > BroadCast can still work after sc had been stopped. > --- > > Key: SPARK-5065 > URL: https://issues.apache.org/jira/browse/SPARK-5065 > Project: Spark > Issue Type: Bug >Affects Versions: 1.2.1 >Reporter: SaintBacchus > > Code as follow: > {code:borderStyle=solid} > val sc1 = new SparkContext > val sc2 = new SparkContext > sc1.stop > sc1.broadcast(1) > {code} > It can work well, because sc1.broadcast will reuse the BlockManager in sc2. > To fix it, throw a sparkException when broadCastManager had stopped. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5066) Can not get all value that has the same key when reading key ordered from different Streaming.
DoingDone9 created SPARK-5066: - Summary: Can not get all value that has the same key when reading key ordered from different Streaming. Key: SPARK-5066 URL: https://issues.apache.org/jira/browse/SPARK-5066 Project: Spark Issue Type: Bug Reporter: DoingDone9 Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5066) Can not get all key when reading key ordered from different Streaming.
[ https://issues.apache.org/jira/browse/SPARK-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DoingDone9 updated SPARK-5066: -- Summary: Can not get all key when reading key ordered from different Streaming. (was: Can not get all value that has the same key when reading key ordered from different Streaming.) > Can not get all key when reading key ordered from different Streaming. > > > Key: SPARK-5066 > URL: https://issues.apache.org/jira/browse/SPARK-5066 > Project: Spark > Issue Type: Bug >Reporter: DoingDone9 >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5066) Can not get all key that has same hashcode when reading key ordered from different Streaming.
[ https://issues.apache.org/jira/browse/SPARK-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DoingDone9 updated SPARK-5066: -- Summary: Can not get all key that has same hashcode when reading key ordered from different Streaming. (was: Can not get all key when reading key ordered from different Streaming.) > Can not get all key that has same hashcode when reading key ordered from > different Streaming. > --- > > Key: SPARK-5066 > URL: https://issues.apache.org/jira/browse/SPARK-5066 > Project: Spark > Issue Type: Bug >Reporter: DoingDone9 >Priority: Critical > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5066) Can not get all key that has same hashcode when reading key ordered from different Streaming.
[ https://issues.apache.org/jira/browse/SPARK-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DoingDone9 updated SPARK-5066: -- Description: When spilling is enabled, data ordered by hashCode is spilled to disk. When merging values we need to read every key with the same hashCode from the different tmp files, but only the keys sharing the minimum hashCode of a single tmp file are read, so we cannot read all keys. Example: if file1 has [k1, k2, k3] and file2 has [k4, k5, k1], and hashcode of k4 < hashcode of k5 < hashcode of k1 < hashcode of k2 < hashcode of k3, then we only read k1 from file1 and k4 from file2, so not every occurrence of k1 is read. Code:
private val inputStreams = (Seq(sortedMap) ++ spilledMaps).map(it => it.buffered)
inputStreams.foreach { it =>
  val kcPairs = new ArrayBuffer[(K, C)]
  readNextHashCode(it, kcPairs)
  if (kcPairs.length > 0) {
    mergeHeap.enqueue(new StreamBuffer(it, kcPairs))
  }
}
private def readNextHashCode(it: BufferedIterator[(K, C)], buf: ArrayBuffer[(K, C)]): Unit = {
  if (it.hasNext) {
    var kc = it.next()
    buf += kc
    val minHash = hashKey(kc)
    while (it.hasNext && it.head._1.hashCode() == minHash) {
      kc = it.next()
      buf += kc
    }
  }
}
> Can not get all key that has same hashcode when reading key ordered from > different Streaming. > --- > > Key: SPARK-5066 > URL: https://issues.apache.org/jira/browse/SPARK-5066 > Project: Spark > Issue Type: Bug >Affects Versions: 1.2.0 >Reporter: DoingDone9 >Priority: Critical > > When spilling is enabled, data ordered by hashCode is spilled to disk. When merging values we need > to read every key with the same hashCode from the different tmp files, but only the keys sharing the > minimum hashCode of a single tmp file are read, so we cannot read all keys. > Example: > if file1 has [k1, k2, k3] and file2 has [k4, k5, k1], > and hashcode of k4 < hashcode of k5 < hashcode of k1 < hashcode of k2 < hashcode of k3, > then we only read k1 from file1 and k4 from file2, so not every occurrence of k1 is read. > Code: > private val inputStreams = (Seq(sortedMap) ++ spilledMaps).map(it => it.buffered) > inputStreams.foreach { it => > val kcPairs = new ArrayBuffer[(K, C)] > readNextHashCode(it, kcPairs) > if (kcPairs.length > 0) { > mergeHeap.enqueue(new StreamBuffer(it, kcPairs)) > } > } > private def readNextHashCode(it: BufferedIterator[(K, C)], buf: ArrayBuffer[(K, C)]): Unit = { > if (it.hasNext) { > var kc = it.next() > buf += kc > val minHash = hashKey(kc) > while (it.hasNext && it.head._1.hashCode() == minHash) { > kc = it.next() > buf += kc > } > } > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
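[Editor's note] To make the reported behaviour easier to reason about, here is a small self-contained model of the hash-ordered merge step. This is not Spark's ExternalAppendOnlyMap code; the Key class carries made-up explicit hash values so the ordering from the example above holds. A correct merge drains the current minimum hashCode from every stream in the same round, which is what should bring both occurrences of k1 together.
{code}
object MergeByHashCodeDemo {
  // Key with a controllable hash value, purely for demonstration.
  final case class Key(name: String, hash: Int) {
    override def hashCode: Int = hash
  }

  def main(args: Array[String]): Unit = {
    val k1 = Key("k1", 3); val k2 = Key("k2", 4); val k3 = Key("k3", 5)
    val k4 = Key("k4", 1); val k5 = Key("k5", 2)

    // Each "spill file" is already sorted by key hashCode, as in the report.
    val streams = Seq(
      Iterator(k1 -> 10, k2 -> 20, k3 -> 30).buffered,
      Iterator(k4 -> 40, k5 -> 50, k1 -> 60).buffered
    )

    // Each round: find the smallest head hashCode across all streams, then
    // drain every stream whose head carries that hashCode, so both k1 entries
    // (one per file) land in the same batch.
    while (streams.exists(_.hasNext)) {
      val minHash = streams.filter(_.hasNext).map(_.head._1.hashCode).min
      val batch = streams.flatMap { s =>
        val drained = scala.collection.mutable.ArrayBuffer.empty[(Key, Int)]
        while (s.hasNext && s.head._1.hashCode == minHash) drained += s.next()
        drained
      }
      println(s"hash $minHash -> " + batch.map { case (k, v) => s"${k.name}=$v" }.mkString(", "))
    }
  }
}
{code}
Running this prints one batch per hash code and shows "hash 3 -> k1=10, k1=60", i.e. both k1 values are collected together when every stream is drained for the minimum hash.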
[jira] [Updated] (SPARK-5066) Can not get all key that has same hashcode when reading key ordered from different Streaming.
[ https://issues.apache.org/jira/browse/SPARK-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DoingDone9 updated SPARK-5066: -- Affects Version/s: 1.2.0 > Can not get all key that has same hashcode when reading key ordered from > different Streaming. > --- > > Key: SPARK-5066 > URL: https://issues.apache.org/jira/browse/SPARK-5066 > Project: Spark > Issue Type: Bug >Affects Versions: 1.2.0 >Reporter: DoingDone9 >Priority: Critical > > When spilling is enabled, data ordered by hashCode is spilled to disk. When merging values we need > to read every key with the same hashCode from the different tmp files, but only the keys sharing the > minimum hashCode of a single tmp file are read, so we cannot read all keys. > Example: > if file1 has [k1, k2, k3] and file2 has [k4, k5, k1], > and hashcode of k4 < hashcode of k5 < hashcode of k1 < hashcode of k2 < hashcode of k3, > then we only read k1 from file1 and k4 from file2, so not every occurrence of k1 is read. > Code: > private val inputStreams = (Seq(sortedMap) ++ spilledMaps).map(it => it.buffered) > inputStreams.foreach { it => > val kcPairs = new ArrayBuffer[(K, C)] > readNextHashCode(it, kcPairs) > if (kcPairs.length > 0) { > mergeHeap.enqueue(new StreamBuffer(it, kcPairs)) > } > } > private def readNextHashCode(it: BufferedIterator[(K, C)], buf: ArrayBuffer[(K, C)]): Unit = { > if (it.hasNext) { > var kc = it.next() > buf += kc > val minHash = hashKey(kc) > while (it.hasNext && it.head._1.hashCode() == minHash) { > kc = it.next() > buf += kc > } > } > } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5067) testTaskInfo doesn't compare SparkListenerApplicationStart.appId
Shixiong Zhu created SPARK-5067: --- Summary: testTaskInfo doesn't compare SparkListenerApplicationStart.appId Key: SPARK-5067 URL: https://issues.apache.org/jira/browse/SPARK-5067 Project: Spark Issue Type: Test Reporter: Shixiong Zhu Priority: Minor In org.apache.spark.util.JsonProtocolSuite.testTaskInfo, it doesn't compare SparkListenerApplicationStart.appId when comparing two "SparkListenerApplicationStart"s -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
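[Editor's note] A rough illustration of the missing check, assuming the 1.2-era SparkListenerApplicationStart fields appName, appId, time and sparkUser; the actual helper in JsonProtocolSuite may be shaped differently, so treat this as a sketch rather than the suite's code.
{code}
import org.apache.spark.scheduler.SparkListenerApplicationStart

// Hypothetical comparison helper: the appId assertion is the one the ticket
// reports as missing from the suite.
def assertApplicationStartEquals(e1: SparkListenerApplicationStart,
                                 e2: SparkListenerApplicationStart): Unit = {
  assert(e1.appName == e2.appName)
  assert(e1.appId == e2.appId)
  assert(e1.time == e2.time)
  assert(e1.sparkUser == e2.sparkUser)
}
{code}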
[jira] [Resolved] (SPARK-5058) Typos and broken URL
[ https://issues.apache.org/jira/browse/SPARK-5058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-5058. -- Resolution: Fixed Fix Version/s: 1.3.0 > Typos and broken URL > > > Key: SPARK-5058 > URL: https://issues.apache.org/jira/browse/SPARK-5058 > Project: Spark > Issue Type: Documentation > Components: Streaming >Affects Versions: 1.2.0 >Reporter: AkhlD >Priority: Minor > Fix For: 1.3.0, 1.2.1 > > > Spark Streaming + Kafka Integration Guide has a broken Examples link. Also > project is spelled as projrect. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5067) testTaskInfo doesn't compare SparkListenerApplicationStart.appId
[ https://issues.apache.org/jira/browse/SPARK-5067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu updated SPARK-5067: Component/s: Spark Core > testTaskInfo doesn't compare SparkListenerApplicationStart.appId > > > Key: SPARK-5067 > URL: https://issues.apache.org/jira/browse/SPARK-5067 > Project: Spark > Issue Type: Test > Components: Spark Core >Reporter: Shixiong Zhu >Priority: Minor > > In org.apache.spark.util.JsonProtocolSuite.testTaskInfo, it doesn't compare > SparkListenerApplicationStart.appId when comparing two > "SparkListenerApplicationStart"s -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5067) testTaskInfo doesn't compare SparkListenerApplicationStart.appId
[ https://issues.apache.org/jira/browse/SPARK-5067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263748#comment-14263748 ] Apache Spark commented on SPARK-5067: - User 'zsxwing' has created a pull request for this issue: https://github.com/apache/spark/pull/3886 > testTaskInfo doesn't compare SparkListenerApplicationStart.appId > > > Key: SPARK-5067 > URL: https://issues.apache.org/jira/browse/SPARK-5067 > Project: Spark > Issue Type: Test > Components: Spark Core >Reporter: Shixiong Zhu >Priority: Minor > > In org.apache.spark.util.JsonProtocolSuite.testTaskInfo, it doesn't compare > SparkListenerApplicationStart.appId when comparing two > "SparkListenerApplicationStart"s -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3306) Addition of external resource dependency in executors
[ https://issues.apache.org/jira/browse/SPARK-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263761#comment-14263761 ] koert kuipers commented on SPARK-3306: -- I am also interested in resources that can be shared across tasks inside the executor, with a way to close/clean up those resources when the executor shuts down. > Addition of external resource dependency in executors > - > > Key: SPARK-3306 > URL: https://issues.apache.org/jira/browse/SPARK-3306 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Yan > > Currently, Spark executors only support static and read-only external > resources of side files and jar files. With emerging disparate data sources, > there is a need to support more versatile external resources, such as > connections to data sources, to facilitate efficient data access to those > sources. For one, the JDBCRDD, with some modifications, could benefit from > this feature by reusing JDBC connections established earlier in the same Spark > context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
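[Editor's note] A common workaround today, not a Spark API and not what this ticket proposes verbatim, is to hold the shared resource in a JVM singleton so that every task running in the same executor reuses it, and to register a shutdown hook for cleanup when the executor JVM exits. A minimal sketch, with Resource standing in for a real connection pool:
{code}
object ExecutorSharedResource {
  // Stand-in for a real resource such as a JDBC connection pool.
  final class Resource {
    def use(): Unit = println("using the shared resource")
    def close(): Unit = println("closing the shared resource")
  }

  // Initialised at most once per executor JVM, on first use by any task;
  // the shutdown hook runs when the executor process exits.
  lazy val resource: Resource = {
    val r = new Resource
    sys.addShutdownHook(r.close())
    r
  }
}

// Typical use from inside a task:
//   rdd.foreachPartition { _ => ExecutorSharedResource.resource.use() }
{code}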
[jira] [Created] (SPARK-5068) When the path not found in the hdfs,we can't get the result
jeanlyn created SPARK-5068: -- Summary: When the path not found in the hdfs,we can't get the result Key: SPARK-5068 URL: https://issues.apache.org/jira/browse/SPARK-5068 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0 Reporter: jeanlyn when the partion path was found in the metastore but not found in the hdfs,it will casue some problems as follow: ``` hive> show partitions partition_test; OK dt=1 dt=2 dt=3 dt=4 Time taken: 0.168 seconds, Fetched: 4 row(s) ``` ``` hive> dfs -ls /user/jeanlyn/warehouse/partition_test; Found 3 items drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 /user/jeanlyn/warehouse/partition_test/dt=1 drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 /user/jeanlyn/warehouse/partition_test/dt=3 drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 17:42 /user/jeanlyn/warehouse/partition_test/dt=4 ``` when i run the sq `select * from partition_test limit 10`l in **hive**,i got no problem,but when i run in spark-sql i get the error as follow: ``` Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://jeanlyn:9000/user/jeanlyn/warehouse/partition_test/dt=2 at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328) at org.apache.spark.rdd.RDD.collect(RDD.scala:780) at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444) at org.apache.spark.sql.hive.testpartition$.main(test.scala:23) at org.apache.spark.sql.hive.testpartition.main(test.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) ``` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
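[Editor's note] One way to guard against this, whether inside the planner or as a user-side pre-check, is to ask HDFS whether each partition location still exists before building the scan. A minimal sketch using the standard Hadoop FileSystem API; the partitionPaths parameter is a hypothetical stand-in for the locations listed by the metastore.
{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Keep only the partition locations that are actually present in HDFS,
// so missing directories such as dt=2 are skipped instead of failing the job.
def existingPartitionPaths(partitionPaths: Seq[String], conf: Configuration): Seq[String] = {
  partitionPaths.filter { location =>
    val path = new Path(location)
    path.getFileSystem(conf).exists(path)
  }
}
{code}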
[jira] [Updated] (SPARK-5068) When the path not found in the hdfs,we can't get the result
[ https://issues.apache.org/jira/browse/SPARK-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jeanlyn updated SPARK-5068: --- Description: when the partion path was found in the metastore but not found in the hdfs,it will casue some problems as follow: ``` hive> show partitions partition_test; OK dt=1 dt=2 dt=3 dt=4 Time taken: 0.168 seconds, Fetched: 4 row(s) ``` ``` hive> dfs -ls /user/jeanlyn/warehouse/partition_test; Found 3 items drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 /user/jeanlyn/warehouse/partition_test/dt=1 drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 /user/jeanlyn/warehouse/partition_test/dt=3 drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 17:42 /user/jeanlyn/warehouse/partition_test/dt=4 ``` when i run the sq `select * from partition_test limit 10` in **hive**,i got no problem,but when i run in spark-sql i get the error as follow: ``` Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://jeanlyn:9000/user/jeanlyn/warehouse/partition_test/dt=2 at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328) at org.apache.spark.rdd.RDD.collect(RDD.scala:780) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84) at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444) at 
org.apache.spark.sql.hive.testpartition$.main(test.scala:23) at org.apache.spark.sql.hive.testpartition.main(test.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) ``` was: when the partion path was found in the metastore but not found in the hdfs,it will casue some problems as follow: ``` hive> show partitions partition_test; OK dt=1 dt=2 dt=3 dt=4 Time taken: 0.168 seconds, Fetched: 4 row(s) ``` ``` hive> dfs -ls /user/jeanlyn/warehouse/partition_test; Found 3 items drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 /user/jeanlyn/warehouse/partition_test/dt=1 drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 /user/jeanlyn/warehouse/partition_test/dt=3 drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 17:42 /user/jeanlyn/warehouse/partition_test/dt=4 ``` when i run the sq `select * from partition_test limit 10`l in **hiv
[jira] [Updated] (SPARK-5068) When the path not found in the hdfs,we can't get the result
[ https://issues.apache.org/jira/browse/SPARK-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jeanlyn updated SPARK-5068: --- Description: When the partition path is found in the metastore but not found in HDFS, it will cause some problems, as follows: {noformat} hive> show partitions partition_test; OK dt=1 dt=2 dt=3 dt=4 Time taken: 0.168 seconds, Fetched: 4 row(s) {noformat} {noformat} hive> dfs -ls /user/jeanlyn/warehouse/partition_test; Found 3 items drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 /user/jeanlyn/warehouse/partition_test/dt=1 drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 /user/jeanlyn/warehouse/partition_test/dt=3 drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 17:42 /user/jeanlyn/warehouse/partition_test/dt=4 {noformat} When I run the sql {noformat} select * from partition_test limit 10 {noformat} in *hive*, there is no problem, but when I run it in *spark-sql* I get the following error: {noformat} Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs://jeanlyn:9000/user/jeanlyn/warehouse/partition_test/dt=2 at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251) at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:318) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205) at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203) at scala.Option.getOrElse(Option.scala:120) at org.apache.spark.rdd.RDD.partitions(RDD.scala:203) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328) at org.apache.spark.rdd.RDD.collect(RDD.scala:780) at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84) at
org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444) at org.apache.spark.sql.hive.testpartition$.main(test.scala:23) at org.apache.spark.sql.hive.testpartition.main(test.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134) {noformat} was: when the partion path was found in the metastore but not found in the hdfs,it will casue some problems as follow: ``` hive> show partitions partition_test; OK dt=1 dt=2 dt=3 dt=4 Time taken: 0.168 seconds, Fetched: 4 row(s) ``` ``` hive> dfs -ls /user/jeanlyn/warehouse/partition_test; Found 3 items drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 /user/jeanlyn/warehouse/partition_test/dt=1 drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 16:29 /user/jeanlyn/warehouse/partition_test/dt=3 drwxr-xr-x - jeanlyn supergroup 0 2014-12-02 17:42 /user/jeanlyn/warehouse/partition_test/dt=4 ``` when