[jira] [Commented] (SPARK-5059) list of user's objects in Spark REPL

2015-01-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263483#comment-14263483
 ] 

Sean Owen commented on SPARK-5059:
--

It makes some sense although the Spark shell is really just a clone of the 
Scala shell. Maybe it is best to contribute your patch there first.

> list of user's objects in Spark REPL
> 
>
> Key: SPARK-5059
> URL: https://issues.apache.org/jira/browse/SPARK-5059
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Shell
>Reporter: Tomas Hudik
>Priority: Minor
>  Labels: spark-shell
>
> Often users do not remember all the objects they have created in the Spark REPL 
> (shell). It would be helpful to have a command that lists all such objects; e.g. 
> R uses *ls()* to list all objects.






[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-01-03 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263516#comment-14263516
 ] 

Cheng Lian commented on SPARK-1529:
---

Hi [~srowen], first of all, we are not trying to put shuffle and temp files in 
HDFS. At the time this ticket was created, the initial motivation was to 
support MapR, because MapR only exposes the local file system via a MapR volume 
and the HDFS {{FileSystem}} interface. However, this issue was later worked 
around with NFS, and the ticket wasn't resolved because we lacked the capacity 
to work on it.

[~rkannan82] Thanks for looking into this! Several months ago I implemented a 
prototype by simply replacing Java NIO file system operations with the 
corresponding HDFS {{FileSystem}} versions. According to a prior benchmark 
done with {{spark-perf}}, this introduces a ~15% performance penalty for 
shuffling. Thus we had planned to write a specialized {{FileSystem}} 
implementation that simply wraps normal Java NIO operations, to avoid the 
performance penalty as much as possible, and then replace all local file system 
access with this specialized {{FileSystem}} implementation.
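
A minimal sketch of that wrapper idea, assuming Hadoop's {{RawLocalFileSystem}} as the 
base; the class name, scheme, and paths below are illustrative only:
{code}
import java.net.URI

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path, RawLocalFileSystem}

// A FileSystem that is just the local file system exposed under its own scheme,
// so local shuffle/temp paths can go through the FileSystem API without a real DFS.
class NioLocalFileSystem extends RawLocalFileSystem {
  override def getUri: URI = URI.create("niolocal:///")
}

object NioLocalFileSystemDemo {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Register the wrapper under the (made-up) "niolocal" scheme.
    conf.set("fs.niolocal.impl", classOf[NioLocalFileSystem].getName)

    val fs = FileSystem.get(URI.create("niolocal:///"), conf)
    val dir = new Path("/tmp/spark-local-dir-demo")
    fs.mkdirs(dir)            // delegates to plain local file operations
    println(fs.exists(dir))   // true
  }
}
{code}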

> Support setting spark.local.dirs to a hadoop FileSystem 
> 
>
> Key: SPARK-1529
> URL: https://issues.apache.org/jira/browse/SPARK-1529
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Cheng Lian
>
> In some environments, like with MapR, local volumes are accessed through the 
> Hadoop filesystem interface. We should allow setting spark.local.dir to a 
> Hadoop filesystem location. 






[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl

2015-01-03 Thread Alexander Bezzubov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263539#comment-14263539
 ] 

Alexander Bezzubov commented on SPARK-4923:
---

Just FYI, same for us here https://github.com/NFLabs/zeppelin/issues/260

> Maven build should keep publishing spark-repl
> -
>
> Key: SPARK-4923
> URL: https://issues.apache.org/jira/browse/SPARK-4923
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Affects Versions: 1.2.0
>Reporter: Peng Cheng
>Priority: Critical
>  Labels: shell
> Attachments: 
> SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark-repl installation and deployment has been discontinued (see 
> SPARK-3452), but it is in the dependency list of a few projects that extend 
> its initialization process.
> Please remove the 'skip' setting in spark-repl and make it an 'official' API 
> to encourage more platforms to integrate with it.






[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-01-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263578#comment-14263578
 ] 

Sean Owen commented on SPARK-1529:
--

Hm, how do these APIs preclude the direct use of java.io? Is this actually 
disabled in MapR? If there is a workaround, what is the remaining motivation?

> Support setting spark.local.dirs to a hadoop FileSystem 
> 
>
> Key: SPARK-1529
> URL: https://issues.apache.org/jira/browse/SPARK-1529
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Cheng Lian
>
> In some environments, like with MapR, local volumes are accessed through the 
> Hadoop filesystem interface. We should allow setting spark.local.dir to a 
> Hadoop filesystem location. 






[jira] [Comment Edited] (SPARK-4923) Maven build should keep publishing spark-repl

2015-01-03 Thread Chip Senkbeil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263458#comment-14263458
 ] 

Chip Senkbeil edited comment on SPARK-4923 at 1/3/15 5:35 PM:
--

FYI, this is a blocker for us as well: https://github.com/ibm-et/spark-kernel

Specific issue: https://github.com/ibm-et/spark-kernel/issues/12


was (Author: senkwich):
FYI, this is a blocker for us as well: https://github.com/ibm-et/spark-kernel

> Maven build should keep publishing spark-repl
> -
>
> Key: SPARK-4923
> URL: https://issues.apache.org/jira/browse/SPARK-4923
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Affects Versions: 1.2.0
>Reporter: Peng Cheng
>Priority: Critical
>  Labels: shell
> Attachments: 
> SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark-repl installation and deployment has been discontinued (see 
> SPARK-3452), but it is in the dependency list of a few projects that extend 
> its initialization process.
> Please remove the 'skip' setting in spark-repl and make it an 'official' API 
> to encourage more platforms to integrate with it.






[jira] [Comment Edited] (SPARK-4923) Maven build should keep publishing spark-repl

2015-01-03 Thread Chip Senkbeil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263458#comment-14263458
 ] 

Chip Senkbeil edited comment on SPARK-4923 at 1/3/15 6:01 PM:
--

FYI, this is a blocker for us as well: https://github.com/ibm-et/spark-kernel

Specific issue: https://github.com/ibm-et/spark-kernel/issues/12

We are using quite a few different public methods from SparkIMain (such as 
pulling values back out of the interpreter), not just interpreter and bind.


was (Author: senkwich):
FYI, this is a blocker for us as well: https://github.com/ibm-et/spark-kernel

Specific issue: https://github.com/ibm-et/spark-kernel/issues/12

> Maven build should keep publishing spark-repl
> -
>
> Key: SPARK-4923
> URL: https://issues.apache.org/jira/browse/SPARK-4923
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Affects Versions: 1.2.0
>Reporter: Peng Cheng
>Priority: Critical
>  Labels: shell
> Attachments: 
> SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark-repl installation and deployment has been discontinued (see 
> SPARK-3452), but it is in the dependency list of a few projects that extend 
> its initialization process.
> Please remove the 'skip' setting in spark-repl and make it an 'official' API 
> to encourage more platforms to integrate with it.






[jira] [Comment Edited] (SPARK-4923) Maven build should keep publishing spark-repl

2015-01-03 Thread Chip Senkbeil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263458#comment-14263458
 ] 

Chip Senkbeil edited comment on SPARK-4923 at 1/3/15 6:02 PM:
--

FYI, this is a blocker for us as well: https://github.com/ibm-et/spark-kernel

Specific issue: https://github.com/ibm-et/spark-kernel/issues/12

We are using quite a few different public methods from SparkIMain (such as 
pulling values back out of the interpreter), not just interpret and bind.


was (Author: senkwich):
FYI, this is a blocker for us as well: https://github.com/ibm-et/spark-kernel

Specific issue: https://github.com/ibm-et/spark-kernel/issues/12

We are using quite a few different public methods from SparkIMain (such as 
pulling values back out of the interpreter), not just interpreter and bind.

> Maven build should keep publishing spark-repl
> -
>
> Key: SPARK-4923
> URL: https://issues.apache.org/jira/browse/SPARK-4923
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Affects Versions: 1.2.0
>Reporter: Peng Cheng
>Priority: Critical
>  Labels: shell
> Attachments: 
> SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark-repl installation and deployment has been discontinued (see 
> SPARK-3452), but it is in the dependency list of a few projects that extend 
> its initialization process.
> Please remove the 'skip' setting in spark-repl and make it an 'official' API 
> to encourage more platforms to integrate with it.






[jira] [Comment Edited] (SPARK-4923) Maven build should keep publishing spark-repl

2015-01-03 Thread Chip Senkbeil (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263458#comment-14263458
 ] 

Chip Senkbeil edited comment on SPARK-4923 at 1/3/15 6:09 PM:
--

FYI, this is a blocker for us as well: https://github.com/ibm-et/spark-kernel

Specific issue: https://github.com/ibm-et/spark-kernel/issues/12

We are using quite a few different public methods from SparkIMain (such as 
valueOfTerm to pull out variables from the interpreter), not just interpret and 
bind. The API markings suggested by [~peng] would not be enough for us, 
[~pwendell].
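
A minimal sketch of that kind of SparkIMain usage, assuming the constructor and 
method shapes inherited from Scala 2.10's IMain (exact signatures vary across 
Spark/Scala versions):
{code}
import scala.tools.nsc.Settings

import org.apache.spark.repl.SparkIMain

object EmbeddedReplSketch {
  def main(args: Array[String]): Unit = {
    val settings = new Settings
    settings.usejavacp.value = true                // compile against the JVM classpath

    val intp = new SparkIMain(settings)
    intp.initializeSynchronous()

    intp.bind("answer", "Int", 42)                 // push a value into the interpreter
    intp.interpret("val doubled = answer * 2")     // run a line of code
    println(intp.valueOfTerm("doubled"))           // pull the value back out: Some(84)
  }
}
{code}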


was (Author: senkwich):
FYI, this is a blocker for us as well: https://github.com/ibm-et/spark-kernel

Specific issue: https://github.com/ibm-et/spark-kernel/issues/12

We are using quite a few different public methods from SparkIMain (such as 
pulling values back out of the interpreter), not just interpret and bind.

> Maven build should keep publishing spark-repl
> -
>
> Key: SPARK-4923
> URL: https://issues.apache.org/jira/browse/SPARK-4923
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Affects Versions: 1.2.0
>Reporter: Peng Cheng
>Priority: Critical
>  Labels: shell
> Attachments: 
> SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark-repl installation and deployment has been discontinued (see 
> SPARK-3452), but it is in the dependency list of a few projects that extend 
> its initialization process.
> Please remove the 'skip' setting in spark-repl and make it an 'official' API 
> to encourage more platforms to integrate with it.






[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-01-03 Thread Kannan Rajah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263600#comment-14263600
 ] 

Kannan Rajah commented on SPARK-1529:
-

[~lian cheng] Can you upload this prototype patch so that I can reuse it? Which 
branch was it based on? When I start making new changes, I suppose I can make 
them against the master branch, right?

> Support setting spark.local.dirs to a hadoop FileSystem 
> 
>
> Key: SPARK-1529
> URL: https://issues.apache.org/jira/browse/SPARK-1529
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Cheng Lian
>
> In some environments, like with MapR, local volumes are accessed through the 
> Hadoop filesystem interface. We should allow setting spark.local.dir to a 
> Hadoop filesystem location. 






[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-01-03 Thread Kannan Rajah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263603#comment-14263603
 ] 

Kannan Rajah commented on SPARK-1529:
-

It cannot preclude the use of java.io completely. If there are java.io APIs 
that are needed for some use case, then you cannot use the HDFS API. But that 
is the case normally.

The NFS-mount-based workaround is not as efficient as accessing the local 
volume through the HDFS interface. Hence the need.
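
To make the contrast concrete, a sketch of the two options (the MapR paths are 
made up; the {{maprfs://}} form is what {{spark.local.dir}} does not accept today):
{code}
import org.apache.spark.SparkConf

// Current workaround: point spark.local.dir at an NFS mount of the MapR volume.
val conf = new SparkConf()
  .setAppName("local-dirs-example")
  .set("spark.local.dir", "/mapr/my.cluster.com/apps/spark/tmp")   // NFS mount (works, but slower)

// What this ticket asks for: accept a Hadoop FileSystem URI directly (not supported yet).
// conf.set("spark.local.dir", "maprfs:///apps/spark/tmp")
{code}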

> Support setting spark.local.dirs to a hadoop FileSystem 
> 
>
> Key: SPARK-1529
> URL: https://issues.apache.org/jira/browse/SPARK-1529
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Cheng Lian
>
> In some environments, like with MapR, local volumes are accessed through the 
> Hadoop filesystem interface. We should allow setting spark.local.dir to a 
> Hadoop filesystem location. 






[jira] [Commented] (SPARK-5052) com.google.common.base.Optional binary has a wrong method signatures

2015-01-03 Thread Elmer Garduno (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263617#comment-14263617
 ] 

Elmer Garduno commented on SPARK-5052:
--

The problem seems to be fixed when using spark-submit instead of spark-class.

> com.google.common.base.Optional binary has a wrong method signatures
> 
>
> Key: SPARK-5052
> URL: https://issues.apache.org/jira/browse/SPARK-5052
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Elmer Garduno
>
> PR https://github.com/apache/spark/pull/1813 shaded the Guava jar file and moved 
> the Guava classes to the package org.spark-project.guava when Spark is built by Maven.
> When a user jar uses the actual com.google.common.base.Optional 
> transform(com.google.common.base.Function); method from Guava, a 
> java.lang.NoSuchMethodError: 
> com.google.common.base.Optional.transform(Lcom/google/common/base/Function;)Lcom/google/common/base/Optional;
> is thrown.
> The reason seems to be that the Optional class included in 
> spark-assembly-1.2.0-hadoop1.0.4.jar has an incorrect method signature that 
> includes the shaded class as an argument:
> Expected:
> javap -classpath 
> target/scala-2.10/googlegenomics-spark-examples-assembly-1.0.jar 
> com.google.common.base.Optional
>   public abstract  
> com.google.common.base.Optional 
> transform(com.google.common.base.Function);
> Found:
> javap -classpath lib/spark-assembly-1.2.0-hadoop1.0.4.jar 
> com.google.common.base.Optional
>   public abstract  
> com.google.common.base.Optional 
> transform(org.spark-project.guava.common.base.Function);
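
A minimal Scala reproduction of the call described above, assuming stock 
(unshaded) Guava on the user classpath; the object name is illustrative:
{code}
import com.google.common.base.{Function, Optional}

object OptionalTransformRepro {
  def main(args: Array[String]): Unit = {
    val opt: Optional[String] = Optional.of("spark")
    // Compiles against stock Guava; at runtime, an assembly whose Optional expects
    // the shaded Function type raises the NoSuchMethodError quoted above.
    val length: Optional[Integer] = opt.transform(new Function[String, Integer] {
      override def apply(s: String): Integer = s.length
    })
    println(length.get())
  }
}
{code}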






[jira] [Resolved] (SPARK-5052) com.google.common.base.Optional binary has a wrong method signatures

2015-01-03 Thread Elmer Garduno (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Elmer Garduno resolved SPARK-5052.
--
Resolution: Not a Problem

> com.google.common.base.Optional binary has a wrong method signatures
> 
>
> Key: SPARK-5052
> URL: https://issues.apache.org/jira/browse/SPARK-5052
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Elmer Garduno
>
> PR https://github.com/apache/spark/pull/1813 shaded the Guava jar file and moved 
> the Guava classes to the package org.spark-project.guava when Spark is built by Maven.
> When a user jar uses the actual com.google.common.base.Optional 
> transform(com.google.common.base.Function); method from Guava, a 
> java.lang.NoSuchMethodError: 
> com.google.common.base.Optional.transform(Lcom/google/common/base/Function;)Lcom/google/common/base/Optional;
> is thrown.
> The reason seems to be that the Optional class included in 
> spark-assembly-1.2.0-hadoop1.0.4.jar has an incorrect method signature that 
> includes the shaded class as an argument:
> Expected:
> javap -classpath 
> target/scala-2.10/googlegenomics-spark-examples-assembly-1.0.jar 
> com.google.common.base.Optional
>   public abstract  
> com.google.common.base.Optional 
> transform(com.google.common.base.Function);
> Found:
> javap -classpath lib/spark-assembly-1.2.0-hadoop1.0.4.jar 
> com.google.common.base.Optional
>   public abstract  
> com.google.common.base.Optional 
> transform(org.spark-project.guava.common.base.Function);






[jira] [Commented] (SPARK-5059) list of user's objects in Spark REPL

2015-01-03 Thread Tomas Hudik (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263620#comment-14263620
 ] 

Tomas Hudik commented on SPARK-5059:


Thanks, Sean.

Actually, after some googling I found that the Scala REPL (so the Spark shell 
can do it as well) can list all objects:
{quote}
scala> val b = "hello"
b: String = hello
scala>__
{quote}

(After pressing <tab> you will get all possible objects, including those 
defined by the user.)
I'm closing this issue.

> list of user's objects in Spark REPL
> 
>
> Key: SPARK-5059
> URL: https://issues.apache.org/jira/browse/SPARK-5059
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Shell
>Reporter: Tomas Hudik
>Priority: Minor
>  Labels: spark-shell
>
> Often users do not remember all the objects they have created in the Spark REPL 
> (shell). It would be helpful to have a command that lists all such objects; e.g. 
> R uses *ls()* to list all objects.






[jira] [Closed] (SPARK-5059) list of user's objects in Spark REPL

2015-01-03 Thread Tomas Hudik (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomas Hudik closed SPARK-5059.
--
Resolution: Not a Problem

If you press the <tab> key in the Spark shell (Scala REPL), you'll get a list of 
all objects.

> list of user's objects in Spark REPL
> 
>
> Key: SPARK-5059
> URL: https://issues.apache.org/jira/browse/SPARK-5059
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Shell
>Reporter: Tomas Hudik
>Priority: Minor
>  Labels: spark-shell
>
> Often users do not remember all the objects they have created in the Spark REPL 
> (shell). It would be helpful to have a command that lists all such objects; e.g. 
> R uses *ls()* to list all objects.






[jira] [Resolved] (SPARK-609) Add instructions for enabling Akka debug logging

2015-01-03 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-609.
--
Resolution: Incomplete

I'm going to close out this issue for now, since I think it's no longer an 
issue in recent versions of Spark.  Please comment / reopen if you think 
there's still something that needs fixing.

> Add instructions for enabling Akka debug logging
> 
>
> Key: SPARK-609
> URL: https://issues.apache.org/jira/browse/SPARK-609
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation
>Reporter: Josh Rosen
>Priority: Minor
>
> How can I enable Akka debug logging in Spark?  I tried setting 
> {{akka.loglevel = "DEBUG"}} in the configuration in {{AkkaUtils}}, and I also 
> tried setting properties in a {{log4j.conf}} file, but neither approach 
> worked.  It might be helpful to have instructions for this in a "Spark 
> Internals Debugging" guide.






[jira] [Resolved] (SPARK-988) Write PySpark profiling guide

2015-01-03 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-988.
--
Resolution: Fixed

We added distributed Python profiling support in 1.2 (see SPARK-3478).

> Write PySpark profiling guide
> -
>
> Key: SPARK-988
> URL: https://issues.apache.org/jira/browse/SPARK-988
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation, PySpark
>Reporter: Josh Rosen
>
> Write a guide on profiling PySpark applications.  I've done this in the past 
> by modifying the workers to make cProfile dumps, then using various tools to 
> collect and merge those dumps into an overall performance profile.






[jira] [Resolved] (SPARK-2084) Mention SPARK_JAR in env var section on configuration page

2015-01-03 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2084.
---
Resolution: Won't Fix

I'm resolving this as "Won't Fix" since the SPARK_JAR environment variable was 
deprecated in favor of a SparkConf property (this was done as part of the patch 
for SPARK-1395).
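
For reference, a sketch of the configuration-based replacement, assuming the 
YARN property {{spark.yarn.jar}} (the setting that superseded SPARK_JAR); the 
HDFS path is made up:
{code}
import org.apache.spark.SparkConf

// Instead of exporting SPARK_JAR, point the YARN backend at the assembly jar
// through configuration (or pass --conf spark.yarn.jar=... to spark-submit).
val conf = new SparkConf()
  .setAppName("yarn-app")
  .set("spark.yarn.jar", "hdfs:///user/spark/share/lib/spark-assembly.jar")
{code}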

> Mention SPARK_JAR in env var section on configuration page
> --
>
> Key: SPARK-2084
> URL: https://issues.apache.org/jira/browse/SPARK-2084
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.0.0
>Reporter: Sandy Ryza
>







[jira] [Comment Edited] (SPARK-5023) In Web UI job history, the total job duration is incorrect (much smaller than the sum of its stages)

2015-01-03 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263664#comment-14263664
 ] 

Josh Rosen edited comment on SPARK-5023 at 1/3/15 10:57 PM:


This sounds superficially similar to SPARK-4836; were there any failed stages 
in your logs?


was (Author: joshrosen):
This sounds superficially similar to SPARK-4836; were there any failed stages 
in your los?

> In Web UI job history, the total job duration is incorrect (much smaller than 
> the sum of its stages)
> 
>
> Key: SPARK-5023
> URL: https://issues.apache.org/jira/browse/SPARK-5023
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.1.1, 1.2.0
> Environment: Amazon EC2 AMI r3.2xlarge, cluster of 20 to 50 nodes, 
> running the ec2 provided scripts to create. 
>Reporter: Eran Medan
>
> I'm running a long process using Spark + Graph and things look good on the 
> 4040 job status UI, but when the job is done and I go to the history, the 
> job's total duration is much, much smaller than the total of its stages.
> The way I set logs up is this:
>   val homeDir = sys.props("user.home")
>   val logsPath = new File(homeDir,"sparkEventLogs")
>   val conf = new SparkConf().setAppName("...")
>   conf.set("spark.eventLog.enabled", "true").set("spark.eventLog.dir", 
> logsPath.getCanonicalPath)
> For example, job ID X shows a duration of 0.2 s, but when I click the job and 
> look at its stages, the sum of their durations is more than 15 minutes!
> (Before the job was over, in the 4040 job status, the job duration was 
> correct; it is only incorrect once it's done and I go to the logs.)
> I hope I didn't misconfigure something, because I was very surprised no one 
> has reported it yet (I searched, but perhaps I missed it).






[jira] [Commented] (SPARK-5023) In Web UI job history, the total job duration is incorrect (much smaller than the sum of its stages)

2015-01-03 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263664#comment-14263664
 ] 

Josh Rosen commented on SPARK-5023:
---

This sounds superficially similar to SPARK-4836; were there any failed stages 
in your logs?

> In Web UI job history, the total job duration is incorrect (much smaller than 
> the sum of its stages)
> 
>
> Key: SPARK-5023
> URL: https://issues.apache.org/jira/browse/SPARK-5023
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.1.1, 1.2.0
> Environment: Amazon EC2 AMI r3.2xlarge, cluster of 20 to 50 nodes, 
> running the ec2 provided scripts to create. 
>Reporter: Eran Medan
>
> I'm running a long process using Spark + Graph and things look good on the 
> 4040 job status UI, but when the job is done and I go to the history, the 
> job's total duration is much, much smaller than the total of its stages.
> The way I set logs up is this:
>   val homeDir = sys.props("user.home")
>   val logsPath = new File(homeDir,"sparkEventLogs")
>   val conf = new SparkConf().setAppName("...")
>   conf.set("spark.eventLog.enabled", "true").set("spark.eventLog.dir", 
> logsPath.getCanonicalPath)
> For example, job ID X shows a duration of 0.2 s, but when I click the job and 
> look at its stages, the sum of their durations is more than 15 minutes!
> (Before the job was over, in the 4040 job status, the job duration was 
> correct; it is only incorrect once it's done and I go to the logs.)
> I hope I didn't misconfigure something, because I was very surprised no one 
> has reported it yet (I searched, but perhaps I missed it).






[jira] [Commented] (SPARK-4923) Maven build should keep publishing spark-repl

2015-01-03 Thread Peng Cheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263681#comment-14263681
 ] 

Peng Cheng commented on SPARK-4923:
---

You are right; in fact 'Dev's API' simply means the method is susceptible to 
changes without deprecation or notice, which the 3 main marked methods are 
least likely to undergo.
Could you please edit the patch and add more markings?
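
For illustration, the kind of marking being discussed, using Spark's existing 
{{DeveloperApi}} annotation on a hypothetical facade class:
{code}
import org.apache.spark.annotation.DeveloperApi

class InterpreterFacade {
  /** Stable entry point. */
  def interpret(line: String): Unit = ()

  /** Usable by integrators, but subject to change without deprecation or notice. */
  @DeveloperApi
  def valueOfTerm(name: String): Option[AnyRef] = None
}
{code}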

> Maven build should keep publishing spark-repl
> -
>
> Key: SPARK-4923
> URL: https://issues.apache.org/jira/browse/SPARK-4923
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Spark Shell
>Affects Versions: 1.2.0
>Reporter: Peng Cheng
>Priority: Critical
>  Labels: shell
> Attachments: 
> SPARK-4923__Maven_build_should_keep_publishing_spark-repl.patch
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Spark-repl installation and deployment has been discontinued (see 
> SPARK-3452), but it is in the dependency list of a few projects that extend 
> its initialization process.
> Please remove the 'skip' setting in spark-repl and make it an 'official' API 
> to encourage more platforms to integrate with it.






[jira] [Commented] (SPARK-5001) BlockRDD removed unreasonablly in streaming

2015-01-03 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263694#comment-14263694
 ] 

Saisai Shao commented on SPARK-5001:


Hi [~hanhg], I don't think this is a problem with Spark Streaming. From my 
understanding, this exception is mostly introduced by an unstable running 
status, say the processing delay is larger than the batch interval that can be 
tolerated. IMO you should tune your application rather than modify Spark 
Streaming in this way. From my understanding, the patch you submitted is not a 
proper way to solve the problem; it will break the internal logic of Spark 
Streaming.
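
A small sketch of that kind of tuning (the interval values are assumptions; 
{{remember}} widens how long generated block RDDs are kept before clearMetadata 
drops them):
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

val conf = new SparkConf().setAppName("kafka-count")
// Pick a batch interval the job can actually keep up with...
val ssc = new StreamingContext(conf, Seconds(5))
// ...and/or keep block RDDs around longer than the default rememberDuration.
ssc.remember(Minutes(2))
{code}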

> BlockRDD removed unreasonablly in streaming
> ---
>
> Key: SPARK-5001
> URL: https://issues.apache.org/jira/browse/SPARK-5001
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.2, 1.1.1, 1.2.0
>Reporter: hanhonggen
> Attachments: 
> fix_bug_BlockRDD_removed_not_reasonablly_in_streaming.patch
>
>
> I've counted messages using the Kafka input stream of Spark 1.1.1. The test app 
> failed when a later batch job completed sooner than the previous one. In the 
> source code, BlockRDDs older than (time - rememberDuration) are removed in 
> clearMetadata after a job completes, and the previous job then aborts because 
> its block is not found. The relevant logs are as follows:
> 2014-12-25 
> 14:07:12(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO 
> :Starting job streaming job 1419487632000 ms.0 from job set of time 
> 1419487632000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO 
> :Starting job streaming job 1419487635000 ms.0 from job set of time 
> 1419487635000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-15] INFO 
> :Finished job streaming job 1419487635000 ms.0 from job set of time 
> 1419487635000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-16] INFO 
> :Removing blocks of RDD BlockRDD[3028] at createStream at TestKafka.java:144 
> of time 1419487635000 ms from DStream clearMetadata
> java.lang.Exception: Could not compute split, block input-0-1419487631400 not 
> found for 3028






[jira] [Reopened] (SPARK-5052) com.google.common.base.Optional binary has a wrong method signatures

2015-01-03 Thread Elmer Garduno (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Elmer Garduno reopened SPARK-5052:
--

False alarm, still getting the same errors when running on the standalone 
cluster.

> com.google.common.base.Optional binary has a wrong method signatures
> 
>
> Key: SPARK-5052
> URL: https://issues.apache.org/jira/browse/SPARK-5052
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Elmer Garduno
>
> PR https://github.com/apache/spark/pull/1813 shaded the Guava jar file and moved 
> the Guava classes to the package org.spark-project.guava when Spark is built by Maven.
> When a user jar uses the actual com.google.common.base.Optional 
> transform(com.google.common.base.Function); method from Guava, a 
> java.lang.NoSuchMethodError: 
> com.google.common.base.Optional.transform(Lcom/google/common/base/Function;)Lcom/google/common/base/Optional;
> is thrown.
> The reason seems to be that the Optional class included in 
> spark-assembly-1.2.0-hadoop1.0.4.jar has an incorrect method signature that 
> includes the shaded class as an argument:
> Expected:
> javap -classpath 
> target/scala-2.10/googlegenomics-spark-examples-assembly-1.0.jar 
> com.google.common.base.Optional
>   public abstract  
> com.google.common.base.Optional 
> transform(com.google.common.base.Function);
> Found:
> javap -classpath lib/spark-assembly-1.2.0-hadoop1.0.4.jar 
> com.google.common.base.Optional
>   public abstract  
> com.google.common.base.Optional 
> transform(org.spark-project.guava.common.base.Function);






[jira] [Commented] (SPARK-4986) Graceful shutdown for Spark Streaming does not work in Standalone cluster mode

2015-01-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263697#comment-14263697
 ] 

Apache Spark commented on SPARK-4986:
-

User 'cleaton' has created a pull request for this issue:
https://github.com/apache/spark/pull/3868

> Graceful shutdown for Spark Streaming does not work in Standalone cluster mode
> --
>
> Key: SPARK-4986
> URL: https://issues.apache.org/jira/browse/SPARK-4986
> Project: Spark
>  Issue Type: Bug
>  Components: Streaming
>Affects Versions: 1.2.0
>Reporter: Jesper Lundgren
>
> When using the graceful stop API of Spark Streaming in a Spark Standalone 
> cluster, the stop signal never reaches the receivers. I have tested this with 
> Spark 1.2 and Kafka receivers.
> ReceiverTracker sends a StopReceiver message to ReceiverSupervisorImpl.
> In local mode ReceiverSupervisorImpl receives this message, but in Standalone 
> cluster mode the message seems to be lost.
> (I have modified the code to send my own string message as a stop signal from 
> ReceiverTracker to ReceiverSupervisorImpl, and it works as a workaround.)
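
The graceful-stop call referred to above looks roughly like this (a sketch with 
an assumed socket source; the receiver choice and intervals are illustrative):
{code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GracefulStopSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("graceful-stop-demo")
    val ssc = new StreamingContext(conf, Seconds(2))

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()

    ssc.start()
    sys.ShutdownHookThread {
      // Ask Streaming to finish processing the data it has already received.
      ssc.stop(stopSparkContext = true, stopGracefully = true)
    }
    ssc.awaitTermination()
  }
}
{code}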






[jira] [Created] (SPARK-5064) GraphX rmatGraph hangs

2015-01-03 Thread Michael Malak (JIRA)
Michael Malak created SPARK-5064:


 Summary: GraphX rmatGraph hangs
 Key: SPARK-5064
 URL: https://issues.apache.org/jira/browse/SPARK-5064
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.2.0
 Environment: CentOS 7 REPL (no HDFS). Also tried Cloudera 5.2.0 
QuickStart standalone compiled Scala with spark-submit.
Reporter: Michael Malak


org.apache.spark.graphx.util.GraphGenerators.rmatGraph(sc, 4, 8)

It just outputs "0 edges" and then locks up.

A spark-user message reports similar behavior:
http://mail-archives.apache.org/mod_mbox/spark-user/201408.mbox/%3c1408617621830-12570.p...@n3.nabble.com%3E
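
A self-contained form of the reproduction (the master and the arguments mirror 
the report; the app name is arbitrary):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.util.GraphGenerators

object RmatHangRepro {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rmat-repro").setMaster("local[*]"))
    val graph = GraphGenerators.rmatGraph(sc, 4, 8)   // reported to print "0 edges" and hang
    println(graph.edges.count())
    sc.stop()
  }
}
{code}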







[jira] [Issue Comment Deleted] (SPARK-5001) BlockRDD removed unreasonablly in streaming

2015-01-03 Thread hanhonggen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hanhonggen updated SPARK-5001:
--
Comment: was deleted

(was: I think it's impossible to make sure that all jobs generated in Spark 
Streaming will finish in order. My patch may not be a proper way, but the 
current logic of Spark Streaming is too rough.)

> BlockRDD removed unreasonablly in streaming
> ---
>
> Key: SPARK-5001
> URL: https://issues.apache.org/jira/browse/SPARK-5001
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.2, 1.1.1, 1.2.0
>Reporter: hanhonggen
> Attachments: 
> fix_bug_BlockRDD_removed_not_reasonablly_in_streaming.patch
>
>
> I've counted messages using the Kafka input stream of Spark 1.1.1. The test app 
> failed when a later batch job completed sooner than the previous one. In the 
> source code, BlockRDDs older than (time - rememberDuration) are removed in 
> clearMetadata after a job completes, and the previous job then aborts because 
> its block is not found. The relevant logs are as follows:
> 2014-12-25 
> 14:07:12(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO 
> :Starting job streaming job 1419487632000 ms.0 from job set of time 
> 1419487632000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO 
> :Starting job streaming job 1419487635000 ms.0 from job set of time 
> 1419487635000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-15] INFO 
> :Finished job streaming job 1419487635000 ms.0 from job set of time 
> 1419487635000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-16] INFO 
> :Removing blocks of RDD BlockRDD[3028] at createStream at TestKafka.java:144 
> of time 1419487635000 ms from DStream clearMetadata
> java.lang.Exception: Could not compute split, block input-0-1419487631400 not 
> found for 3028






[jira] [Commented] (SPARK-5001) BlockRDD removed unreasonablly in streaming

2015-01-03 Thread hanhonggen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263701#comment-14263701
 ] 

hanhonggen commented on SPARK-5001:
---

I think it's impossible to make sure that all jobs generated in Spark Streaming 
will finish in order. My patch may not be a proper way, but the current logic of 
Spark Streaming is too rough.

> BlockRDD removed unreasonablly in streaming
> ---
>
> Key: SPARK-5001
> URL: https://issues.apache.org/jira/browse/SPARK-5001
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.2, 1.1.1, 1.2.0
>Reporter: hanhonggen
> Attachments: 
> fix_bug_BlockRDD_removed_not_reasonablly_in_streaming.patch
>
>
> I've counted messages using the Kafka input stream of Spark 1.1.1. The test app 
> failed when a later batch job completed sooner than the previous one. In the 
> source code, BlockRDDs older than (time - rememberDuration) are removed in 
> clearMetadata after a job completes, and the previous job then aborts because 
> its block is not found. The relevant logs are as follows:
> 2014-12-25 
> 14:07:12(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO 
> :Starting job streaming job 1419487632000 ms.0 from job set of time 
> 1419487632000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO 
> :Starting job streaming job 1419487635000 ms.0 from job set of time 
> 1419487635000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-15] INFO 
> :Finished job streaming job 1419487635000 ms.0 from job set of time 
> 1419487635000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-16] INFO 
> :Removing blocks of RDD BlockRDD[3028] at createStream at TestKafka.java:144 
> of time 1419487635000 ms from DStream clearMetadata
> java.lang.Exception: Could not compute split, block input-0-1419487631400 not 
> found for 3028






[jira] [Commented] (SPARK-5001) BlockRDD removed unreasonablly in streaming

2015-01-03 Thread hanhonggen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263702#comment-14263702
 ] 

hanhonggen commented on SPARK-5001:
---

I think it's impossible to make sure that all jobs generated in Spark Streaming 
will finish in order. My patch may not be a proper way, but the current logic of 
Spark Streaming is too rough.

> BlockRDD removed unreasonablly in streaming
> ---
>
> Key: SPARK-5001
> URL: https://issues.apache.org/jira/browse/SPARK-5001
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.2, 1.1.1, 1.2.0
>Reporter: hanhonggen
> Attachments: 
> fix_bug_BlockRDD_removed_not_reasonablly_in_streaming.patch
>
>
> I've counted messages using the Kafka input stream of Spark 1.1.1. The test app 
> failed when a later batch job completed sooner than the previous one. In the 
> source code, BlockRDDs older than (time - rememberDuration) are removed in 
> clearMetadata after a job completes, and the previous job then aborts because 
> its block is not found. The relevant logs are as follows:
> 2014-12-25 
> 14:07:12(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO 
> :Starting job streaming job 1419487632000 ms.0 from job set of time 
> 1419487632000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO 
> :Starting job streaming job 1419487635000 ms.0 from job set of time 
> 1419487635000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-15] INFO 
> :Finished job streaming job 1419487635000 ms.0 from job set of time 
> 1419487635000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-16] INFO 
> :Removing blocks of RDD BlockRDD[3028] at createStream at TestKafka.java:144 
> of time 1419487635000 ms from DStream clearMetadata
> java.lang.Exception: Could not compute split, block input-0-1419487631400 not 
> found for 3028






[jira] [Commented] (SPARK-5001) BlockRDD removed unreasonablly in streaming

2015-01-03 Thread hanhonggen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263699#comment-14263699
 ] 

hanhonggen commented on SPARK-5001:
---

I think it's impossible to make sure that all jobs generated in Spark Streaming 
will finish in order. My patch may not be a proper way, but the current logic of 
Spark Streaming is too rough.

> BlockRDD removed unreasonablly in streaming
> ---
>
> Key: SPARK-5001
> URL: https://issues.apache.org/jira/browse/SPARK-5001
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.2, 1.1.1, 1.2.0
>Reporter: hanhonggen
> Attachments: 
> fix_bug_BlockRDD_removed_not_reasonablly_in_streaming.patch
>
>
> I've counted messages using the Kafka input stream of Spark 1.1.1. The test app 
> failed when a later batch job completed sooner than the previous one. In the 
> source code, BlockRDDs older than (time - rememberDuration) are removed in 
> clearMetadata after a job completes, and the previous job then aborts because 
> its block is not found. The relevant logs are as follows:
> 2014-12-25 
> 14:07:12(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO 
> :Starting job streaming job 1419487632000 ms.0 from job set of time 
> 1419487632000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO 
> :Starting job streaming job 1419487635000 ms.0 from job set of time 
> 1419487635000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-15] INFO 
> :Finished job streaming job 1419487635000 ms.0 from job set of time 
> 1419487635000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-16] INFO 
> :Removing blocks of RDD BlockRDD[3028] at createStream at TestKafka.java:144 
> of time 1419487635000 ms from DStream clearMetadata
> java.lang.Exception: Could not compute split, block input-0-1419487631400 not 
> found for 3028






[jira] [Commented] (SPARK-5001) BlockRDD removed unreasonablly in streaming

2015-01-03 Thread hanhonggen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263700#comment-14263700
 ] 

hanhonggen commented on SPARK-5001:
---

I think it's impossible to make sure that all jobs generated in Spark Streaming 
will finish in order. My patch may not be a proper way, but the current logic of 
Spark Streaming is too rough.

> BlockRDD removed unreasonablly in streaming
> ---
>
> Key: SPARK-5001
> URL: https://issues.apache.org/jira/browse/SPARK-5001
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.2, 1.1.1, 1.2.0
>Reporter: hanhonggen
> Attachments: 
> fix_bug_BlockRDD_removed_not_reasonablly_in_streaming.patch
>
>
> I've counted messages using the Kafka input stream of Spark 1.1.1. The test app 
> failed when a later batch job completed sooner than the previous one. In the 
> source code, BlockRDDs older than (time - rememberDuration) are removed in 
> clearMetadata after a job completes, and the previous job then aborts because 
> its block is not found. The relevant logs are as follows:
> 2014-12-25 
> 14:07:12(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO 
> :Starting job streaming job 1419487632000 ms.0 from job set of time 
> 1419487632000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO 
> :Starting job streaming job 1419487635000 ms.0 from job set of time 
> 1419487635000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-15] INFO 
> :Finished job streaming job 1419487635000 ms.0 from job set of time 
> 1419487635000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-16] INFO 
> :Removing blocks of RDD BlockRDD[3028] at createStream at TestKafka.java:144 
> of time 1419487635000 ms from DStream clearMetadata
> java.lang.Exception: Could not compute split, block input-0-1419487631400 not 
> found for 3028






[jira] [Issue Comment Deleted] (SPARK-5001) BlockRDD removed unreasonablly in streaming

2015-01-03 Thread hanhonggen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hanhonggen updated SPARK-5001:
--
Comment: was deleted

(was: I think it's impossible to make sure that all jobs generated in Spark 
Streaming will finish in order. My patch may not be a proper way, but the 
current logic of Spark Streaming is too rough.)

> BlockRDD removed unreasonablly in streaming
> ---
>
> Key: SPARK-5001
> URL: https://issues.apache.org/jira/browse/SPARK-5001
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.2, 1.1.1, 1.2.0
>Reporter: hanhonggen
> Attachments: 
> fix_bug_BlockRDD_removed_not_reasonablly_in_streaming.patch
>
>
> I've counted messages using the Kafka input stream of Spark 1.1.1. The test app 
> failed when a later batch job completed sooner than the previous one. In the 
> source code, BlockRDDs older than (time - rememberDuration) are removed in 
> clearMetadata after a job completes, and the previous job then aborts because 
> its block is not found. The relevant logs are as follows:
> 2014-12-25 
> 14:07:12(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO 
> :Starting job streaming job 1419487632000 ms.0 from job set of time 
> 1419487632000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO 
> :Starting job streaming job 1419487635000 ms.0 from job set of time 
> 1419487635000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-15] INFO 
> :Finished job streaming job 1419487635000 ms.0 from job set of time 
> 1419487635000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-16] INFO 
> :Removing blocks of RDD BlockRDD[3028] at createStream at TestKafka.java:144 
> of time 1419487635000 ms from DStream clearMetadata
> java.lang.Exception: Could not compute split, block input-0-1419487631400 not 
> found for 3028






[jira] [Issue Comment Deleted] (SPARK-5001) BlockRDD removed unreasonablly in streaming

2015-01-03 Thread hanhonggen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hanhonggen updated SPARK-5001:
--
Comment: was deleted

(was: I think it's impossible to make sure that all jobs generated in Spark 
Streaming will finish in order. My patch may not be a proper way, but the 
current logic of Spark Streaming is too rough.)

> BlockRDD removed unreasonablly in streaming
> ---
>
> Key: SPARK-5001
> URL: https://issues.apache.org/jira/browse/SPARK-5001
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.2, 1.1.1, 1.2.0
>Reporter: hanhonggen
> Attachments: 
> fix_bug_BlockRDD_removed_not_reasonablly_in_streaming.patch
>
>
> I've counted messages using the Kafka input stream of Spark 1.1.1. The test app 
> failed when a later batch job completed sooner than the previous one. In the 
> source code, BlockRDDs older than (time - rememberDuration) are removed in 
> clearMetadata after a job completes, and the previous job then aborts because 
> its block is not found. The relevant logs are as follows:
> 2014-12-25 
> 14:07:12(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO 
> :Starting job streaming job 1419487632000 ms.0 from job set of time 
> 1419487632000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO 
> :Starting job streaming job 1419487635000 ms.0 from job set of time 
> 1419487635000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-15] INFO 
> :Finished job streaming job 1419487635000 ms.0 from job set of time 
> 1419487635000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-16] INFO 
> :Removing blocks of RDD BlockRDD[3028] at createStream at TestKafka.java:144 
> of time 1419487635000 ms from DStream clearMetadata
> java.lang.Exception: Could not compute split, block input-0-1419487631400 not 
> found for 3028






[jira] [Commented] (SPARK-5001) BlockRDD removed unreasonablly in streaming

2015-01-03 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263704#comment-14263704
 ] 

Saisai Shao commented on SPARK-5001:


Yeah, it's a restriction of Spark Streaming's design model; for some operators, 
like window operators, this restriction is meaningful. Also, there are some 
tuning guides that point out how to choose the batch interval to keep a Spark 
Streaming cluster in good status. I think it is easier to tune the app rather 
than change the logic of Spark Streaming, IIUC :).

> BlockRDD removed unreasonablly in streaming
> ---
>
> Key: SPARK-5001
> URL: https://issues.apache.org/jira/browse/SPARK-5001
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.0.2, 1.1.1, 1.2.0
>Reporter: hanhonggen
> Attachments: 
> fix_bug_BlockRDD_removed_not_reasonablly_in_streaming.patch
>
>
> I've counted messages using the Kafka input stream of Spark 1.1.1. The test app 
> failed when a later batch job completed sooner than the previous one. In the 
> source code, BlockRDDs older than (time - rememberDuration) are removed in 
> clearMetadata after a job completes, and the previous job then aborts because 
> its block is not found. The relevant logs are as follows:
> 2014-12-25 
> 14:07:12(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO 
> :Starting job streaming job 1419487632000 ms.0 from job set of time 
> 1419487632000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-14] INFO 
> :Starting job streaming job 1419487635000 ms.0 from job set of time 
> 1419487635000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-15] INFO 
> :Finished job streaming job 1419487635000 ms.0 from job set of time 
> 1419487635000 ms
> 2014-12-25 
> 14:07:15(Logging.scala:59)[sparkDriver-akka.actor.default-dispatcher-16] INFO 
> :Removing blocks of RDD BlockRDD[3028] at createStream at TestKafka.java:144 
> of time 1419487635000 ms from DStream clearMetadata
> java.lang.Exception: Could not compute split, block input-0-1419487631400 not 
> found for 3028






[jira] [Created] (SPARK-5065) BroadCast can still work after sc had been stopped.

2015-01-03 Thread SaintBacchus (JIRA)
SaintBacchus created SPARK-5065:
---

 Summary: BroadCast can still work after sc had been stopped.
 Key: SPARK-5065
 URL: https://issues.apache.org/jira/browse/SPARK-5065
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.1
Reporter: SaintBacchus


Code as follows:
{code:borderStyle=solid}
val sc1 = new SparkContext
val sc2 = new SparkContext
sc1.broadcast(1)
sc1.stop
{code}
This still works, because sc1.broadcast will reuse the BlockManager from sc2.
To fix it, throw a SparkException when the BroadcastManager has been stopped.






[jira] [Updated] (SPARK-5065) BroadCast can still work after sc had been stopped.

2015-01-03 Thread SaintBacchus (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

SaintBacchus updated SPARK-5065:

Description: 
Code as follows:
{code:borderStyle=solid}
val sc1 = new SparkContext
val sc2 = new SparkContext
sc1.stop
sc1.broadcast(1)
{code}
This still works, because sc1.broadcast will reuse the BlockManager from sc2.
To fix it, throw a SparkException when the BroadcastManager has been stopped.

  was:
Code as follows:
{code:borderStyle=solid}
val sc1 = new SparkContext
val sc2 = new SparkContext
sc1.broadcast(1)
sc1.stop
{code}
This still works, because sc1.broadcast will reuse the BlockManager from sc2.
To fix it, throw a SparkException when the BroadcastManager has been stopped.


> BroadCast can still work after sc had been stopped.
> ---
>
> Key: SPARK-5065
> URL: https://issues.apache.org/jira/browse/SPARK-5065
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.1
>Reporter: SaintBacchus
>
> Code as follows:
> {code:borderStyle=solid}
> val sc1 = new SparkContext
> val sc2 = new SparkContext
> sc1.stop
> sc1.broadcast(1)
> {code}
> The broadcast still works, because sc1.broadcast reuses the BlockManager of sc2.
> To fix it, throw a SparkException when the BroadcastManager has been stopped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5065) BroadCast can still work after sc had been stopped.

2015-01-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263729#comment-14263729
 ] 

Apache Spark commented on SPARK-5065:
-

User 'SaintBacchus' has created a pull request for this issue:
https://github.com/apache/spark/pull/3885

> BroadCast can still work after sc had been stopped.
> ---
>
> Key: SPARK-5065
> URL: https://issues.apache.org/jira/browse/SPARK-5065
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.1
>Reporter: SaintBacchus
>
> Code as follows:
> {code:borderStyle=solid}
> val sc1 = new SparkContext
> val sc2 = new SparkContext
> sc1.stop
> sc1.broadcast(1)
> {code}
> The broadcast still works, because sc1.broadcast reuses the BlockManager of sc2.
> To fix it, throw a SparkException when the BroadcastManager has been stopped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5066) Can not get all value that has the same key when reading key ordered from different Streaming.

2015-01-03 Thread DoingDone9 (JIRA)
DoingDone9 created SPARK-5066:
-

 Summary: Can not get all value that has the same key  when reading 
key ordered  from different Streaming.
 Key: SPARK-5066
 URL: https://issues.apache.org/jira/browse/SPARK-5066
 Project: Spark
  Issue Type: Bug
Reporter: DoingDone9
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5066) Can not get all key when reading key ordered from different Streaming.

2015-01-03 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-5066:
--
Summary: Can not get all key  when reading key ordered  from different 
Streaming.  (was: Can not get all value that has the same key  when reading key 
ordered  from different Streaming.)

> Can not get all key  when reading key ordered  from different Streaming.
> 
>
> Key: SPARK-5066
> URL: https://issues.apache.org/jira/browse/SPARK-5066
> Project: Spark
>  Issue Type: Bug
>Reporter: DoingDone9
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5066) Can not get all key that has same hashcode when reading key ordered from different Streaming.

2015-01-03 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-5066:
--
Summary: Can not get all key that has same hashcode  when reading key 
ordered  from different Streaming.  (was: Can not get all key  when reading key 
ordered  from different Streaming.)

> Can not get all key that has same hashcode  when reading key ordered  from 
> different Streaming.
> ---
>
> Key: SPARK-5066
> URL: https://issues.apache.org/jira/browse/SPARK-5066
> Project: Spark
>  Issue Type: Bug
>Reporter: DoingDone9
>Priority: Critical
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5066) Can not get all key that has same hashcode when reading key ordered from different Streaming.

2015-01-03 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-5066:
--
Description: 
When spilling is enabled, data ordered by hash code is spilled to disk. When 
merging values we need to read, from the different temporary files, every key 
that has the same hash code, but the code only reads the keys matching the 
minimum hash code at the head of each file, so not all keys are read.
Example:
If file1 contains [k1, k2, k3] and file2 contains [k4, k5, k1],
and hashcode(k4) < hashcode(k5) < hashcode(k1) < hashcode(k2) < hashcode(k3),
then only k1 is read from file1 and k4 from file2, so not every copy of k1 is read.

Code:
{code:borderStyle=solid}
private val inputStreams = (Seq(sortedMap) ++ spilledMaps).map(it => it.buffered)

inputStreams.foreach { it =>
  val kcPairs = new ArrayBuffer[(K, C)]
  readNextHashCode(it, kcPairs)
  if (kcPairs.length > 0) {
    mergeHeap.enqueue(new StreamBuffer(it, kcPairs))
  }
}

private def readNextHashCode(it: BufferedIterator[(K, C)], buf: ArrayBuffer[(K, C)]): Unit = {
  if (it.hasNext) {
    var kc = it.next()
    buf += kc
    val minHash = hashKey(kc)
    while (it.hasNext && it.head._1.hashCode() == minHash) {
      kc = it.next()
      buf += kc
    }
  }
}
{code}
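
To make the concern concrete, here is a minimal, self-contained sketch (not Spark's actual ExternalAppendOnlyMap code, and it takes no position on whether the real implementation is affected): with hash-ordered spill streams, a merge pass has to drain the head hash group from every stream whose head carries the current minimum hash before combining, otherwise copies of a key left in other streams are missed. The Key class, its hash values, and the merge helper are illustrative assumptions.

{code:borderStyle=solid}
import scala.collection.mutable.ArrayBuffer

object HashMergeSketch {

  // Key with a controllable hash code so the ordering from the example above holds.
  final case class Key(name: String, hash: Int) {
    override def hashCode(): Int = hash
  }

  // Read the run of pairs at the head of `it` that all share the head's hash code.
  def readNextHashCode[K, C](it: BufferedIterator[(K, C)]): Seq[(K, C)] = {
    val buf = ArrayBuffer[(K, C)]()
    if (it.hasNext) {
      val minHash = it.head._1.hashCode()
      while (it.hasNext && it.head._1.hashCode() == minHash) buf += it.next()
    }
    buf.toSeq
  }

  // Merge hash-ordered streams: drain the head group from *every* stream whose head
  // carries the current minimum hash, then combine values per key.
  def merge[K, C](streams: Seq[BufferedIterator[(K, C)]], combine: (C, C) => C): Seq[(K, C)] = {
    val out = ArrayBuffer[(K, C)]()
    var live = streams.filter(_.hasNext)
    while (live.nonEmpty) {
      val minHash = live.map(_.head._1.hashCode()).min
      val group = live.filter(_.head._1.hashCode() == minHash).flatMap(readNextHashCode(_))
      out ++= group.groupBy(_._1).map { case (k, kcs) => (k, kcs.map(_._2).reduce(combine)) }
      live = live.filter(_.hasNext)
    }
    out.toSeq
  }

  def main(args: Array[String]): Unit = {
    // file1 = [k1, k2, k3], file2 = [k4, k5, k1], with hash codes chosen to match
    // hashcode(k4) < hashcode(k5) < hashcode(k1) < hashcode(k2) < hashcode(k3).
    val Seq(k1, k2, k3, k4, k5) =
      Seq(Key("k1", 3), Key("k2", 4), Key("k3", 5), Key("k4", 1), Key("k5", 2))
    val file1 = Iterator(k1 -> 1, k2 -> 1, k3 -> 1).buffered
    val file2 = Iterator(k4 -> 1, k5 -> 1, k1 -> 1).buffered
    // Both copies of k1 end up combined into a single entry: Key(k1,3) -> 2.
    println(merge(Seq(file1, file2), (a: Int, b: Int) => a + b))
  }
}
{code}

Running main prints a single combined entry for k1, which is the behaviour the reporter expects from the merge.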



> Can not get all key that has same hashcode  when reading key ordered  from 
> different Streaming.
> ---
>
> Key: SPARK-5066
> URL: https://issues.apache.org/jira/browse/SPARK-5066
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: DoingDone9
>Priority: Critical
>
> When spilling is enabled, data ordered by hash code is spilled to disk. When 
> merging values we need to read, from the different temporary files, every key 
> that has the same hash code, but the code only reads the keys matching the 
> minimum hash code at the head of each file, so not all keys are read.
> Example:
> If file1 contains [k1, k2, k3] and file2 contains [k4, k5, k1],
> and hashcode(k4) < hashcode(k5) < hashcode(k1) < hashcode(k2) < hashcode(k3),
> then only k1 is read from file1 and k4 from file2, so not every copy of k1 is read.
> Code:
> {code:borderStyle=solid}
> private val inputStreams = (Seq(sortedMap) ++ spilledMaps).map(it => it.buffered)
> inputStreams.foreach { it =>
>   val kcPairs = new ArrayBuffer[(K, C)]
>   readNextHashCode(it, kcPairs)
>   if (kcPairs.length > 0) {
>     mergeHeap.enqueue(new StreamBuffer(it, kcPairs))
>   }
> }
> private def readNextHashCode(it: BufferedIterator[(K, C)], buf: ArrayBuffer[(K, C)]): Unit = {
>   if (it.hasNext) {
>     var kc = it.next()
>     buf += kc
>     val minHash = hashKey(kc)
>     while (it.hasNext && it.head._1.hashCode() == minHash) {
>       kc = it.next()
>       buf += kc
>     }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5066) Can not get all key that has same hashcode when reading key ordered from different Streaming.

2015-01-03 Thread DoingDone9 (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DoingDone9 updated SPARK-5066:
--
Affects Version/s: 1.2.0

> Can not get all key that has same hashcode  when reading key ordered  from 
> different Streaming.
> ---
>
> Key: SPARK-5066
> URL: https://issues.apache.org/jira/browse/SPARK-5066
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: DoingDone9
>Priority: Critical
>
> When spilling is enabled, data ordered by hash code is spilled to disk. When 
> merging values we need to read, from the different temporary files, every key 
> that has the same hash code, but the code only reads the keys matching the 
> minimum hash code at the head of each file, so not all keys are read.
> Example:
> If file1 contains [k1, k2, k3] and file2 contains [k4, k5, k1],
> and hashcode(k4) < hashcode(k5) < hashcode(k1) < hashcode(k2) < hashcode(k3),
> then only k1 is read from file1 and k4 from file2, so not every copy of k1 is read.
> Code:
> {code:borderStyle=solid}
> private val inputStreams = (Seq(sortedMap) ++ spilledMaps).map(it => it.buffered)
> inputStreams.foreach { it =>
>   val kcPairs = new ArrayBuffer[(K, C)]
>   readNextHashCode(it, kcPairs)
>   if (kcPairs.length > 0) {
>     mergeHeap.enqueue(new StreamBuffer(it, kcPairs))
>   }
> }
> private def readNextHashCode(it: BufferedIterator[(K, C)], buf: ArrayBuffer[(K, C)]): Unit = {
>   if (it.hasNext) {
>     var kc = it.next()
>     buf += kc
>     val minHash = hashKey(kc)
>     while (it.hasNext && it.head._1.hashCode() == minHash) {
>       kc = it.next()
>       buf += kc
>     }
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5067) testTaskInfo doesn't compare SparkListenerApplicationStart.appId

2015-01-03 Thread Shixiong Zhu (JIRA)
Shixiong Zhu created SPARK-5067:
---

 Summary: testTaskInfo doesn't compare 
SparkListenerApplicationStart.appId
 Key: SPARK-5067
 URL: https://issues.apache.org/jira/browse/SPARK-5067
 Project: Spark
  Issue Type: Test
Reporter: Shixiong Zhu
Priority: Minor


In org.apache.spark.util.JsonProtocolSuite.testTaskInfo, 
SparkListenerApplicationStart.appId is not compared when two 
SparkListenerApplicationStart events are compared.

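A minimal sketch of the missing check, assuming the 1.2-era fields appName, appId, time and sparkUser on SparkListenerApplicationStart; this is a hypothetical helper, not the actual JsonProtocolSuite code.

{code:borderStyle=solid}
import org.apache.spark.scheduler.SparkListenerApplicationStart

object ApplicationStartCompareSketch {
  // Compare two application-start events field by field, including appId.
  def assertEquals(e1: SparkListenerApplicationStart, e2: SparkListenerApplicationStart): Unit = {
    assert(e1.appName == e2.appName)
    assert(e1.appId == e2.appId) // the field the suite reportedly does not check
    assert(e1.time == e2.time)
    assert(e1.sparkUser == e2.sparkUser)
  }
}
{code}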


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5058) Typos and broken URL

2015-01-03 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-5058.
--
   Resolution: Fixed
Fix Version/s: 1.3.0

> Typos and broken URL
> 
>
> Key: SPARK-5058
> URL: https://issues.apache.org/jira/browse/SPARK-5058
> Project: Spark
>  Issue Type: Documentation
>  Components: Streaming
>Affects Versions: 1.2.0
>Reporter: AkhlD
>Priority: Minor
> Fix For: 1.3.0, 1.2.1
>
>
> The Spark Streaming + Kafka Integration Guide has a broken Examples link. Also, 
> "project" is spelled as "projrect".



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5067) testTaskInfo doesn't compare SparkListenerApplicationStart.appId

2015-01-03 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5067?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu updated SPARK-5067:

Component/s: Spark Core

> testTaskInfo doesn't compare SparkListenerApplicationStart.appId
> 
>
> Key: SPARK-5067
> URL: https://issues.apache.org/jira/browse/SPARK-5067
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Priority: Minor
>
> In org.apache.spark.util.JsonProtocolSuite.testTaskInfo, 
> SparkListenerApplicationStart.appId is not compared when two 
> SparkListenerApplicationStart events are compared.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5067) testTaskInfo doesn't compare SparkListenerApplicationStart.appId

2015-01-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263748#comment-14263748
 ] 

Apache Spark commented on SPARK-5067:
-

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/3886

> testTaskInfo doesn't compare SparkListenerApplicationStart.appId
> 
>
> Key: SPARK-5067
> URL: https://issues.apache.org/jira/browse/SPARK-5067
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Reporter: Shixiong Zhu
>Priority: Minor
>
> In org.apache.spark.util.JsonProtocolSuite.testTaskInfo, 
> SparkListenerApplicationStart.appId is not compared when two 
> SparkListenerApplicationStart events are compared.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3306) Addition of external resource dependency in executors

2015-01-03 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263761#comment-14263761
 ] 

koert kuipers commented on SPARK-3306:
--

I am also interested in resources that can be shared across tasks inside the 
same executor, with a way to close/clean up those resources when the executor shuts down.
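
A common workaround sketch for that pattern (not a Spark API) is an executor-side lazy singleton plus a JVM shutdown hook; the object and field names below are illustrative, and the ConcurrentHashMap merely stands in for a real resource such as a connection pool.

{code:borderStyle=solid}
import java.util.concurrent.ConcurrentHashMap

// One instance per executor JVM; every task running in that executor sees the
// same object, so the resource is shared across tasks.
object SharedResource {
  lazy val cache: ConcurrentHashMap[String, String] = {
    val c = new ConcurrentHashMap[String, String]() // stands in for e.g. a connection pool
    // Best-effort cleanup when the executor JVM exits.
    sys.addShutdownHook { c.clear() }
    c
  }
}

// Driver-side usage, e.g.:
//   rdd.mapPartitions { iter =>
//     val cache = SharedResource.cache // initialized once per executor
//     iter.map { x => /* use the shared resource */ x }
//   }
{code}

A real resource would close itself in the shutdown hook; this ticket asks for a first-class way to do the same through Spark itself.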

> Addition of external resource dependency in executors
> -
>
> Key: SPARK-3306
> URL: https://issues.apache.org/jira/browse/SPARK-3306
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Yan
>
> Currently, Spark executors only support static, read-only external resources 
> such as side files and jar files. With emerging disparate data sources, there 
> is a need to support more versatile external resources, such as connections to 
> data sources, to facilitate efficient data access. For one, JDBCRDD, with some 
> modifications, could benefit from this feature by reusing JDBC connections 
> previously established in the same Spark context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-5068) When the path not found in the hdfs,we can't get the result

2015-01-03 Thread jeanlyn (JIRA)
jeanlyn created SPARK-5068:
--

 Summary: When the path not found in the hdfs,we can't get the 
result
 Key: SPARK-5068
 URL: https://issues.apache.org/jira/browse/SPARK-5068
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0
Reporter: jeanlyn


When a partition path is found in the metastore but not in HDFS, it causes 
problems such as the following:
```
hive> show partitions partition_test;
OK
dt=1
dt=2
dt=3
dt=4
Time taken: 0.168 seconds, Fetched: 4 row(s)
```

```
hive> dfs -ls /user/jeanlyn/warehouse/partition_test;
Found 3 items
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
/user/jeanlyn/warehouse/partition_test/dt=1
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
/user/jeanlyn/warehouse/partition_test/dt=3
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 17:42 
/user/jeanlyn/warehouse/partition_test/dt=4
```
When I run the SQL `select * from partition_test limit 10` in **hive** there is 
no problem, but when I run it in spark-sql I get the following error:

```
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: 
Input path does not exist: 
hdfs://jeanlyn:9000/user/jeanlyn/warehouse/partition_test/dt=2
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328)
at org.apache.spark.rdd.RDD.collect(RDD.scala:780)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
at org.apache.spark.sql.hive.testpartition$.main(test.scala:23)
at org.apache.spark.sql.hive.testpartition.main(test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
```
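
One possible direction, sketched below with the plain Hadoop FileSystem API (a hypothetical helper, not the actual Spark SQL code), is to drop partitions whose metastore location no longer exists before building the scan:

{code:borderStyle=solid}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

object PartitionPathFilterSketch {
  // Keep only the partition locations that actually exist in the file system,
  // e.g. dt=1, dt=3 and dt=4 above, silently skipping the dangling dt=2.
  def existingPartitions(locations: Seq[String], conf: Configuration): Seq[String] =
    locations.filter { loc =>
      val path = new Path(loc)
      path.getFileSystem(conf).exists(path)
    }
}
{code}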




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5068) When the path not found in the hdfs,we can't get the result

2015-01-03 Thread jeanlyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jeanlyn updated SPARK-5068:
---
Description: 
When a partition path is found in the metastore but not in HDFS, it causes 
problems such as the following:
```
hive> show partitions partition_test;
OK
dt=1
dt=2
dt=3
dt=4
Time taken: 0.168 seconds, Fetched: 4 row(s)
```

```
hive> dfs -ls /user/jeanlyn/warehouse/partition_test;
Found 3 items
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
/user/jeanlyn/warehouse/partition_test/dt=1
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
/user/jeanlyn/warehouse/partition_test/dt=3
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 17:42 
/user/jeanlyn/warehouse/partition_test/dt=4
```
When I run the SQL `select * from partition_test limit 10` in **hive** there is 
no problem, but when I run it in spark-sql I get the following error:

```
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: 
Input path does not exist: 
hdfs://jeanlyn:9000/user/jeanlyn/warehouse/partition_test/dt=2
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328)
at org.apache.spark.rdd.RDD.collect(RDD.scala:780)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
at org.apache.spark.sql.hive.testpartition$.main(test.scala:23)
at org.apache.spark.sql.hive.testpartition.main(test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
```


  was:
when the partion path was found in the metastore but not found in the hdfs,it 
will casue some problems as follow:
```
hive> show partitions partition_test;
OK
dt=1
dt=2
dt=3
dt=4
Time taken: 0.168 seconds, Fetched: 4 row(s)
```

```
hive> dfs -ls /user/jeanlyn/warehouse/partition_test;
Found 3 items
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
/user/jeanlyn/warehouse/partition_test/dt=1
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
/user/jeanlyn/warehouse/partition_test/dt=3
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 17:42 
/user/jeanlyn/warehouse/partition_test/dt=4
```
when i run the sq `select * from partition_test limit 10`l in  **hiv

[jira] [Updated] (SPARK-5068) When the path not found in the hdfs,we can't get the result

2015-01-03 Thread jeanlyn (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jeanlyn updated SPARK-5068:
---
Description: 
When a partition path is found in the metastore but not in HDFS, it causes 
problems such as the following:
{noformat}
hive> show partitions partition_test;
OK
dt=1
dt=2
dt=3
dt=4
Time taken: 0.168 seconds, Fetched: 4 row(s)
{noformat}

{noformat}
hive> dfs -ls /user/jeanlyn/warehouse/partition_test;
Found 3 items
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
/user/jeanlyn/warehouse/partition_test/dt=1
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
/user/jeanlyn/warehouse/partition_test/dt=3
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 17:42 
/user/jeanlyn/warehouse/partition_test/dt=4
{noformat}
When I run the SQL
{noformat}
select * from partition_test limit 10
{noformat}
in *hive* there is no problem, but when I run it in *spark-sql* I get the 
following error:

{noformat}
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: 
Input path does not exist: 
hdfs://jeanlyn:9000/user/jeanlyn/warehouse/partition_test/dt=2
at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:251)
at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:270)
at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at 
org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at org.apache.spark.rdd.UnionRDD$$anonfun$1.apply(UnionRDD.scala:66)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.rdd.UnionRDD.getPartitions(UnionRDD.scala:66)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.rdd.MappedRDD.getPartitions(MappedRDD.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:205)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:203)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:203)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1328)
at org.apache.spark.rdd.RDD.collect(RDD.scala:780)
at 
org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:84)
at org.apache.spark.sql.SchemaRDD.collect(SchemaRDD.scala:444)
at org.apache.spark.sql.hive.testpartition$.main(test.scala:23)
at org.apache.spark.sql.hive.testpartition.main(test.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:134)
{noformat}


  was:
when the partion path was found in the metastore but not found in the hdfs,it 
will casue some problems as follow:
```
hive> show partitions partition_test;
OK
dt=1
dt=2
dt=3
dt=4
Time taken: 0.168 seconds, Fetched: 4 row(s)
```

```
hive> dfs -ls /user/jeanlyn/warehouse/partition_test;
Found 3 items
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
/user/jeanlyn/warehouse/partition_test/dt=1
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 16:29 
/user/jeanlyn/warehouse/partition_test/dt=3
drwxr-xr-x   - jeanlyn supergroup  0 2014-12-02 17:42 
/user/jeanlyn/warehouse/partition_test/dt=4
```
when