[jira] [Assigned] (SPARK-3212) Improve the clarity of caching semantics

2014-08-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-3212:
---

Assignee: Michael Armbrust

> Improve the clarity of caching semantics
> 
>
> Key: SPARK-3212
> URL: https://issues.apache.org/jira/browse/SPARK-3212
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>Priority: Blocker
>
> Right now there are a bunch of different ways to cache tables in Spark SQL. 
> For example:
>  - tweets.cache()
>  - sql("SELECT * FROM tweets").cache()
>  - table("tweets").cache()
>  - tweets.cache().registerTempTable("tweets")
>  - sql("CACHE TABLE tweets")
>  - cacheTable("tweets")
> Each of the above commands has subtly different semantics, leading to a very 
> confusing user experience.  Ideally, we would stop doing caching based on 
> simple table names and instead have a phase of optimization that does 
> intelligent matching of query plans with available cached data.
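
For concreteness, here is roughly how those variants look when written out in a Spark SQL 1.x session. This is only a sketch: the sqlContext setup, the jsonFile call, and the file path are assumptions, not part of the issue.

{code}
import sqlContext._

val tweets = sqlContext.jsonFile("examples/tweets.json")   // some SchemaRDD
tweets.registerTempTable("tweets")

tweets.cache()                       // caches the SchemaRDD itself
sql("SELECT * FROM tweets").cache()  // caches the result of a fresh query plan
table("tweets").cache()              // caches the plan looked up by table name
sql("CACHE TABLE tweets")            // caching driven by the SQL command
cacheTable("tweets")                 // caching keyed by the registered table name
{code}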






[jira] [Created] (SPARK-3295) [Spark SQL] schemaRdd1 ++ schemaRdd2 does not return another SchemaRdd

2014-08-28 Thread Evan Chan (JIRA)
Evan Chan created SPARK-3295:


 Summary: [Spark SQL] schemaRdd1 ++ schemaRdd2  does not return 
another SchemaRdd
 Key: SPARK-3295
 URL: https://issues.apache.org/jira/browse/SPARK-3295
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.2
Reporter: Evan Chan
Priority: Minor


Right now, 

schemaRdd1.unionAll(schemaRdd2) returns a SchemaRdd.

However,

schemaRdd1 ++ schemaRdd2 returns an RDD[Row].
Similarly,
schemaRdd1.union(schemaRdd2) returns an RDD[Row].

This is inconsistent.  Let's make ++ and union have the same behavior as 
unionAll.

Actually, I'm not sure we need both union and unionAll.
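
A minimal sketch of the inconsistency, assuming two SchemaRDDs with the same schema (the table names are made up):

{code}
val a = sqlContext.sql("SELECT * FROM t1")   // SchemaRDD
val b = sqlContext.sql("SELECT * FROM t2")   // SchemaRDD

val u1 = a.unionAll(b)   // SchemaRDD -- schema preserved
val u2 = a ++ b          // RDD[Row]  -- falls back to the plain RDD method
val u3 = a.union(b)      // RDD[Row]  -- same problem
{code}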






[jira] [Resolved] (SPARK-3279) Remove useless field variable in ApplicationMaster

2014-08-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3279.
---

   Resolution: Fixed
Fix Version/s: 1.2.0

Issue resolved by pull request 2177
[https://github.com/apache/spark/pull/2177]

> Remove useless field variable in ApplicationMaster
> --
>
> Key: SPARK-3279
> URL: https://issues.apache.org/jira/browse/SPARK-3279
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Kousuke Saruta
> Fix For: 1.2.0
>
>
> ApplicationMaster no longer uses "ALLOCATE_HEARTBEAT_INTERVAL".
> Let's remove it.






[jira] [Updated] (SPARK-3292) Shuffle Tasks run incessantly even though there's no inputs

2014-08-28 Thread guowei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guowei updated SPARK-3292:
--

Description: 
Shuffle operations such as repartition, groupBy, join, and cogroup run their 
shuffle tasks even when there is no input.
For example, if I save the shuffle outputs as a Hadoop file, many empty files 
are generated even though there is no input.
This is too expensive.

  was:
Shuffle operations such as repartition, groupBy, join, and cogroup are too expensive.
For example, if I save the shuffle outputs as a Hadoop file, many empty files 
are generated even though there is no input.


> Shuffle Tasks run incessantly even though there's no inputs
> ---
>
> Key: SPARK-3292
> URL: https://issues.apache.org/jira/browse/SPARK-3292
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.2
>Reporter: guowei
>
> Shuffle operations such as repartition, groupBy, join, and cogroup run their 
> shuffle tasks even when there is no input. For example, if I save the shuffle 
> outputs as a Hadoop file, many empty files are generated even though there is 
> no input.
> This is too expensive.






[jira] [Updated] (SPARK-3292) Shuffle Tasks run incessantly even though there's no inputs

2014-08-28 Thread guowei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guowei updated SPARK-3292:
--

Description: 
Shuffle operations such as repartition, groupBy, join, and cogroup are too expensive.
For example, if I save the shuffle outputs as a Hadoop file, many empty files 
are generated even though there is no input.

  was:
Shuffle operations such as repartition, groupBy, join, and cogroup are too expensive.
For example, if I want to save the outputs as a Hadoop file, many empty files 
are generated.


> Shuffle Tasks run incessantly even though there's no inputs
> ---
>
> Key: SPARK-3292
> URL: https://issues.apache.org/jira/browse/SPARK-3292
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.2
>Reporter: guowei
>
> Shuffle operations such as repartition, groupBy, join, and cogroup are too expensive.
> For example, if I save the shuffle outputs as a Hadoop file, many empty files 
> are generated even though there is no input.






[jira] [Comment Edited] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Colin B. (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114881#comment-14114881
 ] 

Colin B. edited comment on SPARK-3266 at 8/29/14 5:38 AM:
--

Broken for JavaDoubleRDD: fold, reduce, min, max
{code}
java.lang.NoSuchMethodError: 
org.apache.spark.api.java.JavaDoubleRDD.fold(Ljava/lang/Double;Lorg/apache/spark/api/java/function/Function2;)Ljava/lang/Double;
java.lang.NoSuchMethodError: 
org.apache.spark.api.java.JavaDoubleRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Ljava/lang/Double;
java.lang.NoSuchMethodError: 
org.apache.spark.api.java.JavaDoubleRDD.min(Ljava/util/Comparator;)Ljava/lang/Double;
java.lang.NoSuchMethodError: 
org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
{code}

The "first" method would also be affected, but it seems that [~joshrosen] fixed 
that when he first implemented JavaDoubleRDD.

Also, I would bet JavaPairRDD and JavaSchemaRDD have similar issues.


was (Author: lanzaa):
Broken for JavaDoubleRDD: fold, reduce, min, max
{code}
java.lang.NoSuchMethodError: 
org.apache.spark.api.java.JavaDoubleRDD.fold(Ljava/lang/Double;Lorg/apache/spark/api/java/function/Function2;)Ljava/lang/Double;
java.lang.NoSuchMethodError: 
org.apache.spark.api.java.JavaDoubleRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Ljava/lang/Double;
java.lang.NoSuchMethodError: 
org.apache.spark.api.java.JavaDoubleRDD.min(Ljava/util/Comparator;)Ljava/lang/Double;
java.lang.NoSuchMethodError: 
org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
{code}

The "first" method would also be affected, but it seems that [~joshrosen] fixed 
that when he first implemented JavaDoubleRDD.

Also, I would bet JavaPairRDD.scala and JavaSchemaRDD.scala have similar issues.

> JavaDoubleRDD doesn't contain max()
> ---
>
> Key: SPARK-3266
> URL: https://issues.apache.org/jira/browse/SPARK-3266
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.0.1, 1.0.2, 1.1.0
>Reporter: Amey Chaugule
>Assignee: Josh Rosen
> Attachments: spark-repro-3266.tar.gz
>
>
> While I can compile my code, I see:
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
> when I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I 
> don't see max(), although it is clearly listed in the documentation.






[jira] [Updated] (SPARK-3292) Shuffle Tasks run incessantly even though there's no inputs

2014-08-28 Thread guowei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guowei updated SPARK-3292:
--

Description: 
Shuffle operations such as repartition, groupBy, join, and cogroup are too expensive.
For example, if I want to save the outputs as a Hadoop file, many empty files 
are generated.

  was:
Shuffle operations such as repartition, groupBy, join, and cogroup are too 
expensive; for example, if I want to save the outputs as a Hadoop file, many 
empty files are generated.


> Shuffle Tasks run incessantly even though there's no inputs
> ---
>
> Key: SPARK-3292
> URL: https://issues.apache.org/jira/browse/SPARK-3292
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.2
>Reporter: guowei
>
> Shuffle operations such as repartition, groupBy, join, and cogroup are too expensive.
> For example, if I want to save the outputs as a Hadoop file, many empty files 
> are generated.






[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Colin B. (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114881#comment-14114881
 ] 

Colin B. commented on SPARK-3266:
-

Broken for JavaDoubleRDD: fold, reduce, min, max
{code}
java.lang.NoSuchMethodError: 
org.apache.spark.api.java.JavaDoubleRDD.fold(Ljava/lang/Double;Lorg/apache/spark/api/java/function/Function2;)Ljava/lang/Double;
java.lang.NoSuchMethodError: 
org.apache.spark.api.java.JavaDoubleRDD.reduce(Lorg/apache/spark/api/java/function/Function2;)Ljava/lang/Double;
java.lang.NoSuchMethodError: 
org.apache.spark.api.java.JavaDoubleRDD.min(Ljava/util/Comparator;)Ljava/lang/Double;
java.lang.NoSuchMethodError: 
org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
{code}

The "first" method would also be affected, but it seems that [~joshrosen] fixed 
that when he first implemented JavaDoubleRDD.

Also, I would bet JavaPairRDD.scala and JavaSchemaRDD.scala have similar issues.

> JavaDoubleRDD doesn't contain max()
> ---
>
> Key: SPARK-3266
> URL: https://issues.apache.org/jira/browse/SPARK-3266
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.0.1, 1.0.2, 1.1.0
>Reporter: Amey Chaugule
>Assignee: Josh Rosen
> Attachments: spark-repro-3266.tar.gz
>
>
> While I can compile my code, I see:
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
> when I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I 
> don't see max(), although it is clearly listed in the documentation.






[jira] [Updated] (SPARK-3292) Shuffle Tasks run incessantly even though there's no inputs

2014-08-28 Thread guowei (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

guowei updated SPARK-3292:
--

Summary: Shuffle Tasks run incessantly even though there's no inputs  (was: 
Shuffle Tasks run indefinitely even though there's no inputs)

> Shuffle Tasks run incessantly even though there's no inputs
> ---
>
> Key: SPARK-3292
> URL: https://issues.apache.org/jira/browse/SPARK-3292
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.2
>Reporter: guowei
>
> Shuffle operations such as repartition, groupBy, join, and cogroup are too 
> expensive; for example, if I want to save the outputs as a Hadoop file, many 
> empty files are generated.






[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation

2014-08-28 Thread Burak Yavuz (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114873#comment-14114873
 ] 

Burak Yavuz commented on SPARK-3280:


I don't have as detailed a comparison as Josh's, but for MLlib algorithms, 
sort-based shuffle didn't show the performance boosts Josh has shown. 16 
m3.2xlarge instances were used for these experiments. The difference here is 
that I used 128 partitions, far fewer than in Josh's experiments.

!hash-sort-comp.png!
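
For reference, a sketch of how the two implementations would be toggled for such runs, assuming Spark 1.1's spark.shuffle.manager setting (the app name is arbitrary):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("shuffle-comparison")
  .set("spark.shuffle.manager", "sort")   // "hash" for the baseline runs
{code}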

> Made sort-based shuffle the default implementation
> --
>
> Key: SPARK-3280
> URL: https://issues.apache.org/jira/browse/SPARK-3280
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Attachments: hash-sort-comp.png
>
>
> sort-based shuffle has lower memory usage and seems to outperform hash-based 
> in almost all of our testing.






[jira] [Updated] (SPARK-3280) Made sort-based shuffle the default implementation

2014-08-28 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-3280:
---

Attachment: hash-sort-comp.png

> Made sort-based shuffle the default implementation
> --
>
> Key: SPARK-3280
> URL: https://issues.apache.org/jira/browse/SPARK-3280
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Attachments: hash-sort-comp.png
>
>
> sort-based shuffle has lower memory usage and seems to outperform hash-based 
> in almost all of our testing.






[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114870#comment-14114870
 ] 

Josh Rosen commented on SPARK-3266:
---

[~pwendell] I think the JavaRDDLike trait compiles down to a Java interface 
(named JavaRDDLike) and an abstract class named JavaRDDLike$class that contains 
the implementations of the trait's members.  After this change, I think 
JavaRDDLike would compile into a Java abstract base class with the same name 
and we wouldn't have a separate interface.

My concern here is that it's going to be a _huge_ pain to find and fix all of 
the possible issues that could be caused by this being a trait instead of an 
abstract base class.  Having it be a trait was a mistake that we should have 
caught and fixed earlier.
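
A small sketch of the compilation difference being described, with made-up names and the Scala 2.10-era trait encoding (not Spark's actual sources):

{code}
// A trait with a concrete member...
trait RDDLikeDemo {
  def first(): String = "from trait"
}
// ...compiles to an interface RDDLikeDemo plus a synthetic RDDLikeDemo$class that
// holds the implementation as a static method; Java binaries link against the interface.

// The same member on an abstract class is an ordinary method on a single class file:
abstract class RDDLikeDemoAbs {
  def first(): String = "from abstract class"
}
{code}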

> JavaDoubleRDD doesn't contain max()
> ---
>
> Key: SPARK-3266
> URL: https://issues.apache.org/jira/browse/SPARK-3266
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.0.1, 1.0.2, 1.1.0
>Reporter: Amey Chaugule
>Assignee: Josh Rosen
> Attachments: spark-repro-3266.tar.gz
>
>
> While I can compile my code, I see:
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
> when I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I 
> don't see max(), although it is clearly listed in the documentation.






[jira] [Updated] (SPARK-3294) Avoid boxing/unboxing when handling in-memory columnar storage

2014-08-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-3294:


Assignee: Cheng Lian

> Avoid boxing/unboxing when handling in-memory columnar storage
> --
>
> Key: SPARK-3294
> URL: https://issues.apache.org/jira/browse/SPARK-3294
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>Priority: Critical
>
> When Spark SQL in-memory columnar storage was implemented, we tried to avoid 
> boxing/unboxing costs as much as possible, but {{javap}} shows that there 
> still exists code that involves boxing/unboxing on critical paths due to type 
> erasure, especially in methods of sub-classes of {{ColumnType}}. We should 
> eliminate it whenever possible for better performance.






[jira] [Updated] (SPARK-2219) AddJar doesn't work

2014-08-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2219:


Assignee: Cheng Lian  (was: Michael Armbrust)

> AddJar doesn't work
> ---
>
> Key: SPARK-2219
> URL: https://issues.apache.org/jira/browse/SPARK-2219
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cheng Lian
>







[jira] [Updated] (SPARK-2973) Add a way to show tables without executing a job

2014-08-28 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2973:


Assignee: Cheng Lian

> Add a way to show tables without executing a job
> 
>
> Key: SPARK-2973
> URL: https://issues.apache.org/jira/browse/SPARK-2973
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Aaron Davidson
>Assignee: Cheng Lian
>Priority: Critical
>
> Right now, sql("show tables").collect() will start a Spark job which shows up 
> in the UI. There should be a way to get the list of tables without running a job.






[jira] [Commented] (SPARK-3294) Avoid boxing/unboxing when handling in-memory columnar storage

2014-08-28 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114836#comment-14114836
 ] 

Cheng Lian commented on SPARK-3294:
---

[~marmbrus], would you mind assigning this issue to me?

> Avoid boxing/unboxing when handling in-memory columnar storage
> --
>
> Key: SPARK-3294
> URL: https://issues.apache.org/jira/browse/SPARK-3294
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.0.2, 1.1.0
>Reporter: Cheng Lian
>Priority: Critical
>
> When Spark SQL in-memory columnar storage was implemented, we tried to avoid 
> boxing/unboxing costs as much as possible, but {{javap}} shows that there 
> still exists code that involves boxing/unboxing on critical paths due to type 
> erasure, especially in methods of sub-classes of {{ColumnType}}. We should 
> eliminate it whenever possible for better performance.






[jira] [Created] (SPARK-3294) Avoid boxing/unboxing when handling in-memory columnar storage

2014-08-28 Thread Cheng Lian (JIRA)
Cheng Lian created SPARK-3294:
-

 Summary: Avoid boxing/unboxing when handling in-memory columnar 
storage
 Key: SPARK-3294
 URL: https://issues.apache.org/jira/browse/SPARK-3294
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.2, 1.1.0
Reporter: Cheng Lian
Priority: Critical


When Spark SQL in-memory columnar storage was implemented, we tried to avoid 
boxing/unboxing costs as much as possible, but {{javap}} shows that there still 
exists code that involves boxing/unboxing on critical paths due to type erasure, 
especially in methods of sub-classes of {{ColumnType}}. We should eliminate it 
whenever possible for better performance.
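
A toy illustration of the erasure effect described above; the class names are made up and are not the actual {{ColumnType}} hierarchy:

{code}
import java.nio.ByteBuffer

abstract class ColumnTypeDemo[T] {
  // Erased to append(Object, ByteBuffer), so primitive arguments are boxed
  // whenever callers go through this generic signature.
  def append(v: T, buffer: ByteBuffer): Unit
}

object IntColumnTypeDemo extends ColumnTypeDemo[Int] {
  override def append(v: Int, buffer: ByteBuffer): Unit = buffer.putInt(v)
  // A primitive-specific entry point avoids the boxing when called directly.
  def appendInt(v: Int, buffer: ByteBuffer): Unit = buffer.putInt(v)
}
{code}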






[jira] [Commented] (SPARK-3292) Shuffle Tasks run indefinitely even though there's no inputs

2014-08-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114829#comment-14114829
 ] 

Sean Owen commented on SPARK-3292:
--

Can you elaborate on this? It's not clear whether you're reporting that the 
process hangs, runs slowly, or creates too many files.

> Shuffle Tasks run indefinitely even though there's no inputs
> 
>
> Key: SPARK-3292
> URL: https://issues.apache.org/jira/browse/SPARK-3292
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Affects Versions: 1.0.2
>Reporter: guowei
>
> Shuffle operations such as repartition, groupBy, join, and cogroup are too 
> expensive; for example, if I want to save the outputs as a Hadoop file, many 
> empty files are generated.






[jira] [Created] (SPARK-3293) YARN web UI shows "SUCCEEDED" when the driver throws an exception in yarn-client

2014-08-28 Thread wangfei (JIRA)
wangfei created SPARK-3293:
--

 Summary: YARN web UI shows "SUCCEEDED" when the driver throws an 
exception in yarn-client
 Key: SPARK-3293
 URL: https://issues.apache.org/jira/browse/SPARK-3293
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.2
Reporter: wangfei
 Fix For: 1.1.0


If the driver throws an exception, YARN's web UI (Applications -> FinalStatus) 
still shows "SUCCEEDED" instead of the expected "FAILED".
In the spark-1.0.2 release only yarn-client mode shows this, but recently 
yarn-cluster mode has become a problem as well.

To reproduce:
create a SparkContext, then throw an exception,
and watch the applications page of the YARN web UI.
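
An illustrative reproducer along those lines (the object name and exception message are made up; submit in yarn-client mode):

{code}
import org.apache.spark.{SparkConf, SparkContext}

object FailOnPurpose {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("FailOnPurpose"))
    try {
      // The driver fails here, yet YARN still reports FinalStatus = SUCCEEDED.
      throw new RuntimeException("driver failure that should be reported as FAILED")
    } finally {
      sc.stop()
    }
  }
}
{code}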






[jira] [Commented] (SPARK-3291) TestcaseName in createQueryTest should not contain ":"

2014-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114807#comment-14114807
 ] 

Apache Spark commented on SPARK-3291:
-

User 'chouqin' has created a pull request for this issue:
https://github.com/apache/spark/pull/2191

> TestcaseName in createQueryTest should not contain ":"
> --
>
> Key: SPARK-3291
> URL: https://issues.apache.org/jira/browse/SPARK-3291
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Qiping Li
>
> ":" is not allowed in file names on Windows. If a file name contains ":", the 
> file can't be checked out on a Windows system, and developers using Windows 
> must be careful not to commit the deletion of such files, which is very 
> inconvenient. 






[jira] [Created] (SPARK-3292) Shuffle Tasks run indefinitely even though there's no inputs

2014-08-28 Thread guowei (JIRA)
guowei created SPARK-3292:
-

 Summary: Shuffle Tasks run indefinitely even though there's no 
inputs
 Key: SPARK-3292
 URL: https://issues.apache.org/jira/browse/SPARK-3292
 Project: Spark
  Issue Type: Improvement
  Components: Streaming
Affects Versions: 1.0.2
Reporter: guowei


Shuffle operations such as repartition, groupBy, join, and cogroup are too 
expensive; for example, if I want to save the outputs as a Hadoop file, many 
empty files are generated.






[jira] [Commented] (SPARK-3250) More Efficient Sampling

2014-08-28 Thread Erik Erlandson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114789#comment-14114789
 ] 

Erik Erlandson commented on SPARK-3250:
---

I did some experiments with sampling that models the gaps between samples (so 
one can use iterator.drop between samples).  The results are here:

https://gist.github.com/erikerlandson/66b42d96500589f25553

There appears to be a crossover point in efficiency, around sampling 
probability p=0.3, where densities below 0.3 are best done using the new logic, 
and higher sampling densities are better done using traditional filter-based 
logic.

I need to run more tests, but the first results are promising.  At low sampling 
densities the improvement is large.
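
For readers who want the flavor of the approach without opening the gist, a rough sketch of gap sampling (an illustration only, not the code that was benchmarked):

{code}
import scala.util.Random

// Instead of testing every element with probability p, draw the gap to the next
// kept element from a geometric distribution and skip over it with drop().
def gapSample[T](it: Iterator[T], p: Double, rng: Random = new Random): Iterator[T] = {
  require(p > 0.0 && p < 1.0)
  def nextGap(): Int = {
    val u = 1.0 - rng.nextDouble()              // u in (0, 1]
    (math.log(u) / math.log(1.0 - p)).toInt     // P(gap = k) = (1 - p)^k * p
  }
  new Iterator[T] {
    private var cur = it.drop(nextGap())
    def hasNext: Boolean = cur.hasNext
    def next(): T = { val x = cur.next(); cur = cur.drop(nextGap()); x }
  }
}
{code}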

> More Efficient Sampling
> ---
>
> Key: SPARK-3250
> URL: https://issues.apache.org/jira/browse/SPARK-3250
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: RJ Nowling
>
> Sampling, as currently implemented in Spark, is an O\(n\) operation.  A 
> number of stochastic algorithms achieve speed ups by exploiting O\(k\) 
> sampling, where k is the number of data points to sample.  Examples of such 
> algorithms include KMeans MiniBatch (SPARK-2308) and Stochastic Gradient 
> Descent with mini batching.
> More efficient sampling may be achievable by packing partitions with an 
> ArrayBuffer or other data structure supporting random access.  Since many of 
> these stochastic algorithms perform repeated rounds of sampling, it may be 
> feasible to perform a transformation to change the backing data structure 
> followed by multiple rounds of sampling.






[jira] [Created] (SPARK-3291) TestcaseName in createQueryTest should not contain ":"

2014-08-28 Thread Qiping Li (JIRA)
Qiping Li created SPARK-3291:


 Summary: TestcaseName in createQueryTest should not contain ":"
 Key: SPARK-3291
 URL: https://issues.apache.org/jira/browse/SPARK-3291
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Qiping Li


":" is not allowed in file names on Windows. If a file name contains ":", the 
file can't be checked out on a Windows system, and developers using Windows 
must be careful not to commit the deletion of such files, which is very 
inconvenient. 






[jira] [Commented] (SPARK-3272) Calculate prediction for nodes separately from calculating information gain for splits in decision tree

2014-08-28 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114742#comment-14114742
 ] 

Joseph K. Bradley commented on SPARK-3272:
--

Hi Qiping, you are right; I missed that!  I like your idea of storing the 
number of instances in the InformationGainStats.  (That seems easier to 
understand than a special invalid gain value.)  For now, I would recommend 
storing the number for the node, not for the left & right child nodes.  That 
would allow you to decide if the node being considered is a leaf (not its 
children).

I agree that, eventually, we should identify if the children are leaves at the 
same time.  That should be part of [SPARK-3158], which could modify 
findBestSplits to return ImpurityCalculators (a new class from my PR 
[https://github.com/apache/spark/pull/2125]) for the left and right child 
nodes.  Does that sound reasonable?
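
Something like the following is what that could look like; the field and method names are hypothetical, not MLlib's actual InformationGainStats:

{code}
// Carry the node's instance count in the gain stats so callers can apply a
// minimum-instances-per-node check without a sentinel "invalid gain" value.
case class InformationGainStats(
    gain: Double,
    impurity: Double,
    leftImpurity: Double,
    rightImpurity: Double,
    predict: Double,
    numInstances: Long)   // added: count of instances reaching this node

def shouldBeLeaf(stats: InformationGainStats, minInstancesPerNode: Int): Boolean =
  stats.numInstances < minInstancesPerNode || stats.gain <= 0.0
{code}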

> Calculate prediction for nodes separately from calculating information gain 
> for splits in decision tree
> ---
>
> Key: SPARK-3272
> URL: https://issues.apache.org/jira/browse/SPARK-3272
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Qiping Li
> Fix For: 1.1.0
>
>
> In the current implementation, the prediction for a node is calculated along 
> with the calculation of information gain stats for each possible split. The 
> value to predict for a specific node is determined, no matter what the splits are.
> To save computation, we can calculate the prediction first and then calculate 
> the information gain stats for each split.
> This is also necessary if we want to support a minimum-instances-per-node 
> parameter ([SPARK-2207|https://issues.apache.org/jira/browse/SPARK-2207]), 
> because when no split satisfies the minimum instances requirement, we don't 
> use the information gain of any split. There should be a way to get the 
> prediction value.  






[jira] [Commented] (SPARK-2594) Add CACHE TABLE AS SELECT ...

2014-08-28 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114729#comment-14114729
 ] 

Michael Armbrust commented on SPARK-2594:
-

It's a lot of overhead to assign issues to people, but feel free to work on this 
now that you have posted here.  Please post a design here before you begin 
coding.

> Add CACHE TABLE  AS SELECT ...
> 
>
> Key: SPARK-2594
> URL: https://issues.apache.org/jira/browse/SPARK-2594
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Reporter: Michael Armbrust
>Priority: Critical
>







[jira] [Updated] (SPARK-3288) All fields in TaskMetrics should be private and use getters/setters

2014-08-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3288:
-

Affects Version/s: 1.1.0

> All fields in TaskMetrics should be private and use getters/setters
> ---
>
> Key: SPARK-3288
> URL: https://issues.apache.org/jira/browse/SPARK-3288
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Patrick Wendell
>Assignee: Andrew Or
>
> This is particularly bad because we expose this as a developer API. 
> Technically a library could create a TaskMetrics object and then change the 
> values inside of it and pass it onto someone else. It can be written pretty 
> compactly like below:
> {code}
>   /**
>* Number of bytes written for the shuffle by this task
>*/
>   @volatile private var _shuffleBytesWritten: Long = _
>   def incrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten += 
> value
>   def decrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten -= 
> value
>   def shuffleBytesWritten = _shuffleBytesWritten
> {code}






[jira] [Updated] (SPARK-3277) LZ4 compression causes the ExternalSort exception

2014-08-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3277:
-

Fix Version/s: 1.1.0

> LZ4 compression causes the ExternalSort exception
> 
>
> Key: SPARK-3277
> URL: https://issues.apache.org/jira/browse/SPARK-3277
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: hzw
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.1.0
>
> Attachments: test_lz4_bug.patch
>
>
> I tested LZ4 compression, and it ran into this problem (with wordcount).
> I also tested Snappy and LZF, and they were OK.
> Setting "spark.shuffle.spill" to false avoids the exception, but once spilling 
> is turned back on, the error comes back.
> It seems that if the number of words is small, wordcount goes through, but 
> with a more complex text the problem shows up.
> Exception info as follows:
> {code}
> java.lang.AssertionError: assertion failed
> at scala.Predef$.assert(Predef.scala:165)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.<init>(ExternalAppendOnlyMap.scala:416)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> at 
> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
> {code}
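
For anyone trying to reproduce this, the settings involved are roughly the following (Spark 1.x property names; the codec class is spelled out rather than relying on a short alias):

{code}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.io.compression.codec", "org.apache.spark.io.LZ4CompressionCodec")
  .set("spark.shuffle.spill", "true")   // the reporter avoided the error by setting this to false
{code}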






[jira] [Updated] (SPARK-3277) LZ4 compression causes the ExternalSort exception

2014-08-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3277:
-

Affects Version/s: (was: 1.2.0)

> LZ4 compression causes the ExternalSort exception
> 
>
> Key: SPARK-3277
> URL: https://issues.apache.org/jira/browse/SPARK-3277
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0
>Reporter: hzw
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.1.0
>
> Attachments: test_lz4_bug.patch
>
>
> I tested LZ4 compression, and it ran into this problem (with wordcount).
> I also tested Snappy and LZF, and they were OK.
> Setting "spark.shuffle.spill" to false avoids the exception, but once spilling 
> is turned back on, the error comes back.
> It seems that if the number of words is small, wordcount goes through, but 
> with a more complex text the problem shows up.
> Exception info as follows:
> {code}
> java.lang.AssertionError: assertion failed
> at scala.Predef$.assert(Predef.scala:165)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.<init>(ExternalAppendOnlyMap.scala:416)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> at 
> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
> {code}






[jira] [Updated] (SPARK-3234) SPARK_HADOOP_VERSION doesn't have a valid value by default in make-distribution.sh

2014-08-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3234:
---

Target Version/s: 1.2.0

> SPARK_HADOOP_VERSION doesn't have a valid value by default in 
> make-distribution.sh 
> ---
>
> Key: SPARK-3234
> URL: https://issues.apache.org/jira/browse/SPARK-3234
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 1.0.2
>Reporter: Cheng Lian
>Priority: Minor
>
> {{SPARK_HADOOP_VERSION}} has already been deprecated, but 
> {{make-distribution.sh}} uses it as part of the distribution tarball name. As 
> a result, we end up with something like {{spark-1.1.0-SNAPSHOT-bin-.tgz}} 
> because {{SPARK_HADOOP_VERSION}} is empty.
> A possible fix is to add the antrun plugin into the Maven build and run Maven 
> to print {{$hadoop.version}}. Instructions can be found in [this 
> post|http://www.avajava.com/tutorials/lessons/how-do-i-display-the-value-of-a-property.html].






[jira] [Resolved] (SPARK-3245) spark insert into hbase class not serialize

2014-08-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3245.


Resolution: Invalid

I'm closing this for now because we typically report only isolated issues on 
JIRA. Feel free to ping the Spark user list for help narrowing down the issue.

> spark insert into hbase  class not serialize
> 
>
> Key: SPARK-3245
> URL: https://issues.apache.org/jira/browse/SPARK-3245
> Project: Spark
>  Issue Type: Bug
> Environment: spark-1.0.1 + hbase-0.96.2 + hadoop-2.2.0
>Reporter: 刘勇
>
> val result: org.apache.spark.rdd.RDD[(String, Int)]
>  result.foreach(res =>{
>   var put = new 
> Put(java.util.UUID.randomUUID().toString.reverse.getBytes())
>.add("lv6".getBytes(), res._1.toString.getBytes(), 
> res._2.toString.getBytes)
>   table.put(put) 
>   } 
>   )
> Exception in thread "Thread-3" java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:186)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task not serializable: java.io.NotSerializableException: 
> org.apache.hadoop.hbase.client.HTablePool$PooledHTable
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:771)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:901)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16$$anonfun$apply$1.apply(DAGScheduler.scala:898)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16$$anonfun$apply$1.apply(DAGScheduler.scala:898)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16.apply(DAGScheduler.scala:898)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$16.apply(DAGScheduler.scala:897)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:897)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1226)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
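
For what it's worth, the usual workaround for this class of NotSerializableException is to create the HBase client inside the closure, once per partition, instead of capturing it from the driver. A sketch (the table name is made up):

{code}
result.foreachPartition { rows =>
  // Build the non-serializable HBase objects on the executor, per partition.
  val conf = org.apache.hadoop.hbase.HBaseConfiguration.create()
  val table = new org.apache.hadoop.hbase.client.HTable(conf, "my_table")
  rows.foreach { case (key, count) =>
    val put = new org.apache.hadoop.hbase.client.Put(
      java.util.UUID.randomUUID().toString.reverse.getBytes())
    put.add("lv6".getBytes(), key.getBytes(), count.toString.getBytes)
    table.put(put)
  }
  table.close()
}
{code}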






[jira] [Updated] (SPARK-3200) Class defined with reference to external variables crashes in REPL.

2014-08-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3200?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3200:
---

Component/s: Spark Shell

> Class defined with reference to external variables crashes in REPL.
> ---
>
> Key: SPARK-3200
> URL: https://issues.apache.org/jira/browse/SPARK-3200
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Shell
>Affects Versions: 1.1.0
>Reporter: Prashant Sharma
>Assignee: Prashant Sharma
>
> Reproducer:
> {noformat}
> val a = sc.textFile("README.md").count
> case class A(i: Int) { val j = a} 
> sc.parallelize(1 to 10).map(A(_)).collect()
> {noformat}
> This happens when one refers to something that itself refers to sc, and not otherwise. 
> There are many ways to work around this, such as directly assigning a constant 
> value instead of referring to the variable. 






[jira] [Updated] (SPARK-2636) Expose job ID in JobWaiter API

2014-08-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2636:
---

Summary: Expose job ID in JobWaiter API  (was: no where to get job 
identifier while submit spark job through spark API)

> Expose job ID in JobWaiter API
> --
>
> Key: SPARK-2636
> URL: https://issues.apache.org/jira/browse/SPARK-2636
> Project: Spark
>  Issue Type: New Feature
>  Components: Java API
>Reporter: Chengxiang Li
>  Labels: hive
>
> In Hive on Spark, we want to track Spark job status through the Spark API. The 
> basic idea is as follows:
> # create a Hive-specific Spark listener and register it with the Spark listener 
> bus.
> # the Hive-specific Spark listener generates job status from Spark listener events.
> # the Hive driver tracks job status through the Hive-specific Spark listener. 
> The current problem is that the Hive driver needs a job identifier to track a 
> specific job's status through the Spark listener, but there is no Spark API to 
> get a job identifier (like a job id) when submitting a Spark job.
> I think any other project that tries to track job status with the Spark API 
> would suffer from this as well.
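
For illustration, a sketch of the listener side of this (standard SparkListener callbacks; the class name is made up). The missing piece this issue asks for is an API that returns the job id at submission time so the driver can correlate it with these events:

{code}
import java.util.concurrent.ConcurrentHashMap
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

class HiveJobStateListener extends SparkListener {
  private val states = new ConcurrentHashMap[Int, String]()

  override def onJobStart(jobStart: SparkListenerJobStart): Unit =
    states.put(jobStart.jobId, "RUNNING")

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    states.put(jobEnd.jobId, jobEnd.jobResult.toString)

  // Only usable if the caller knows which jobId its action produced.
  def stateOf(jobId: Int): Option[String] = Option(states.get(jobId))
}
{code}

The listener itself could be registered with sc.addSparkListener; what is missing is a way to learn the jobId of a submitted action.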






[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114680#comment-14114680
 ] 

Patrick Wendell commented on SPARK-3266:


[~joshrosen] is there a solution here that preserves binary compatibility? 
That's been our goal at this point and we've maintained it by and large except 
for a few very minor mandatory Scala 2.11 upgrades.

> JavaDoubleRDD doesn't contain max()
> ---
>
> Key: SPARK-3266
> URL: https://issues.apache.org/jira/browse/SPARK-3266
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.0.1, 1.0.2, 1.1.0
>Reporter: Amey Chaugule
>Assignee: Josh Rosen
> Attachments: spark-repro-3266.tar.gz
>
>
> While I can compile my code, I see:
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
> when I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I 
> don't see max(), although it is clearly listed in the documentation.






[jira] [Commented] (SPARK-2961) Use statistics to skip partitions when reading from in-memory columnar data

2014-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114668#comment-14114668
 ] 

Apache Spark commented on SPARK-2961:
-

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/2188

> Use statistics to skip partitions when reading from in-memory columnar data
> ---
>
> Key: SPARK-2961
> URL: https://issues.apache.org/jira/browse/SPARK-2961
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>







[jira] [Created] (SPARK-3290) No unpersist calls in SVDPlusPlus

2014-08-28 Thread Dou Wenjuan (JIRA)
Dou Wenjuan created SPARK-3290:
--

 Summary: No unpersist calls in SVDPlusPlus
 Key: SPARK-3290
 URL: https://issues.apache.org/jira/browse/SPARK-3290
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.2
Reporter: Dou Wenjuan


The implementation of SVDPlusPlus caches the graph produced by each iteration 
and never unpersists it, so as the iterations go on, more and more stale graphs 
stay cached and the job eventually runs out of memory.
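
A sketch of the kind of fix being suggested, using the GraphX unpersist calls; initialGraph, maxIters, and the iterate step are placeholders for the actual SVD++ update:

{code}
var g = initialGraph.cache()
for (i <- 0 until maxIters) {
  val prev = g
  g = iterate(g).cache()            // one SVD++ update step (placeholder)
  g.edges.count()                   // materialize the new graph first
  prev.unpersistVertices(blocking = false)
  prev.edges.unpersist(blocking = false)
}
{code}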






[jira] [Resolved] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled

2014-08-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian resolved SPARK-2970.
---

Resolution: Fixed

> spark-sql script ends with IOException when EventLogging is enabled
> ---
>
> Key: SPARK-2970
> URL: https://issues.apache.org/jira/browse/SPARK-2970
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
> Environment: CDH5.1.0 (Hadoop 2.3.0)
>Reporter: Kousuke Saruta
>Priority: Critical
> Fix For: 1.1.0
>
>
> When the spark-sql script runs with spark.eventLog.enabled set to true, it 
> ends with an IOException because FileLogger cannot create the 
> APPLICATION_COMPLETE file in HDFS.
> This is because the shutdown hook of SparkSQLCLIDriver is executed after the 
> shutdown hook of org.apache.hadoop.fs.FileSystem is executed.
> When spark.eventLog.enabled is true, the SparkSQLCLIDriver hook finally tries 
> to create a file to mark the application finished, but the FileSystem hook 
> tries to close the FileSystem.






[jira] [Resolved] (SPARK-3277) LZ4 compression causes the ExternalSort exception

2014-08-28 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3277.


Resolution: Fixed

Fixed by https://github.com/apache/spark/pull/2187

Thanks to everyone who helped isolate and debug this.

> LZ4 compression causes the ExternalSort exception
> 
>
> Key: SPARK-3277
> URL: https://issues.apache.org/jira/browse/SPARK-3277
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: hzw
>Assignee: Andrew Or
>Priority: Blocker
> Attachments: test_lz4_bug.patch
>
>
> I tested LZ4 compression, and it ran into this problem (with wordcount).
> I also tested Snappy and LZF, and they were OK.
> Setting "spark.shuffle.spill" to false avoids the exception, but once spilling 
> is turned back on, the error comes back.
> It seems that if the number of words is small, wordcount goes through, but 
> with a more complex text the problem shows up.
> Exception info as follows:
> {code}
> java.lang.AssertionError: assertion failed
> at scala.Predef$.assert(Predef.scala:165)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.<init>(ExternalAppendOnlyMap.scala:416)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> at 
> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
> {code}






[jira] [Commented] (SPARK-3287) When ResourceManager High Availability is enabled, ApplicationMaster webUI is not displayed.

2014-08-28 Thread Benoy Antony (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114612#comment-14114612
 ] 

Benoy Antony commented on SPARK-3287:
-

I'll submit a git pull request.

> When ResourceManager High Availability is enabled, ApplicationMaster webUI is 
> not displayed.
> 
>
> Key: SPARK-3287
> URL: https://issues.apache.org/jira/browse/SPARK-3287
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.0.2
>Reporter: Benoy Antony
> Attachments: SPARK-3287.patch
>
>
> When ResourceManager High Availability is enabled, there will be multiple 
> resource managers and each of them could act as a proxy.
> AmIpFilter is modified to accept multiple proxy hosts. But Spark 
> ApplicationMaster fails to read the ResourceManager IPs properly from the 
> configuration.
> So AmIpFilter is initialized with an empty set of proxy hosts, and any access 
> to the ApplicationMaster WebUI will be redirected to the RM port on the 
> local host. 






[jira] [Commented] (SPARK-3286) Cannot view ApplicationMaster UI when Yarn’s url scheme is https

2014-08-28 Thread Benoy Antony (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114614#comment-14114614
 ] 

Benoy Antony commented on SPARK-3286:
-

I'll submit a git pull request.

> Cannot view ApplicationMaster UI when Yarn’s url scheme is https
> 
>
> Key: SPARK-3286
> URL: https://issues.apache.org/jira/browse/SPARK-3286
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.2
>Reporter: Benoy Antony
> Attachments: SPARK-3286-branch-1-0.patch, SPARK-3286.patch
>
>
> The Spark ApplicationMaster starts its web UI at http://<host>:port.
> When the Spark ApplicationMaster registers its URL with the Resource Manager, 
> the URL does not contain a URI scheme.
> If the URL scheme is absent, the Resource Manager's web app proxy will use the 
> HTTP policy of the Resource Manager (YARN-1553).
> If the HTTP policy of the Resource Manager is https, then the web app proxy 
> will try to access https://<host>:port.
> This will result in an error.






[jira] [Commented] (SPARK-3289) Avoid job failures due to rescheduling of failing tasks on buggy machines

2014-08-28 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114611#comment-14114611
 ] 

Mark Hamstra commented on SPARK-3289:
-

https://github.com/apache/spark/pull/1360

> Avoid job failures due to rescheduling of failing tasks on buggy machines
> -
>
> Key: SPARK-3289
> URL: https://issues.apache.org/jira/browse/SPARK-3289
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Josh Rosen
>
> Some users have reported issues where a task fails due to an environment / 
> configuration issue on some machine, then the task is reattempted _on that 
> same buggy machine_ until the entire job fails because that single task 
> has failed too many times.
> To guard against this, maybe we should add some randomization in how we 
> reschedule failed tasks.






[jira] [Commented] (SPARK-3272) Calculate prediction for nodes separately from calculating information gain for splits in decision tree

2014-08-28 Thread Qiping Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114591#comment-14114591
 ] 

Qiping Li commented on SPARK-3272:
--

Hi Joseph, thanks for your comment. I think checking the number of instances 
can't be done in the train() method, because we don't know the number of 
instances for the leftSplit or rightSplit; for each split we can only get 
information from InformationGainStats, which doesn't contain instance counts. 
In my implementation of SPARK-2207 the check is done in calculateGainForSplit: 
when the check fails, an invalid information gain is returned, and the 
calculation of the predict value may be skipped in that case.

Maybe we can include the number of instances for the leftSplit and rightSplit 
in the information gain stats and calculate the predict value no matter whether 
the check passes or not. Either is fine with me.

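For concreteness, here is a rough Scala sketch of the second option. Names such 
as GainStats, leftCount and rightCount are illustrative only, not the actual 
MLlib classes: the gain stats carry the per-child instance counts, the split 
filter uses them, and the node's prediction is assumed to have been computed 
beforehand, independently of any split.

{code}
// Sketch of the proposal only; names are illustrative, not the MLlib API.
case class GainStats(
    gain: Double,
    leftCount: Long,    // instances routed to the left child
    rightCount: Long)   // instances routed to the right child

def bestValidSplit(stats: Seq[GainStats], minInstancesPerNode: Long): Option[GainStats] =
  stats
    .filter(s => s.leftCount >= minInstancesPerNode && s.rightCount >= minInstancesPerNode)
    .sortBy(-_.gain)
    .headOption  // None => the node becomes a leaf; its prediction was computed earlier
{code}
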
> Calculate prediction for nodes separately from calculating information gain 
> for splits in decision tree
> ---
>
> Key: SPARK-3272
> URL: https://issues.apache.org/jira/browse/SPARK-3272
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Qiping Li
> Fix For: 1.1.0
>
>
> In current implementation, prediction for a node is calculated along with 
> calculation of information gain stats for each possible splits. The value to 
> predict for a specific node is determined, no matter what the splits are.
> To save computation, we can first calculate prediction first and then 
> calculate information gain stats for each split.
> This is also necessary if we want to support minimum instances per node 
> parameters([SPARK-2207|https://issues.apache.org/jira/browse/SPARK-2207]) 
> because when all splits don't satisfy minimum instances requirement , we 
> don't use information gain of any splits. There should be a way to get the 
> prediction value.  



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114551#comment-14114551
 ] 

Apache Spark commented on SPARK-3277:
-

User 'andrewor14' has created a pull request for this issue:
https://github.com/apache/spark/pull/2187

> LZ4 compression cause the the ExternalSort exception
> 
>
> Key: SPARK-3277
> URL: https://issues.apache.org/jira/browse/SPARK-3277
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: hzw
>Assignee: Andrew Or
>Priority: Blocker
> Attachments: test_lz4_bug.patch
>
>
> I tested the LZ4 compression,and it come up with such problem.(with wordcount)
> Also I tested the snappy and LZF,and they were OK.
> At last I set the  "spark.shuffle.spill" as false to avoid such exeception, 
> but once open this "switch", this error would come.
> It seems that if num of the[ words is few, wordcount will go through,but if 
> it is a complex text ,this problem will show
> Exeception Info as follow:
> {code}
> java.lang.AssertionError: assertion failed
> at scala.Predef$.assert(Predef.scala:165)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.(ExternalAppendOnlyMap.scala:416)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> at 
> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114484#comment-14114484
 ] 

Mridul Muralidharan commented on SPARK-3277:


Sounds great, thanks!
I suspect it is because for LZO we configure it to write a block on flush (a 
partial block if there is not enough data to fill one), but for LZ4 either such 
a config does not exist or we don't use it.
The result is that flush becomes a no-op when the data in the current block is 
insufficient to produce a compressed block, while close forces the partial 
block to be written out.

That would explain why the assertion lists all sizes as 0.

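A minimal, codec-agnostic way to test that hypothesis (plain java.io, not Spark 
code; the codec wrapper is supplied by the caller) is to compare the on-disk 
length right after flush() with the length after close():

{code}
// Hedged sketch: for a block codec whose flush() is a no-op on a partial block,
// afterFlush can be 0 while afterClose is non-zero.
import java.io.{File, FileOutputStream, OutputStream}

def flushVsClose(wrapWithCodec: OutputStream => OutputStream): (Long, Long) = {
  val file = File.createTempFile("codec-flush-test", ".bin")
  val out = wrapWithCodec(new FileOutputStream(file))
  out.write(Array.fill[Byte](128)(1.toByte))  // smaller than a typical compression block
  out.flush()
  val afterFlush = file.length()
  out.close()
  val afterClose = file.length()
  (afterFlush, afterClose)
}
{code}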

> LZ4 compression cause the the ExternalSort exception
> 
>
> Key: SPARK-3277
> URL: https://issues.apache.org/jira/browse/SPARK-3277
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: hzw
>Assignee: Andrew Or
>Priority: Blocker
> Attachments: test_lz4_bug.patch
>
>
> I tested the LZ4 compression,and it come up with such problem.(with wordcount)
> Also I tested the snappy and LZF,and they were OK.
> At last I set the  "spark.shuffle.spill" as false to avoid such exeception, 
> but once open this "switch", this error would come.
> It seems that if num of the[ words is few, wordcount will go through,but if 
> it is a complex text ,this problem will show
> Exeception Info as follow:
> {code}
> java.lang.AssertionError: assertion failed
> at scala.Predef$.assert(Predef.scala:165)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.(ExternalAppendOnlyMap.scala:416)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> at 
> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3289) Avoid job failures due to rescheduling of failing tasks on buggy machines

2014-08-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3289:
--

Summary: Avoid job failures due to rescheduling of failing tasks on buggy 
machines  (was: Prevent complete job failures due to rescheduling of failing 
tasks on buggy machines)

> Avoid job failures due to rescheduling of failing tasks on buggy machines
> -
>
> Key: SPARK-3289
> URL: https://issues.apache.org/jira/browse/SPARK-3289
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Josh Rosen
>
> Some users have reported issues where a task fails due to an environment / 
> configuration issue on some machine, then the task is reattempted _on that 
> same buggy machine_ until the entire job failures because that single task 
> has failed too many times.
> To guard against this, maybe we should add some randomization in how we 
> reschedule failed tasks.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3289) Prevent complete job failures due to rescheduling of failing tasks on buggy machines

2014-08-28 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-3289:
-

 Summary: Prevent complete job failures due to rescheduling of 
failing tasks on buggy machines
 Key: SPARK-3289
 URL: https://issues.apache.org/jira/browse/SPARK-3289
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Josh Rosen


Some users have reported issues where a task fails due to an environment / 
configuration issue on some machine, and the task is then reattempted _on that 
same buggy machine_ until the entire job fails because that single task has 
failed too many times.

To guard against this, maybe we should add some randomization in how we 
reschedule failed tasks.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3190) Creation of large graph(> 2.15 B nodes) seems to be broken:possible overflow somewhere

2014-08-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3190.
---

   Resolution: Fixed
Fix Version/s: 1.0.3
   1.1.1
   1.2.0

Issue resolved by pull request 2106
[https://github.com/apache/spark/pull/2106]

> Creation of large graph(> 2.15 B nodes) seems to be broken:possible overflow 
> somewhere 
> ---
>
> Key: SPARK-3190
> URL: https://issues.apache.org/jira/browse/SPARK-3190
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 1.0.3
> Environment: Standalone mode running on EC2 . Using latest code from 
> master branch upto commit #db56f2df1b8027171da1b8d2571d1f2ef1e103b6 .
>Reporter: npanj
>Assignee: Ankur Dave
>Priority: Critical
> Fix For: 1.2.0, 1.1.1, 1.0.3
>
>
> While creating a graph with 6B nodes and 12B edges, I noticed that 
> 'numVertices' api returns incorrect result; 'numEdges' reports correct 
> number. For few times(with different dataset > 2.5B nodes) I have also 
> notices that numVertices is returned as -ive number; so I suspect that there 
> is some overflow (may be we are using Int for some field?).
> Here is some details of experiments  I have done so far: 
> 1. Input: numNodes=6101995593 ; noEdges=12163784626
>Graph returns: numVertices=1807028297 ;  numEdges=12163784626
> 2. Input : numNodes=2157586441 ; noEdges=2747322705
>Graph Returns: numVertices=-2137380855 ;  numEdges=2747322705
> 3. Input: numNodes=1725060105 ; noEdges=204176821
>Graph: numVertices=1725060105 ;  numEdges=2041768213
> You can find the code to generate this bug here: 
> https://gist.github.com/npanj/92e949d86d08715bf4bf
> Note: Nodes are labeled are 1...6B .
>  



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114453#comment-14114453
 ] 

Apache Spark commented on SPARK-3266:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/2186

> JavaDoubleRDD doesn't contain max()
> ---
>
> Key: SPARK-3266
> URL: https://issues.apache.org/jira/browse/SPARK-3266
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.0.1, 1.0.2, 1.1.0
>Reporter: Amey Chaugule
>Assignee: Josh Rosen
> Attachments: spark-repro-3266.tar.gz
>
>
> While I can compile my code, I see:
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
> When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I 
> don't notice max()
> although it is clearly listed in the documentation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3286) Cannot view ApplicationMaster UI when Yarn’s url scheme is https

2014-08-28 Thread Benoy Antony (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoy Antony updated SPARK-3286:


Attachment: (was: SPARK-3286.patch)

> Cannot view ApplicationMaster UI when Yarn’s url scheme is https
> 
>
> Key: SPARK-3286
> URL: https://issues.apache.org/jira/browse/SPARK-3286
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.2
>Reporter: Benoy Antony
> Attachments: SPARK-3286-branch-1-0.patch, SPARK-3286.patch
>
>
> The spark Application Master starts its web UI at http://:port.
> When Spark ApplicationMaster registers its URL with Resource Manager , the 
> URL does not contain URI scheme.
> If the URL scheme is absent, Resource Manager’s web app proxy will use the 
> HTTP Policy of the Resource Manager.(YARN-1553)
> If the HTTP Policy of the Resource Manager is https, then web app proxy  will 
> try to access https://:port.
> This will result in error.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3286) Cannot view ApplicationMaster UI when Yarn’s url scheme is https

2014-08-28 Thread Benoy Antony (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoy Antony updated SPARK-3286:


Attachment: SPARK-3286-branch-1-0.patch

> Cannot view ApplicationMaster UI when Yarn’s url scheme is https
> 
>
> Key: SPARK-3286
> URL: https://issues.apache.org/jira/browse/SPARK-3286
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.2
>Reporter: Benoy Antony
> Attachments: SPARK-3286-branch-1-0.patch, SPARK-3286.patch
>
>
> The spark Application Master starts its web UI at http://:port.
> When Spark ApplicationMaster registers its URL with Resource Manager , the 
> URL does not contain URI scheme.
> If the URL scheme is absent, Resource Manager’s web app proxy will use the 
> HTTP Policy of the Resource Manager.(YARN-1553)
> If the HTTP Policy of the Resource Manager is https, then web app proxy  will 
> try to access https://:port.
> This will result in error.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3286) Cannot view ApplicationMaster UI when Yarn’s url scheme is https

2014-08-28 Thread Benoy Antony (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoy Antony updated SPARK-3286:


Attachment: SPARK-3286.patch

Attaching the patch for the master

> Cannot view ApplicationMaster UI when Yarn’s url scheme is https
> 
>
> Key: SPARK-3286
> URL: https://issues.apache.org/jira/browse/SPARK-3286
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.2
>Reporter: Benoy Antony
> Attachments: SPARK-3286.patch, SPARK-3286.patch
>
>
> The spark Application Master starts its web UI at http://:port.
> When Spark ApplicationMaster registers its URL with Resource Manager , the 
> URL does not contain URI scheme.
> If the URL scheme is absent, Resource Manager’s web app proxy will use the 
> HTTP Policy of the Resource Manager.(YARN-1553)
> If the HTTP Policy of the Resource Manager is https, then web app proxy  will 
> try to access https://:port.
> This will result in error.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3277:
---

Description: 
I tested LZ4 compression and ran into the following problem (with wordcount).
I also tested Snappy and LZF, and they were OK.
In the end I set "spark.shuffle.spill" to false to avoid the exception, but as 
soon as that "switch" is turned back on, the error returns.
It seems that if the number of words is small, wordcount goes through, but with 
a more complex text the problem shows up.
Exception info as follows:
{code}
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:165)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.(ExternalAppendOnlyMap.scala:416)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
at 
org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
{code}


  was:
I tested the LZ4 compression,and it come up with such problem.(with wordcount)
Also I tested the snappy and LZF,and they were OK.
At last I set the  "spark.shuffle.spill" as false to avoid such exeception, but 
once open this "switch", this error would come.
It seems that if num of the words is few, wordcount will go through,but if it 
is a complex text ,this problem will show
Exeception Info as follow:
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:165)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.(ExternalAppendOnlyMap.scala:416)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
at 
org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)



> LZ4 compression cause the the ExternalSort exception
> 
>
> Key: SPARK-3277
> URL: https://issues.apache.org/jira/browse/SPARK-3277
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: hzw
>Priority: Blocker
> Attachments: test_lz4_bug.patch
>
>
> I tested the LZ4 compression,and it come up with such problem.(with wordcount)
> Also I tested the snappy and LZF,and they were OK.
> At last I set the  "spark.shuffle.spill" as false to avoid such exeception, 
> but once open this "switch", this error would come.
> It seems that if num of the[ words is few, wordcount will go through,but if 
> it is a complex text ,this problem will show
> Exeception Info as follow:
> {code}
> java.lang.AssertionError: assertion failed
> at scala.Predef$.assert(Predef.scala:165)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.(ExternalAppendOnlyMap.scala:416)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> at 
> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleM

[jira] [Updated] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3277:
---

Assignee: Andrew Or

> LZ4 compression cause the the ExternalSort exception
> 
>
> Key: SPARK-3277
> URL: https://issues.apache.org/jira/browse/SPARK-3277
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: hzw
>Assignee: Andrew Or
>Priority: Blocker
> Attachments: test_lz4_bug.patch
>
>
> I tested the LZ4 compression,and it come up with such problem.(with wordcount)
> Also I tested the snappy and LZF,and they were OK.
> At last I set the  "spark.shuffle.spill" as false to avoid such exeception, 
> but once open this "switch", this error would come.
> It seems that if num of the[ words is few, wordcount will go through,but if 
> it is a complex text ,this problem will show
> Exeception Info as follow:
> {code}
> java.lang.AssertionError: assertion failed
> at scala.Predef$.assert(Predef.scala:165)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.(ExternalAppendOnlyMap.scala:416)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> at 
> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3266:
--

Affects Version/s: 1.1.0

JavaRDDLike probably should be an abstract class.  I think the current trait 
implementation was a holdover from an earlier prototype that attempted to 
achieve higher code reuse for operations like map() and filter().

I added a test case to JavaAPISuite that reproduces this issue on master, too.

The simplest solution is probably to make JavaRDDLike into an abstract class.  
I think we can do this while maintaining source compatibility.  A less invasive 
but messier solution would be to just copy the implementations of max() and 
min() into each Java*RDD class and remove them from the trait.

> JavaDoubleRDD doesn't contain max()
> ---
>
> Key: SPARK-3266
> URL: https://issues.apache.org/jira/browse/SPARK-3266
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.0.1, 1.0.2, 1.1.0
>Reporter: Amey Chaugule
>Assignee: Josh Rosen
> Attachments: spark-repro-3266.tar.gz
>
>
> While I can compile my code, I see:
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
> When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I 
> don't notice max()
> although it is clearly listed in the documentation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3260) Yarn - pass acls along with executor launch

2014-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3260?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114408#comment-14114408
 ] 

Apache Spark commented on SPARK-3260:
-

User 'tgravescs' has created a pull request for this issue:
https://github.com/apache/spark/pull/2185

> Yarn - pass acls along with executor launch
> ---
>
> Key: SPARK-3260
> URL: https://issues.apache.org/jira/browse/SPARK-3260
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.1.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>
> In https://github.com/apache/spark/pull/1196 I added passing the spark view 
> and modify acls into yarn.  Unfortunately we are only passing them into the 
> application master and I missed passing them in when we launch individual 
> containers (executors). 
> We need to modify the ExecutorRunnable.startContainer to set the acls in the 
> ContainerLaunchContext.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Amey Chaugule (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114413#comment-14114413
 ] 

Amey Chaugule commented on SPARK-3266:
--

No worries, I initially assumed my runtime env was old too until I rechecked.

> JavaDoubleRDD doesn't contain max()
> ---
>
> Key: SPARK-3266
> URL: https://issues.apache.org/jira/browse/SPARK-3266
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.0.1, 1.0.2, 1.1.0
>Reporter: Amey Chaugule
>Assignee: Josh Rosen
> Attachments: spark-repro-3266.tar.gz
>
>
> While I can compile my code, I see:
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
> When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I 
> don't notice max()
> although it is clearly listed in the documentation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114391#comment-14114391
 ] 

Sean Owen commented on SPARK-3266:
--

(Mea culpa! The example shows this is a legitimate question. I'll be quiet now.)

> JavaDoubleRDD doesn't contain max()
> ---
>
> Key: SPARK-3266
> URL: https://issues.apache.org/jira/browse/SPARK-3266
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.0.1, 1.0.2
>Reporter: Amey Chaugule
>Assignee: Josh Rosen
> Attachments: spark-repro-3266.tar.gz
>
>
> While I can compile my code, I see:
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
> When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I 
> don't notice max()
> although it is clearly listed in the documentation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Colin B. (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114387#comment-14114387
 ] 

Colin B. commented on SPARK-3266:
-

So there is no method:
{code}
org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
{code}
but there is a method:
{code}
org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Object;
{code}

The return type is part of a method's signature in Java bytecode, so the two 
are different methods (one returns a Double, the other an Object).

This looks like a Scala type-erasure-related issue. The Spark/Scala code 
generated for JavaRDDLike includes a max method that returns an Object. In 
JavaDoubleRDD the type parameter is bound to Double, so Java code that calls 
max on JavaDoubleRDD expects a method returning Double. Since max is 
implemented in the JavaRDDLike trait, the Java code doesn't seem to inherit it 
correctly when type parameters are involved.

I tested making JavaRDDLike an abstract class instead of a trait. It was able 
to compile and run correctly. However, it is not compatible with 1.0.2.

> JavaDoubleRDD doesn't contain max()
> ---
>
> Key: SPARK-3266
> URL: https://issues.apache.org/jira/browse/SPARK-3266
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.0.1, 1.0.2
>Reporter: Amey Chaugule
>Assignee: Josh Rosen
> Attachments: spark-repro-3266.tar.gz
>
>
> While I can compile my code, I see:
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
> When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I 
> don't notice max()
> although it is clearly listed in the documentation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3183) Add option for requesting full YARN cluster

2014-08-28 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114368#comment-14114368
 ] 

Shay Rojansky commented on SPARK-3183:
--

+1.

As a current workaround for cores, we specify a number well beyond the YARN 
cluster capacity. This gets handled well by Spark/YARN, and we get the entire 
cluster.

> Add option for requesting full YARN cluster
> ---
>
> Key: SPARK-3183
> URL: https://issues.apache.org/jira/browse/SPARK-3183
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Reporter: Sandy Ryza
>
> This could possibly be in the form of --executor-cores ALL --executor-memory 
> ALL --num-executors ALL.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114362#comment-14114362
 ] 

Josh Rosen commented on SPARK-3266:
---

Thanks for the reproduction!  I tried it myself and see the same issue.

If I replace 

{code}
JavaDoubleRDD javaDoubleRDD = sc.parallelizeDoubles(numbers);
{code}

with 

{code}
JavaRDDLike javaDoubleRDD = sc.parallelizeDoubles(numbers);
{code}

then it seems to work.  I'll take a closer look using {{javap}} to see if I can 
figure out why this is happening.

> JavaDoubleRDD doesn't contain max()
> ---
>
> Key: SPARK-3266
> URL: https://issues.apache.org/jira/browse/SPARK-3266
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.0.1, 1.0.2
>Reporter: Amey Chaugule
>Assignee: Josh Rosen
> Attachments: spark-repro-3266.tar.gz
>
>
> While I can compile my code, I see:
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
> When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I 
> don't notice max()
> although it is clearly listed in the documentation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-3266:
--

Affects Version/s: 1.0.2

> JavaDoubleRDD doesn't contain max()
> ---
>
> Key: SPARK-3266
> URL: https://issues.apache.org/jira/browse/SPARK-3266
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.0.1, 1.0.2
>Reporter: Amey Chaugule
>Assignee: Josh Rosen
> Attachments: spark-repro-3266.tar.gz
>
>
> While I can compile my code, I see:
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
> When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I 
> don't notice max()
> although it is clearly listed in the documentation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3285) Using values.sum is easier to understand than using values.foldLeft(0)(_ + _)

2014-08-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3285.


   Resolution: Fixed
Fix Version/s: 1.2.0

> Using values.sum is easier to understand than using values.foldLeft(0)(_ + _)
> -
>
> Key: SPARK-3285
> URL: https://issues.apache.org/jira/browse/SPARK-3285
> Project: Spark
>  Issue Type: Test
>  Components: Examples
>Affects Versions: 1.0.2
>Reporter: Yadong Qi
> Fix For: 1.2.0
>
>
> def sum[B >: A](implicit num: Numeric[B]): B = foldLeft(num.zero)(num.plus)
> Using values.sum is easier to understand than using values.foldLeft(0)(_ + _), 
> so we'd better use values.sum instead of values.foldLeft(0)(_ + _)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3281) Remove Netty specific code in BlockManager

2014-08-28 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3281.


   Resolution: Fixed
Fix Version/s: 1.2.0

> Remove Netty specific code in BlockManager
> --
>
> Key: SPARK-3281
> URL: https://issues.apache.org/jira/browse/SPARK-3281
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.2.0
>
>
> Everything should go through the BlockTransferService interface rather than 
> having conditional branches for Netty.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen reassigned SPARK-3266:
-

Assignee: Josh Rosen

> JavaDoubleRDD doesn't contain max()
> ---
>
> Key: SPARK-3266
> URL: https://issues.apache.org/jira/browse/SPARK-3266
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.0.1
>Reporter: Amey Chaugule
>Assignee: Josh Rosen
> Attachments: spark-repro-3266.tar.gz
>
>
> While I can compile my code, I see:
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
> When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I 
> don't notice max()
> although it is clearly listed in the documentation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3266) JavaDoubleRDD doesn't contain max()

2014-08-28 Thread Colin B. (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Colin B. updated SPARK-3266:


Attachment: spark-repro-3266.tar.gz

I have attached a simple java project which reproduces the issue. 
[^spark-repro-3266.tar.gz]

{code}
> tar xvzf spark-repro-3266.tar.gz
...
> cd spark-repro-3266
> mvn clean package
> /path/to/spark-1.0.2-bin-hadoop2/bin/spark-submit --class SimpleApp 
> target/testcase-4-1.0.jar
...
Exception in thread "main" java.lang.NoSuchMethodError: 
org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
at SimpleApp.main(SimpleApp.java:17)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

> JavaDoubleRDD doesn't contain max()
> ---
>
> Key: SPARK-3266
> URL: https://issues.apache.org/jira/browse/SPARK-3266
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 1.0.1
>Reporter: Amey Chaugule
> Attachments: spark-repro-3266.tar.gz
>
>
> While I can compile my code, I see:
> Caused by: java.lang.NoSuchMethodError: 
> org.apache.spark.api.java.JavaDoubleRDD.max(Ljava/util/Comparator;)Ljava/lang/Double;
> When I try to execute my Spark code. Stepping into the JavaDoubleRDD class, I 
> don't notice max()
> although it is clearly listed in the documentation.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3288) All fields in TaskMetrics should be private and use getters/setters

2014-08-28 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-3288:
--

 Summary: All fields in TaskMetrics should be private and use 
getters/setters
 Key: SPARK-3288
 URL: https://issues.apache.org/jira/browse/SPARK-3288
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Andrew Or


This is particularly bad because we expose this as a developer API. Technically 
a library could create a TaskMetrics object and then change the values inside 
of it and pass it onto someone else. It can be written pretty compactly like 
below:

{code}
  /**
   * Number of bytes written for the shuffle by this task
   */
  @volatile private var _shuffleBytesWritten: Long = _
  def incrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten += value
  def decrementShuffleBytesWritten(value: Long) = _shuffleBytesWritten -= value
  def shuffleBytesWritten = _shuffleBytesWritten
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3254) Streaming K-Means

2014-08-28 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114299#comment-14114299
 ] 

Jeremy Freeman edited comment on SPARK-3254 at 8/28/14 8:38 PM:


Here is a (public) google doc explaining a current implementation, including 
discussion of the update rule, and choices for parameterizing the decay factor.

https://docs.google.com/document/d/1_EWeN4BkGhYbz7-agYPqHRTJm3vAzc1APsHkO7l65KE/edit?usp=sharing

In-progress code can be viewed here:

https://github.com/freeman-lab/spark/blob/streaming-kmeans/mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala


was (Author: freeman-lab):
Here is a (public) google doc explaining a current implementation, including 
discussion of the update rule, and choices for parameterizing the decay factor.

https://docs.google.com/document/d/1_EWeN4BkGhYbz7-agYPqHRTJm3vAzc1APsHkO7l65KE/edit?usp=sharing

In-progress code can be viewed here:

https://github.com/freeman-lab/spark/tree/streaming-kmeans

> Streaming K-Means
> -
>
> Key: SPARK-3254
> URL: https://issues.apache.org/jira/browse/SPARK-3254
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, Streaming
>Reporter: Xiangrui Meng
>Assignee: Jeremy Freeman
>
> Streaming K-Means with proper decay settings.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3254) Streaming K-Means

2014-08-28 Thread Jeremy Freeman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114299#comment-14114299
 ] 

Jeremy Freeman commented on SPARK-3254:
---

Here is a (public) google doc explaining a current implementation, including 
discussion of the update rule, and choices for parameterizing the decay factor.

https://docs.google.com/document/d/1_EWeN4BkGhYbz7-agYPqHRTJm3vAzc1APsHkO7l65KE/edit?usp=sharing

In-progress code can be viewed here:

https://github.com/freeman-lab/spark/tree/streaming-kmeans

> Streaming K-Means
> -
>
> Key: SPARK-3254
> URL: https://issues.apache.org/jira/browse/SPARK-3254
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, Streaming
>Reporter: Xiangrui Meng
>Assignee: Jeremy Freeman
>
> Streaming K-Means with proper decay settings.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3287) When ResourceManager High Availability is enabled, ApplicationMaster webUI is not displayed.

2014-08-28 Thread Benoy Antony (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoy Antony updated SPARK-3287:


Description: 
When ResourceManager High Availability is enabled, there will be multiple 
resource managers and each of them could act as a proxy.
AmIpFilter is modified to accept multiple proxy hosts. But Spark 
ApplicationMaster fails to read the ResourceManager IPs properly from the 
configuration.

As a result, AmIpFilter is initialized with an empty set of proxy hosts, and 
any access to the ApplicationMaster WebUI is redirected to the RM port on the 
local host. 


  was:
When ResourceManager High Availability is enabled, there will be multiple 
resource managers and each of them could act as a proxy.
AmIpFilter is modified to accept multiple proxy hosts. But Spark 
ApplicationMaster fails read the ResourceManager IPs properly from the 
configuration.

So AmIpFilter is initialized with an empty set of proxy hosts. So any access to 
the ApplicationMaster WebUI will be redirected to port RM port on the local 
host. 



> When ResourceManager High Availability is enabled, ApplicationMaster webUI is 
> not displayed.
> 
>
> Key: SPARK-3287
> URL: https://issues.apache.org/jira/browse/SPARK-3287
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.0.2
>Reporter: Benoy Antony
> Attachments: SPARK-3287.patch
>
>
> When ResourceManager High Availability is enabled, there will be multiple 
> resource managers and each of them could act as a proxy.
> AmIpFilter is modified to accept multiple proxy hosts. But Spark 
> ApplicationMaster fails to read the ResourceManager IPs properly from the 
> configuration.
> So AmIpFilter is initialized with an empty set of proxy hosts. So any access 
> to the ApplicationMaster WebUI will be redirected to port RM port on the 
> local host. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3287) When ResourceManager High Availability is enabled, ApplicationMaster webUI is not displayed.

2014-08-28 Thread Benoy Antony (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoy Antony updated SPARK-3287:


Attachment: SPARK-3287.patch

In the attached patch, the resource manager list is read using the new API. The 
filter parameters that accept multiple hosts (PROXY_HOSTS instead of 
PROXY_HOST) and the URL bases are used.
Since the parameter values are themselves comma-separated, they are URL-encoded 
so that they don't clash with the separator between filter parameters.

Unit tests still need to be added.

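A small Scala illustration of the encoding step described above. PROXY_HOSTS 
comes from the description; the URI-bases key name and the host values are 
assumptions, not taken from the patch:

{code}
// Illustrative only: each comma-separated value is URL-encoded so the commas
// inside a value do not collide with the separator between filter parameters.
import java.net.URLEncoder

val proxyHosts = Seq("rm1.example.com", "rm2.example.com")
val proxyUriBases = Seq("http://rm1.example.com:8088/proxy/app_1",
                        "http://rm2.example.com:8088/proxy/app_1")

val filterParams = Map(
  "PROXY_HOSTS"     -> URLEncoder.encode(proxyHosts.mkString(","), "UTF-8"),
  "PROXY_URI_BASES" -> URLEncoder.encode(proxyUriBases.mkString(","), "UTF-8"))
{code}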

> When ResourceManager High Availability is enabled, ApplicationMaster webUI is 
> not displayed.
> 
>
> Key: SPARK-3287
> URL: https://issues.apache.org/jira/browse/SPARK-3287
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.0.2
>Reporter: Benoy Antony
> Attachments: SPARK-3287.patch
>
>
> When ResourceManager High Availability is enabled, there will be multiple 
> resource managers and each of them could act as a proxy.
> AmIpFilter is modified to accept multiple proxy hosts. But Spark 
> ApplicationMaster fails read the ResourceManager IPs properly from the 
> configuration.
> So AmIpFilter is initialized with an empty set of proxy hosts. So any access 
> to the ApplicationMaster WebUI will be redirected to port RM port on the 
> local host. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3287) When ResourceManager High Availability is enabled, ApplicationMaster webUI is not displayed.

2014-08-28 Thread Benoy Antony (JIRA)
Benoy Antony created SPARK-3287:
---

 Summary: When ResourceManager High Availability is enabled, 
ApplicationMaster webUI is not displayed.
 Key: SPARK-3287
 URL: https://issues.apache.org/jira/browse/SPARK-3287
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, YARN
Affects Versions: 1.0.2
Reporter: Benoy Antony


When ResourceManager High Availability is enabled, there will be multiple 
resource managers and each of them could act as a proxy.
AmIpFilter is modified to accept multiple proxy hosts. But Spark 
ApplicationMaster fails to read the ResourceManager IPs properly from the 
configuration.

As a result, AmIpFilter is initialized with an empty set of proxy hosts, and 
any access to the ApplicationMaster WebUI is redirected to the RM port on the 
local host. 




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3286) Cannot view ApplicationMaster UI when Yarn’s url scheme is https

2014-08-28 Thread Benoy Antony (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benoy Antony updated SPARK-3286:


Attachment: SPARK-3286.patch

In the attached patch, the URL that the Spark ApplicationMaster registers will 
contain the scheme (http).

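As an illustration of the idea only (assumed variable names, not the code in 
the patch), the registered tracking URL can be given a scheme when it lacks one:

{code}
// Illustrative sketch: make sure the URL handed to the RM carries a scheme,
// so the web app proxy does not fall back to the RM's own HTTP policy.
def withScheme(amWebUiAddress: String): String =
  if (amWebUiAddress.startsWith("http://") || amWebUiAddress.startsWith("https://"))
    amWebUiAddress
  else
    "http://" + amWebUiAddress
{code}
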
> Cannot view ApplicationMaster UI when Yarn’s url scheme is https
> 
>
> Key: SPARK-3286
> URL: https://issues.apache.org/jira/browse/SPARK-3286
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.2
>Reporter: Benoy Antony
> Attachments: SPARK-3286.patch
>
>
> The spark Application Master starts its web UI at http://:port.
> When Spark ApplicationMaster registers its URL with Resource Manager , the 
> URL does not contain URI scheme.
> If the URL scheme is absent, Resource Manager’s web app proxy will use the 
> HTTP Policy of the Resource Manager.(YARN-1553)
> If the HTTP Policy of the Resource Manager is https, then web app proxy  will 
> try to access https://:port.
> This will result in error.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3286) Cannot view ApplicationMaster UI when Yarn’s url scheme is https

2014-08-28 Thread Benoy Antony (JIRA)
Benoy Antony created SPARK-3286:
---

 Summary: Cannot view ApplicationMaster UI when Yarn’s url scheme 
is https
 Key: SPARK-3286
 URL: https://issues.apache.org/jira/browse/SPARK-3286
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.2
Reporter: Benoy Antony


The Spark ApplicationMaster starts its web UI at http://:port.
When the Spark ApplicationMaster registers its URL with the Resource Manager, 
the URL does not contain the URI scheme.
If the URL scheme is absent, the Resource Manager's web app proxy will use the 
HTTP policy of the Resource Manager (YARN-1553).
If the HTTP policy of the Resource Manager is https, then the web app proxy 
will try to access https://:port.
This results in an error.




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114247#comment-14114247
 ] 

Matei Zaharia commented on SPARK-3277:
--

Thanks Mridul -- I think Andrew and Patrick have figured this out.

> LZ4 compression cause the the ExternalSort exception
> 
>
> Key: SPARK-3277
> URL: https://issues.apache.org/jira/browse/SPARK-3277
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: hzw
>Priority: Blocker
> Attachments: test_lz4_bug.patch
>
>
> I tested the LZ4 compression,and it come up with such problem.(with wordcount)
> Also I tested the snappy and LZF,and they were OK.
> At last I set the  "spark.shuffle.spill" as false to avoid such exeception, 
> but once open this "switch", this error would come.
> It seems that if num of the words is few, wordcount will go through,but if it 
> is a complex text ,this problem will show
> Exeception Info as follow:
> java.lang.AssertionError: assertion failed
> at scala.Predef$.assert(Predef.scala:165)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.(ExternalAppendOnlyMap.scala:416)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> at 
> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2475) Check whether #cores > #receivers in local mode

2014-08-28 Thread Chris Fregly (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114204#comment-14114204
 ] 

Chris Fregly commented on SPARK-2475:
-

Another option, for the examples specifically, is to default the number of 
local threads similar to how the Kinesis example does it:

https://github.com/apache/spark/blob/ae58aea2d1435b5bb011e68127e1bcddc2edf5b2/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L104

I get the number of shards in the given Kinesis stream and add 1.  The goal was 
to make this example work out of the box with little friction - even an error 
message can be discouraging.

For the other examples, we could just default to 2.  The advanced user can 
override if they want, though I don't think I support an override in my Kinesis 
example.  Whoops!  :)

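A hedged Scala sketch of that default (illustrative code, not the actual 
example): derive the local master string from the number of receivers plus one, 
so at least one core is left free to process the received batches.

{code}
// Illustrative only: numReceivers + 1 local threads, mirroring the Kinesis example.
import org.apache.spark.SparkConf

def localMaster(numReceivers: Int): String = s"local[${numReceivers + 1}]"

val conf = new SparkConf()
  .setAppName("ExampleStreamingApp")  // assumed app name
  .setMaster(localMaster(numReceivers = 1))
{code}
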
> Check whether #cores > #receivers in local mode
> ---
>
> Key: SPARK-2475
> URL: https://issues.apache.org/jira/browse/SPARK-2475
> Project: Spark
>  Issue Type: Improvement
>  Components: Streaming
>Reporter: Tathagata Das
>
> When the number of slots in local mode is not more than the number of 
> receivers, then the system should throw an error. Otherwise the system just 
> keeps waiting for resources to process the received data.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3272) Calculate prediction for nodes separately from calculating information gain for splits in decision tree

2014-08-28 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114166#comment-14114166
 ] 

Joseph K. Bradley commented on SPARK-3272:
--

With respect to [SPARK-2207], I think this JIRA may or may not be necessary for 
implementing [SPARK-2207], depending on how the code is set up.  For 
[SPARK-2207], I imagined checking the number of instances and the information 
gain when the Node is constructed in the main loop (in the train() method).  If 
there are too few instances or too little information gain, then the Node will 
be set as a leaf.  We could potentially avoid the aggregation for those leaves, 
but I would consider that a separate issue ([SPARK-3158]).

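A rough Scala sketch of that check, with assumed names rather than the actual 
DecisionTree classes: the node is marked as a leaf when either threshold is not 
met, regardless of how the split statistics were aggregated.

{code}
// Hedged sketch only; CandidateNode and the thresholds are assumptions.
case class CandidateNode(numInstances: Long, bestGain: Double)

def shouldBeLeaf(node: CandidateNode,
                 minInstancesPerNode: Long,
                 minInfoGain: Double): Boolean =
  node.numInstances < minInstancesPerNode || node.bestGain < minInfoGain
{code}
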
> Calculate prediction for nodes separately from calculating information gain 
> for splits in decision tree
> ---
>
> Key: SPARK-3272
> URL: https://issues.apache.org/jira/browse/SPARK-3272
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Qiping Li
> Fix For: 1.1.0
>
>
> In current implementation, prediction for a node is calculated along with 
> calculation of information gain stats for each possible splits. The value to 
> predict for a specific node is determined, no matter what the splits are.
> To save computation, we can first calculate prediction first and then 
> calculate information gain stats for each split.
> This is also necessary if we want to support minimum instances per node 
> parameters([SPARK-2207|https://issues.apache.org/jira/browse/SPARK-2207]) 
> because when all splits don't satisfy minimum instances requirement , we 
> don't use information gain of any splits. There should be a way to get the 
> prediction value.  



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-28 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114113#comment-14114113
 ] 

Ted Yu commented on SPARK-1297:
---

Here is a sample command for building against 0.98 HBase:

{code}
mvn -Dhbase.profile=hadoop2 -Phadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests clean package
{code}

> Upgrade HBase dependency to 0.98.0
> --
>
> Key: SPARK-1297
> URL: https://issues.apache.org/jira/browse/SPARK-1297
> Project: Spark
>  Issue Type: Task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Minor
> Attachments: pom.xml, spark-1297-v2.txt, spark-1297-v4.txt, 
> spark-1297-v5.txt
>
>
> HBase 0.94.6 was released 11 months ago.
> Upgrade HBase dependency to 0.98.0



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation

2014-08-28 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114092#comment-14114092
 ] 

Josh Rosen commented on SPARK-3280:
---

Here are some numbers from August 10.  If I recall, this was running on 8 
m3.8xlarge nodes.  This test linearly scales a bunch of parameters (data set 
size, numbers of mappers and reducers, etc).  You can see that hash-based 
shuffle's performance degrades severely in cases where we have many mappers and 
reducers, while sort scales much more gracefully:

!http://i.imgur.com/rODzaG1.png!

!http://i.imgur.com/72kCkH5.png!

This was run with spark-perf; here's a sample config for one of the bars:

{code}
Java options: -Dspark.storage.memoryFraction=0.66 
-Dspark.serializer=org.apache.spark.serializer.JavaSerializer 
-Dspark.locality.wait=6000 
-Dspark.shuffle.manager=org.apache.spark.shuffle.hash.HashShuffleManager
Options: aggregate-by-key-naive --num-trials=10 --inter-trial-wait=3 
--num-partitions=400 --reduce-tasks=400 --random-seed=5 
--persistent-type=memory  --num-records=2 --unique-keys=2 
--key-length=10 --unique-values=100 --value-length=10  
--storage-location=hdfs://:9000/spark-perf-kv-data
{code}

I'll try to run a better set of tests today.  I plan to look at a few cases 
that these tests didn't address, including the performance impact when running 
on spinning disks, as well as jobs where we have a large dataset with few 
mappers and reducers (I think this is the case that we'd expect to be most 
favorable to hash-based shuffle).
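
For reference, a hedged sketch of the two configurations being compared; the 
fully qualified class names are the ones used in the spark-perf options above 
(depending on the release, the short names "hash" and "sort" may also be 
accepted).

{code}
import org.apache.spark.SparkConf

val hashConf = new SparkConf()
  .set("spark.shuffle.manager", "org.apache.spark.shuffle.hash.HashShuffleManager")

val sortConf = new SparkConf()
  .set("spark.shuffle.manager", "org.apache.spark.shuffle.sort.SortShuffleManager")
{code}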

> Made sort-based shuffle the default implementation
> --
>
> Key: SPARK-3280
> URL: https://issues.apache.org/jira/browse/SPARK-3280
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> sort-based shuffle has lower memory usage and seems to outperform hash-based 
> in almost all of our testing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3264) Allow users to set executor Spark home in Mesos

2014-08-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3264:
-

Fix Version/s: 1.1.0

> Allow users to set executor Spark home in Mesos
> ---
>
> Key: SPARK-3264
> URL: https://issues.apache.org/jira/browse/SPARK-3264
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.0.2
>Reporter: Andrew Or
>Assignee: Andrew Or
> Fix For: 1.1.0
>
>
> There is an existing way to do this, through "spark.home". However, this is 
> neither documented nor intuitive. I propose that we add a more specific 
> config "spark.mesos.executor.home" for this purpose, and fallback to the 
> existing settings if this is not set.
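
A minimal sketch of the proposed lookup order (illustrative only, not 
necessarily the code that was merged):

{code}
import org.apache.spark.SparkConf

// Prefer the Mesos-specific setting, fall back to spark.home, then to the
// driver's own SPARK_HOME environment variable.
def executorSparkHome(conf: SparkConf): String =
  conf.getOption("spark.mesos.executor.home")
    .orElse(conf.getOption("spark.home"))
    .orElse(sys.env.get("SPARK_HOME"))
    .getOrElse(throw new IllegalArgumentException(
      "Executor Spark home is not set; please set spark.mesos.executor.home"))
{code}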



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3264) Allow users to set executor Spark home in Mesos

2014-08-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-3264.
--

Resolution: Fixed

> Allow users to set executor Spark home in Mesos
> ---
>
> Key: SPARK-3264
> URL: https://issues.apache.org/jira/browse/SPARK-3264
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.0.2
>Reporter: Andrew Or
>Assignee: Andrew Or
>
> There is an existing way to do this, through "spark.home". However, this is 
> neither documented nor intuitive. I propose that we add a more specific 
> config "spark.mesos.executor.home" for this purpose, and fallback to the 
> existing settings if this is not set.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2608) Mesos doesn't handle spark.executor.extraJavaOptions correctly (among other things)

2014-08-28 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2608?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-2608:
-

Fix Version/s: 1.1.0

> Mesos doesn't handle spark.executor.extraJavaOptions correctly (among other 
> things)
> ---
>
> Key: SPARK-2608
> URL: https://issues.apache.org/jira/browse/SPARK-2608
> Project: Spark
>  Issue Type: Bug
>  Components: Mesos
>Affects Versions: 1.0.0
>Reporter: wangfei
>Priority: Blocker
> Fix For: 1.1.0
>
>
> mesos scheduler backend use spark-class/spark-executor to launch executor 
> backend, this will lead to problems:
> 1 when set spark.executor.extraJavaOptions CoarseMesosSchedulerBackend  will 
> throw error
> 2 spark.executor.extraJavaOptions and spark.executor.extraLibraryPath set in 
> sparkconf will not be valid
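
For context, a hedged sketch of the settings in question (the values are 
illustrative):

{code}
import org.apache.spark.SparkConf

// Settings the Mesos backend is expected to pass through to executor JVMs.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC -Dsample.flag=1")
  .set("spark.executor.extraLibraryPath", "/opt/native/lib")
{code}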



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation

2014-08-28 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114055#comment-14114055
 ] 

Reynold Xin commented on SPARK-3280:


[~joshrosen] [~brkyvz] can you guys post the performance comparisons between 
sort vs hash shuffle in this ticket?

> Made sort-based shuffle the default implementation
> --
>
> Key: SPARK-3280
> URL: https://issues.apache.org/jira/browse/SPARK-3280
> Project: Spark
>  Issue Type: Improvement
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> sort-based shuffle has lower memory usage and seems to outperform hash-based 
> in almost all of our testing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-3277:
--

Fix Version/s: (was: 1.1.0)

> LZ4 compression cause the the ExternalSort exception
> 
>
> Key: SPARK-3277
> URL: https://issues.apache.org/jira/browse/SPARK-3277
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: hzw
>Priority: Blocker
> Attachments: test_lz4_bug.patch
>
>
> I tested the LZ4 compression,and it come up with such problem.(with wordcount)
> Also I tested the snappy and LZF,and they were OK.
> At last I set the  "spark.shuffle.spill" as false to avoid such exeception, 
> but once open this "switch", this error would come.
> It seems that if num of the words is few, wordcount will go through,but if it 
> is a complex text ,this problem will show
> Exeception Info as follow:
> java.lang.AssertionError: assertion failed
> at scala.Predef$.assert(Predef.scala:165)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.(ExternalAppendOnlyMap.scala:416)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> at 
> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
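
A hedged sketch of the configuration combination described in the report, 
together with the workaround mentioned above:

{code}
import org.apache.spark.SparkConf

// The failing combination: LZ4 block compression with shuffle spilling enabled
// (spilling is on by default, and spilled blocks go through the codec).
val failing = new SparkConf()
  .set("spark.io.compression.codec", "org.apache.spark.io.LZ4CompressionCodec")
  .set("spark.shuffle.spill", "true")

// The reporter's workaround: disable spilling so ExternalAppendOnlyMap never
// writes (and re-reads) LZ4-compressed spill files.
val workaround = failing.clone().set("spark.shuffle.spill", "false")
{code}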



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114039#comment-14114039
 ] 

Andrew Or commented on SPARK-3277:
--

This assert was added after 1.0.2. I'm assuming what you mean is that you're 
running a commit from master made after 1.0.2 was released.

> LZ4 compression cause the the ExternalSort exception
> 
>
> Key: SPARK-3277
> URL: https://issues.apache.org/jira/browse/SPARK-3277
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: hzw
>Priority: Blocker
> Fix For: 1.1.0
>
> Attachments: test_lz4_bug.patch
>
>
> I tested the LZ4 compression,and it come up with such problem.(with wordcount)
> Also I tested the snappy and LZF,and they were OK.
> At last I set the  "spark.shuffle.spill" as false to avoid such exeception, 
> but once open this "switch", this error would come.
> It seems that if num of the words is few, wordcount will go through,but if it 
> is a complex text ,this problem will show
> Exeception Info as follow:
> java.lang.AssertionError: assertion failed
> at scala.Predef$.assert(Predef.scala:165)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.(ExternalAppendOnlyMap.scala:416)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> at 
> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3150) NullPointerException in Spark recovery after simultaneous fall of master and driver

2014-08-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-3150.
---

   Resolution: Fixed
Fix Version/s: 1.0.3
   1.1.1

> NullPointerException in Spark recovery after simultaneous fall of master and 
> driver
> ---
>
> Key: SPARK-3150
> URL: https://issues.apache.org/jira/browse/SPARK-3150
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2
> Environment:  Linux 3.2.0-23-generic x86_64
>Reporter: Tatiana Borisova
> Fix For: 1.1.1, 1.0.3
>
>
> The issue happens when Spark is run standalone on a cluster.
> When master and driver fall simultaneously on one node in a cluster, master 
> tries to recover its state and restart spark driver.
> While restarting driver, it falls with NPE exception (stacktrace is below).
> After falling, it restarts and tries to recover its state and restart Spark 
> driver again. It happens over and over in an infinite cycle.
> Namely, Spark tries to read DriverInfo state from zookeeper, but after 
> reading it happens to be null in DriverInfo.worker.
> Stacktrace (on version 1.0.0, but reproduceable on version 1.0.2, too)
> 2014-08-14 21:44:59,519] ERROR  (akka.actor.OneForOneStrategy)
> java.lang.NullPointerException
> at 
> org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448)
> at 
> org.apache.spark.deploy.master.Master$$anonfun$completeRecovery$5.apply(Master.scala:448)
> at 
> scala.collection.TraversableLike$$anonfun$filter$1.apply(TraversableLike.scala:264)
> at scala.collection.mutable.HashSet.foreach(HashSet.scala:79)
> at 
> scala.collection.TraversableLike$class.filter(TraversableLike.scala:263)
> at scala.collection.AbstractTraversable.filter(Traversable.scala:105)
> at 
> org.apache.spark.deploy.master.Master.completeRecovery(Master.scala:448)
> at 
> org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:376)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> at 
> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> How to reproduce: kill all Spark processes when running Spark standalone on a 
> cluster on some cluster node, where driver runs (kill driver, master and 
> worker simultaneously).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114026#comment-14114026
 ] 

Mridul Muralidharan commented on SPARK-3277:


[~hzw] did you notice this against 1.0.2?
I did not think the changes for consolidated shuffle were backported to that 
branch; [~mateiz] can comment more, though.

> LZ4 compression cause the the ExternalSort exception
> 
>
> Key: SPARK-3277
> URL: https://issues.apache.org/jira/browse/SPARK-3277
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: hzw
>Priority: Blocker
> Fix For: 1.1.0
>
> Attachments: test_lz4_bug.patch
>
>
> I tested the LZ4 compression,and it come up with such problem.(with wordcount)
> Also I tested the snappy and LZF,and they were OK.
> At last I set the  "spark.shuffle.spill" as false to avoid such exeception, 
> but once open this "switch", this error would come.
> It seems that if num of the words is few, wordcount will go through,but if it 
> is a complex text ,this problem will show
> Exeception Info as follow:
> java.lang.AssertionError: assertion failed
> at scala.Predef$.assert(Predef.scala:165)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.(ExternalAppendOnlyMap.scala:416)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> at 
> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114022#comment-14114022
 ] 

Mridul Muralidharan edited comment on SPARK-3277 at 8/28/14 5:37 PM:
-

The attached patch is against master, though I noticed similar changes in 1.1 
as well; I have not yet verified them.


was (Author: mridulm80):
Against master, though I noticed similar changes in 1.1 also : but not yet 
verified.

> LZ4 compression cause the the ExternalSort exception
> 
>
> Key: SPARK-3277
> URL: https://issues.apache.org/jira/browse/SPARK-3277
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: hzw
>Priority: Blocker
> Fix For: 1.1.0
>
> Attachments: test_lz4_bug.patch
>
>
> I tested the LZ4 compression,and it come up with such problem.(with wordcount)
> Also I tested the snappy and LZF,and they were OK.
> At last I set the  "spark.shuffle.spill" as false to avoid such exeception, 
> but once open this "switch", this error would come.
> It seems that if num of the words is few, wordcount will go through,but if it 
> is a complex text ,this problem will show
> Exeception Info as follow:
> java.lang.AssertionError: assertion failed
> at scala.Predef$.assert(Predef.scala:165)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.(ExternalAppendOnlyMap.scala:416)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> at 
> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Mridul Muralidharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-3277:
---

Attachment: test_lz4_bug.patch

Against master, though I noticed similar changes in 1.1 also : but not yet 
verified.

> LZ4 compression cause the the ExternalSort exception
> 
>
> Key: SPARK-3277
> URL: https://issues.apache.org/jira/browse/SPARK-3277
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: hzw
>Priority: Blocker
> Fix For: 1.1.0
>
> Attachments: test_lz4_bug.patch
>
>
> I tested the LZ4 compression,and it come up with such problem.(with wordcount)
> Also I tested the snappy and LZF,and they were OK.
> At last I set the  "spark.shuffle.spill" as false to avoid such exeception, 
> but once open this "switch", this error would come.
> It seems that if num of the words is few, wordcount will go through,but if it 
> is a complex text ,this problem will show
> Exeception Info as follow:
> java.lang.AssertionError: assertion failed
> at scala.Predef$.assert(Predef.scala:165)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.(ExternalAppendOnlyMap.scala:416)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> at 
> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114014#comment-14114014
 ] 

Mridul Muralidharan commented on SPARK-3277:


[~matei] Attaching a patch which reproduces the bug consistently.
I suspect the issue is more serious than what I detailed above - spill to disk 
seems completely broken if I understood the assertion message correctly.
Unfortunately, this is based on a few minutes of free time I could grab, so a 
more principled debugging session is definitely warranted!



> LZ4 compression cause the the ExternalSort exception
> 
>
> Key: SPARK-3277
> URL: https://issues.apache.org/jira/browse/SPARK-3277
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: hzw
>Priority: Blocker
> Fix For: 1.1.0
>
>
> I tested the LZ4 compression,and it come up with such problem.(with wordcount)
> Also I tested the snappy and LZF,and they were OK.
> At last I set the  "spark.shuffle.spill" as false to avoid such exeception, 
> but once open this "switch", this error would come.
> It seems that if num of the words is few, wordcount will go through,but if it 
> is a complex text ,this problem will show
> Exeception Info as follow:
> java.lang.AssertionError: assertion failed
> at scala.Predef$.assert(Predef.scala:165)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.(ExternalAppendOnlyMap.scala:416)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> at 
> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Mridul Muralidharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-3277:
---

Priority: Blocker  (was: Major)

> LZ4 compression cause the the ExternalSort exception
> 
>
> Key: SPARK-3277
> URL: https://issues.apache.org/jira/browse/SPARK-3277
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: hzw
>Priority: Blocker
> Fix For: 1.1.0
>
>
> I tested the LZ4 compression,and it come up with such problem.(with wordcount)
> Also I tested the snappy and LZF,and they were OK.
> At last I set the  "spark.shuffle.spill" as false to avoid such exeception, 
> but once open this "switch", this error would come.
> It seems that if num of the words is few, wordcount will go through,but if it 
> is a complex text ,this problem will show
> Exeception Info as follow:
> java.lang.AssertionError: assertion failed
> at scala.Predef$.assert(Predef.scala:165)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.(ExternalAppendOnlyMap.scala:416)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> at 
> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread Mridul Muralidharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-3277:
---

Affects Version/s: 1.2.0
   1.1.0

> LZ4 compression cause the the ExternalSort exception
> 
>
> Key: SPARK-3277
> URL: https://issues.apache.org/jira/browse/SPARK-3277
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: hzw
>Priority: Blocker
> Fix For: 1.1.0
>
>
> I tested the LZ4 compression,and it come up with such problem.(with wordcount)
> Also I tested the snappy and LZF,and they were OK.
> At last I set the  "spark.shuffle.spill" as false to avoid such exeception, 
> but once open this "switch", this error would come.
> It seems that if num of the words is few, wordcount will go through,but if it 
> is a complex text ,this problem will show
> Exeception Info as follow:
> java.lang.AssertionError: assertion failed
> at scala.Predef$.assert(Predef.scala:165)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.(ExternalAppendOnlyMap.scala:416)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> at 
> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-28 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113977#comment-14113977
 ] 

Ted Yu commented on SPARK-1297:
---

Patch v5 is the aggregate of the 4 commits in the pull request.

> Upgrade HBase dependency to 0.98.0
> --
>
> Key: SPARK-1297
> URL: https://issues.apache.org/jira/browse/SPARK-1297
> Project: Spark
>  Issue Type: Task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Minor
> Attachments: pom.xml, spark-1297-v2.txt, spark-1297-v4.txt, 
> spark-1297-v5.txt
>
>
> HBase 0.94.6 was released 11 months ago.
> Upgrade HBase dependency to 0.98.0



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1297) Upgrade HBase dependency to 0.98.0

2014-08-28 Thread Ted Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-1297:
--

Attachment: spark-1297-v5.txt

> Upgrade HBase dependency to 0.98.0
> --
>
> Key: SPARK-1297
> URL: https://issues.apache.org/jira/browse/SPARK-1297
> Project: Spark
>  Issue Type: Task
>Reporter: Ted Yu
>Assignee: Ted Yu
>Priority: Minor
> Attachments: pom.xml, spark-1297-v2.txt, spark-1297-v4.txt, 
> spark-1297-v5.txt
>
>
> HBase 0.94.6 was released 11 months ago.
> Upgrade HBase dependency to 0.98.0



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2855) pyspark test cases crashed for no reason

2014-08-28 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2855.
---

Resolution: Fixed

That issue should be fixed now, so I'm going to mark this JIRA as resolved.  
Feel free to re-open (or open a new issue) if you notice flaky PySpark tests.

> pyspark test cases crashed for no reason
> 
>
> Key: SPARK-2855
> URL: https://issues.apache.org/jira/browse/SPARK-2855
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
>Reporter: Nan Zhu
>
> I met this for several times, 
> all scala/java test cases passed, but pyspark test cases just crashed
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17875/consoleFull



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2855) pyspark test cases crashed for no reason

2014-08-28 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113965#comment-14113965
 ] 

Nan Zhu commented on SPARK-2855:


No. See

https://github.com/apache/spark/pull/1313

and search for "This particular failure was my fault,".

> pyspark test cases crashed for no reason
> 
>
> Key: SPARK-2855
> URL: https://issues.apache.org/jira/browse/SPARK-2855
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
>Reporter: Nan Zhu
>
> I met this for several times, 
> all scala/java test cases passed, but pyspark test cases just crashed
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17875/consoleFull



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2855) pyspark test cases crashed for no reason

2014-08-28 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113957#comment-14113957
 ] 

Josh Rosen commented on SPARK-2855:
---

Do you recall the actual exception?  Was it a Py4J error (something like 
"connection to GatewayServer failed")?  It seems like we've been experiencing 
some flakiness in these tests and I wonder whether it's due to some system 
resource being exhausted, such as ephemeral ports.

> pyspark test cases crashed for no reason
> 
>
> Key: SPARK-2855
> URL: https://issues.apache.org/jira/browse/SPARK-2855
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
>Reporter: Nan Zhu
>
> I met this for several times, 
> all scala/java test cases passed, but pyspark test cases just crashed
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17875/consoleFull



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2435) Add shutdown hook to bin/pyspark

2014-08-28 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113911#comment-14113911
 ] 

Apache Spark commented on SPARK-2435:
-

User 'mattf' has created a pull request for this issue:
https://github.com/apache/spark/pull/2183

> Add shutdown hook to bin/pyspark
> 
>
> Key: SPARK-2435
> URL: https://issues.apache.org/jira/browse/SPARK-2435
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.1
>Reporter: Andrew Or
>Assignee: Josh Rosen
> Fix For: 1.1.0
>
>
> We currently never stop the SparkContext cleanly in bin/pyspark unless the 
> user explicitly runs sc.stop(). This behavior is not consistent with 
> bin/spark-shell, in which case Ctrl+D stops the SparkContext before quitting 
> the shell.
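
A minimal Scala sketch of the idea (bin/pyspark itself is Python, so this only 
illustrates the behavior being asked for): register a shutdown hook so the 
context is stopped even when the user never calls sc.stop().

{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("shell").setMaster("local[2]"))

// Mirrors what Ctrl+D does in bin/spark-shell: stop the context on JVM exit.
sys.addShutdownHook {
  sc.stop()
}
{code}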



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2435) Add shutdown hook to bin/pyspark

2014-08-28 Thread Matthew Farrellee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113816#comment-14113816
 ] 

Matthew Farrellee commented on SPARK-2435:
--

I couldn't find a PR for this, and it has been a problem for me, so I've created

https://github.com/apache/spark/pull/2183

> Add shutdown hook to bin/pyspark
> 
>
> Key: SPARK-2435
> URL: https://issues.apache.org/jira/browse/SPARK-2435
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.1
>Reporter: Andrew Or
>Assignee: Josh Rosen
> Fix For: 1.1.0
>
>
> We currently never stop the SparkContext cleanly in bin/pyspark unless the 
> user explicitly runs sc.stop(). This behavior is not consistent with 
> bin/spark-shell, in which case Ctrl+D stops the SparkContext before quitting 
> the shell.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread hzw (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113789#comment-14113789
 ] 

hzw commented on SPARK-3277:


Sorry, I cannot understand it clearly since I'm not familiar with the code of 
this class.
Could you point to the line of code where it goes wrong, or open a PR to fix 
this problem?

> LZ4 compression cause the the ExternalSort exception
> 
>
> Key: SPARK-3277
> URL: https://issues.apache.org/jira/browse/SPARK-3277
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: hzw
> Fix For: 1.1.0
>
>
> I tested the LZ4 compression,and it come up with such problem.(with wordcount)
> Also I tested the snappy and LZF,and they were OK.
> At last I set the  "spark.shuffle.spill" as false to avoid such exeception, 
> but once open this "switch", this error would come.
> It seems that if num of the words is few, wordcount will go through,but if it 
> is a complex text ,this problem will show
> Exeception Info as follow:
> java.lang.AssertionError: assertion failed
> at scala.Predef$.assert(Predef.scala:165)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.(ExternalAppendOnlyMap.scala:416)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> at 
> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3277) LZ4 compression cause the the ExternalSort exception

2014-08-28 Thread hzw (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hzw updated SPARK-3277:
---

Description: 
I tested LZ4 compression and it ran into this problem (with wordcount).
I also tested Snappy and LZF, and they were OK.
I set "spark.shuffle.spill" to false to avoid the exception, but as soon as 
that "switch" is turned back on, the error comes back.
It seems that if the number of words is small, wordcount goes through, but 
with a more complex text the problem shows up.
Exception info as follows:
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:165)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.(ExternalAppendOnlyMap.scala:416)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
at 
org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)


  was:
I tested the LZ4 compression,and it come up with such problem.(with wordcount)
Also I tested the snappy and LZF,and they were OK.
At last I set the  "spark.shuffle.spill" as false to avoid such exeception, but 
once open this "switch", this error would come.
Exeception Info as follow:
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:165)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.(ExternalAppendOnlyMap.scala:416)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
at 
org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
at 
org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)



> LZ4 compression cause the the ExternalSort exception
> 
>
> Key: SPARK-3277
> URL: https://issues.apache.org/jira/browse/SPARK-3277
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: hzw
> Fix For: 1.1.0
>
>
> I tested the LZ4 compression,and it come up with such problem.(with wordcount)
> Also I tested the snappy and LZF,and they were OK.
> At last I set the  "spark.shuffle.spill" as false to avoid such exeception, 
> but once open this "switch", this error would come.
> It seems that if num of the words is few, wordcount will go through,but if it 
> is a complex text ,this problem will show
> Exeception Info as follow:
> java.lang.AssertionError: assertion failed
> at scala.Predef$.assert(Predef.scala:165)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap$DiskMapIterator.(ExternalAppendOnlyMap.scala:416)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.spill(ExternalAppendOnlyMap.scala:235)
> at 
> org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:150)
> at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
> at 
> org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:55)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at 
> java.util.conc

[jira] [Commented] (SPARK-2855) pyspark test cases crashed for no reason

2014-08-28 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14113776#comment-14113776
 ] 

Nan Zhu commented on SPARK-2855:


I guess they have fixed this. A Jenkins-side mistake?

> pyspark test cases crashed for no reason
> 
>
> Key: SPARK-2855
> URL: https://issues.apache.org/jira/browse/SPARK-2855
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
>Reporter: Nan Zhu
>
> I met this for several times, 
> all scala/java test cases passed, but pyspark test cases just crashed
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17875/consoleFull



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


