[jira] [Updated] (SPARK-2456) Scheduler refactoring

2014-07-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2456:
---

Description: 
This is an umbrella ticket to track scheduler refactoring. We want to clearly 
define semantics and responsibilities of each component, and define explicit 
public interfaces for them so it is easier to understand and to contribute 
(also less buggy).



  was:
This is an umbrella ticket to track scheduler refactoring. We want to clearly 
define semantics and responsibilities of each component, and define explicit 
public interfaces for them so it is easier to understand and to contribute 
(also less buggy).

[~kayousterhout]



 Scheduler refactoring
 -

 Key: SPARK-2456
 URL: https://issues.apache.org/jira/browse/SPARK-2456
 Project: Spark
  Issue Type: Improvement
Reporter: Reynold Xin
Assignee: Reynold Xin

 This is an umbrella ticket to track scheduler refactoring. We want to clearly 
 define semantics and responsibilities of each component, and define explicit 
 public interfaces for them so it is easier to understand and to contribute 
 (also less buggy).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2456) Scheduler refactoring

2014-07-24 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072860#comment-14072860
 ] 

Reynold Xin commented on SPARK-2456:


One related PR: https://github.com/apache/spark/pull/1561

 Scheduler refactoring
 -

 Key: SPARK-2456
 URL: https://issues.apache.org/jira/browse/SPARK-2456
 Project: Spark
  Issue Type: Improvement
Reporter: Reynold Xin
Assignee: Reynold Xin

 This is an umbrella ticket to track scheduler refactoring. We want to clearly 
 define semantics and responsibilities of each component, and define explicit 
 public interfaces for them so it is easier to understand and to contribute 
 (also less buggy).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2310) Support arbitrary options on the command line with spark-submit

2014-07-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2310:
---

Assignee: Sandy Ryza

 Support arbitrary options on the command line with spark-submit
 ---

 Key: SPARK-2310
 URL: https://issues.apache.org/jira/browse/SPARK-2310
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.0.0
Reporter: Sandy Ryza
Assignee: Sandy Ryza
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2310) Support arbitrary options on the command line with spark-submit

2014-07-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2310.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1253
[https://github.com/apache/spark/pull/1253]

 Support arbitrary options on the command line with spark-submit
 ---

 Key: SPARK-2310
 URL: https://issues.apache.org/jira/browse/SPARK-2310
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.0.0
Reporter: Sandy Ryza
 Fix For: 1.1.0






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2664) Deal with `--conf` options in spark-submit that relate to flags

2014-07-24 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-2664:
--

 Summary: Deal with `--conf` options in spark-submit that relate to 
flags
 Key: SPARK-2664
 URL: https://issues.apache.org/jira/browse/SPARK-2664
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
Assignee: Sandy Ryza
Priority: Blocker


If someone sets a spark conf that relates to an existing flag `--master`, we 
should set it correctly like we do with the defaults file. Otherwise it can 
have confusing semantics. I noticed this after merging it, otherwise I would 
have mentioned it in the review.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2664) Deal with `--conf` options in spark-submit that relate to flags

2014-07-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2664:
---

Description: 
If someone sets a spark conf that relates to an existing flag `--master`, we 
should set it correctly like we do with the defaults file. Otherwise it can 
have confusing semantics. I noticed this after merging it, otherwise I would 
have mentioned it in the review.

I think it's as simple as modifying loadDefaults to check the user-supplied 
options also. We might change it to loadUserProperties since it's no longer 
just the defaults file.

  was:If someone sets a spark conf that relates to an existing flag `--master`, 
we should set it correctly like we do with the defaults file. Otherwise it can 
have confusing semantics. I noticed this after merging it, otherwise I would 
have mentioned it in the review.


 Deal with `--conf` options in spark-submit that relate to flags
 ---

 Key: SPARK-2664
 URL: https://issues.apache.org/jira/browse/SPARK-2664
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
Assignee: Sandy Ryza
Priority: Blocker

 If someone sets a spark conf that relates to an existing flag `--master`, we 
 should set it correctly like we do with the defaults file. Otherwise it can 
 have confusing semantics. I noticed this after merging it, otherwise I would 
 have mentioned it in the review.
 I think it's as simple as modifying loadDefaults to check the user-supplied 
 options also. We might change it to loadUserProperties since it's no longer 
 just the defaults file.
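
A minimal sketch of the direction suggested above, assuming a hypothetical helper rather than Spark's actual SparkSubmitArguments code: properties from the defaults file, from --conf, and from dedicated flags are merged into one map, with flag values taking precedence.

{code}
// Illustrative sketch only; the object and method names are hypothetical and do not
// reflect spark-submit's real implementation.
object ConfMergeSketch {
  // Precedence (lowest to highest): defaults file, --conf values, dedicated flags.
  def mergeProperties(
      defaultsFile: Map[String, String],
      userConf: Map[String, String],
      flagValues: Map[String, String]): Map[String, String] = {
    defaultsFile ++ userConf ++ flagValues  // later maps win on key collisions
  }

  def main(args: Array[String]): Unit = {
    val merged = mergeProperties(
      defaultsFile = Map("spark.eventLog.enabled" -> "false"),
      userConf = Map("spark.master" -> "local[2]", "spark.eventLog.enabled" -> "true"),
      flagValues = Map("spark.master" -> "yarn-cluster"))  // e.g. from --master
    println(merged)  // spark.master -> yarn-cluster, spark.eventLog.enabled -> true
  }
}
{code}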



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2652) Tuning default configurations for PySpark

2014-07-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072909#comment-14072909
 ] 

Apache Spark commented on SPARK-2652:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/1568

 Tuning default configurations for PySpark
 --

 Key: SPARK-2652
 URL: https://issues.apache.org/jira/browse/SPARK-2652
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.0.0
Reporter: Davies Liu
Assignee: Davies Liu
  Labels: Configuration, Python
 Fix For: 1.1.0

   Original Estimate: 48h
  Remaining Estimate: 48h

 Some configuration defaults do not make sense for PySpark; change 
 them to reasonable values, such as spark.serializer and 
 spark.kryo.referenceTracking.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2661) Unpersist last RDD in bagel iteration

2014-07-24 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-2661.
--

Resolution: Fixed

 Unpersist last RDD in bagel iteration
 -

 Key: SPARK-2661
 URL: https://issues.apache.org/jira/browse/SPARK-2661
 Project: Spark
  Issue Type: Improvement
Reporter: Adrian Wang
Assignee: Adrian Wang
 Fix For: 1.1.0


 In a Bagel iteration, we only depend on RDD[n] to compute RDD[n+1], so we can 
 unpersist RDD[n-1] once we have RDD[n]. 
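
A minimal sketch of the pattern being described, in plain RDD code rather than Bagel's actual implementation: cache each iteration's RDD, force it, and unpersist its predecessor once the new RDD is materialized.

{code}
// Sketch of the unpersist-previous-iteration pattern; not Bagel's real code.
import org.apache.spark.rdd.RDD

def iterate[T](initial: RDD[T], step: RDD[T] => RDD[T], numIters: Int): RDD[T] = {
  var curr = initial.cache()
  for (i <- 1 to numIters) {
    val next = step(curr).cache()
    next.count()      // RDD[n] is materialized here...
    curr.unpersist()  // ...so RDD[n-1] is no longer needed
    curr = next
  }
  curr
}
{code}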



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2664) Deal with `--conf` options in spark-submit that relate to flags

2014-07-24 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072925#comment-14072925
 ] 

Sandy Ryza commented on SPARK-2664:
---

I think the right behavior here is worth a little thought.  What's the mental 
model we expect the user to have about the relationship between properties 
specified through --conf and properties that get their own flag?  My first 
thought is - if we're ok with taking properties like master through --conf, is 
there a point (beyond compatibility) in having flags for these properties at 
all?

Flags that aren't conf are there because they impact what happens before the 
SparkContext is created.  These fall into a couple categories:
1. Flags that have no property Spark conf equivalent like --executor-cores 
2. Flags that have a direct Spark conf equivalent like --executor-memory 
(spark.executor.memory)
3. Flags that impact a Spark conf like --deploy-mode (which can mean we set 
spark.master to yarn-cluster)

I think the two ways to look at it are:
1. We're OK with taking properties that have related flags.  In the case of a 
property in the 2nd category, we have a policy over which takes precedence.  In 
the case of a property in the 3rd category, we have some (possibly complex) 
resolution logic.  This approach would be the most accepting, but requires the 
user to have a model of how these conflicts get resolved.
2. We're not OK with taking properties that have related flags.  --conf 
specifies property that gets passed to the SparkContext and has no effect on 
anything that happens before it's created. To save users from themselves, if 
someone passes spark.master or spark.app.name through --conf, we ignore it or 
throw an error.

I'm a little more partial to approach 2 because I think the mental model is a 
little simpler.

Either way, we should probably enforce the same behavior when a config comes 
from the defaults file.

Lastly, how do we allow setting a default for one of these special flags?  E.g. 
make it so that all jobs run on YARN or Mesos by default.  With approach 1, 
this is relatively straightforward - we use the same logic we'd use on a 
property that comes in through --conf for making defaults take effect.  We 
might need to add spark properties for flags that don't have them already like 
--executor-cores.  With approach 2, we'd need to add support in the defaults 
file or somewhere else for specifying flag defaults.
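
A self-contained sketch of what approach 2 could look like, with hypothetical names; this is not spark-submit's actual validation logic.

{code}
// Illustrative only: reject --conf keys that already have a dedicated flag.
object ConfValidationSketch {
  // Hypothetical set of properties that spark-submit exposes as flags.
  val flagBackedKeys = Set("spark.master", "spark.app.name")

  def validate(userConf: Map[String, String]): Unit = {
    val shadowed = userConf.keySet.intersect(flagBackedKeys)
    require(shadowed.isEmpty,
      s"Use the dedicated flags instead of --conf for: ${shadowed.mkString(", ")}")
  }

  def main(args: Array[String]): Unit = {
    validate(Map("spark.eventLog.enabled" -> "true"))  // fine
    validate(Map("spark.master" -> "local[2]"))        // throws IllegalArgumentException
  }
}
{code}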

 Deal with `--conf` options in spark-submit that relate to flags
 ---

 Key: SPARK-2664
 URL: https://issues.apache.org/jira/browse/SPARK-2664
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
Assignee: Sandy Ryza
Priority: Blocker

 If someone sets a spark conf that relates to an existing flag `--master`, we 
 should set it correctly like we do with the defaults file. Otherwise it can 
 have confusing semantics. I noticed this after merging it, otherwise I would 
 have mentioned it in the review.
 I think it's as simple as modifying loadDefaults to check the user-supplied 
 options also. We might change it to loadUserProperties since it's no longer 
 just the defaults file.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2664) Deal with `--conf` options in spark-submit that relate to flags

2014-07-24 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14072925#comment-14072925
 ] 

Sandy Ryza edited comment on SPARK-2664 at 7/24/14 7:18 AM:


I think the right behavior here is worth a little thought.  What's the mental 
model we expect the user to have about the relationship between properties 
specified through --conf and properties that get their own flag?  My first 
thought is - if we're ok with taking properties like master through --conf, is 
there a point (beyond compatibility) in having flags for these properties at 
all?

Flags that aren't Spark confs are there because they impact what happens before 
the SparkContext is created.  These fall into a couple categories:
1. Flags that have no property Spark conf equivalent like --executor-cores 
2. Flags that have a direct Spark conf equivalent like --executor-memory 
(spark.executor.memory)
3. Flags that impact a Spark conf like --deploy-mode (which can mean we set 
spark.master to yarn-cluster)

I think the two ways to look at it are:
1. We're OK with taking properties that have related flags.  In the case of a 
property in the 2nd category, we have a policy over which takes precedence.  In 
the case of a property in the 3rd category, we have some (possibly complex) 
resolution logic.  This approach would be the most accepting, but requires the 
user to have a model of how these conflicts get resolved.
2. We're not OK with taking properties that have related flags.  --conf 
specifies property that gets passed to the SparkContext and has no effect on 
anything that happens before it's created. To save users from themselves, if 
someone passes spark.master or spark.app.name through --conf, we ignore it or 
throw an error.

I'm a little more partial to approach 2 because I think the mental model is a 
little simpler.

Either way, we should probably enforce the same behavior when a config comes 
from the defaults file.

Lastly, how do we allow setting a default for one of these special flags?  E.g. 
make it so that all jobs run on YARN or Mesos by default.  With approach 1, 
this is relatively straightforward - we use the same logic we'd use on a 
property that comes in through --conf for making defaults take effect.  We 
might need to add spark properties for flags that don't have them already like 
--executor-cores.  With approach 2, we'd need to add support in the defaults 
file or somewhere else for specifying flag defaults.


was (Author: sandyr):
I think the right behavior here is worth a little thought.  What's the mental 
model we expect the user to have about the relationship between properties 
specified through --conf and properties that get their own flag?  My first 
thought is - if we're ok with taking properties like master through --conf, is 
there a point (beyond compatibility) in having flags for these properties at 
all?

Flags that aren't conf are there because they impact what happens before the 
SparkContext is created.  These fall into a couple categories:
1. Flags that have no property Spark conf equivalent like --executor-cores 
2. Flags that have a direct Spark conf equivalent like --executor-cores 
(spark.executor.memory)
3. Flags that impact a Spark conf like --deploy-mode (which can mean we set 
spark.master to yarn-cluster)

I think the two ways to look at it are:
1. We're OK with taking properties that have related flags.  In the case of a 
property in the 2nd category, we have a policy over which takes precedence.  In 
the case of a property in the 3rd category, we have some (possibly complex) 
resolution logic.  This approach would be the most accepting, but requires the 
user to have a model of how these conflicts get resolved.
2. We're not OK with taking properties that have related flags.  --conf 
specifies property that gets passed to the SparkContext and has no effect on 
anything that happens before it's created. To save users from themselves, if 
someone passes spark.master or spark.app.name through --conf, we ignore it or 
throw an error.

I'm a little more partial to approach 2 because I think the mental model is a 
little simpler.

Either way, we should probably enforce the same behavior when a config comes 
from the defaults file.

Lastly, how do we allow setting a default for one of these special flags?  E.g. 
make it so that all jobs run on YARN or Mesos by default.  With approach 1, 
this is relatively straightforward - we use the same logic we'd use on a 
property that comes in through --conf for making defaults take effect.  We 
might need to add spark properties for flags that don't have them already like 
--executor-cores.  With approach 2, we'd need to add support in the defaults 
file or somewhere else for specifying flag defaults.

 Deal with `--conf` options in spark-submit that relate to flags
 

[jira] [Created] (SPARK-2665) Add EqualNS support for HiveQL

2014-07-24 Thread Cheng Hao (JIRA)
Cheng Hao created SPARK-2665:


 Summary: Add EqualNS support for HiveQL
 Key: SPARK-2665
 URL: https://issues.apache.org/jira/browse/SPARK-2665
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Cheng Hao


Hive supports the operator <=>, which returns the same result as the EQUAL (=) 
operator for non-null operands, but returns TRUE if both operands are NULL and 
FALSE if only one of them is NULL.
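
For illustration, here is how the null-safe operator behaves when issued through Spark SQL's HiveContext (hypothetical table and column names; a sketch, not the patch itself):

{code}
// Assumes a Hive table t with nullable columns a and b.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// '=' evaluates to NULL when either operand is NULL; '<=>' evaluates to TRUE when
// both operands are NULL and to FALSE when exactly one of them is NULL.
hiveContext.hql("SELECT a, b, a = b, a <=> b FROM t").collect().foreach(println)
{code}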



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2414) Remove jquery

2014-07-24 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2414:
---

Assignee: (was: Reynold Xin)

 Remove jquery
 -

 Key: SPARK-2414
 URL: https://issues.apache.org/jira/browse/SPARK-2414
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Reynold Xin
Priority: Minor
  Labels: starter

 SPARK-2384 introduces jquery for tooltip display. We can probably just create 
 a very simple javascript for tooltip instead of pulling in jquery. 
 https://github.com/apache/spark/pull/1314



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-07-24 Thread lukovnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073013#comment-14073013
 ] 

lukovnikov edited comment on SPARK-1405 at 7/24/14 9:10 AM:


@Isaac, I think it's at 
https://github.com/yinxusen/spark/blob/lda/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
 and here (https://github.com/apache/spark/pull/476/files) for the other 
changed files as well


was (Author: lukovnikov):
@Isaac, I think it's at 
https://github.com/yinxusen/spark/blob/lda/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Xusen Yin
Assignee: Xusen Yin
  Labels: features
 Fix For: 0.9.0

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Different from the current machine learning algorithms 
 in MLlib, instead of using optimization algorithms such as gradient descent, 
 LDA uses expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation step (imported from Lucene), 
 and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib

2014-07-24 Thread lukovnikov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073020#comment-14073020
 ] 

lukovnikov commented on SPARK-1405:
---

BTW, could this please be merged into the main branch? There are some conflicts.

 parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
 -

 Key: SPARK-1405
 URL: https://issues.apache.org/jira/browse/SPARK-1405
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Xusen Yin
Assignee: Xusen Yin
  Labels: features
 Fix For: 0.9.0

   Original Estimate: 336h
  Remaining Estimate: 336h

 Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts 
 topics from a text corpus. Different from the current machine learning algorithms 
 in MLlib, instead of using optimization algorithms such as gradient descent, 
 LDA uses expectation algorithms such as Gibbs sampling. 
 In this PR, I prepare an LDA implementation based on Gibbs sampling, with a 
 wholeTextFiles API (solved yet), a word segmentation step (imported from Lucene), 
 and a Gibbs sampling core.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2604) Spark Application hangs on yarn in edge case scenario of executor memory requirement

2014-07-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073085#comment-14073085
 ] 

Apache Spark commented on SPARK-2604:
-

User 'twinkle-sachdeva' has created a pull request for this issue:
https://github.com/apache/spark/pull/1571

 Spark Application hangs on yarn in edge case scenario of executor memory 
 requirement
 

 Key: SPARK-2604
 URL: https://issues.apache.org/jira/browse/SPARK-2604
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Twinkle Sachdeva

 In a YARN environment, let's say:
 MaxAM = maximum allocatable memory
 ExecMem = executor's memory
 if (MaxAM >= ExecMem && (MaxAM - ExecMem) < 384m)
   then maximum resource validation fails w.r.t. executor memory, and the 
 application master gets launched, but when resources are allocated and 
 validated again, they are returned and the application appears to hang.
 A typical use case is to ask for executor memory = maximum allowed memory as 
 per the YARN config.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2604) Spark Application hangs on yarn in edge case scenario of executor memory requirement

2014-07-24 Thread Twinkle Sachdeva (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073086#comment-14073086
 ] 

Twinkle Sachdeva commented on SPARK-2604:
-

Please review the pull request: https://github.com/apache/spark/pull/1571

 Spark Application hangs on yarn in edge case scenario of executor memory 
 requirement
 

 Key: SPARK-2604
 URL: https://issues.apache.org/jira/browse/SPARK-2604
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Twinkle Sachdeva

 In a YARN environment, let's say:
 MaxAM = maximum allocatable memory
 ExecMem = executor's memory
 if (MaxAM >= ExecMem && (MaxAM - ExecMem) < 384m)
   then maximum resource validation fails w.r.t. executor memory, and the 
 application master gets launched, but when resources are allocated and 
 validated again, they are returned and the application appears to hang.
 A typical use case is to ask for executor memory = maximum allowed memory as 
 per the YARN config.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2575) SVMWithSGD throwing Input Validation failed

2014-07-24 Thread navanee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073111#comment-14073111
 ] 

navanee commented on SPARK-2575:


Does Spark SVM support multinomial or binomial classification?
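
For reference, MLlib's SVMWithSGD at this point is a binary classifier: labels must be 0.0 or 1.0 (which is what the input validation checks), and sparse input is passed as LabeledPoint with Vectors.sparse. A minimal sketch with made-up data:

{code}
// Binary classification with sparse features; the data here is made up.
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.sparse(10, Array(0, 3), Array(1.0, 2.0))),
  LabeledPoint(0.0, Vectors.sparse(10, Array(1, 4), Array(3.0, 4.0)))
))
val model = SVMWithSGD.train(training, 100)
println(model.predict(Vectors.sparse(10, Array(0, 3), Array(1.0, 2.0))))
{code}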

 SVMWithSGD throwing Input Validation failed 
 

 Key: SPARK-2575
 URL: https://issues.apache.org/jira/browse/SPARK-2575
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.1
Reporter: navanee

 SVMWithSGD throws "Input validation failed" while using a sparse array as 
 input, even though SVMWithSGD accepts the LibSVM format.
 Exception trace:
 org.apache.spark.SparkException: Input validation failed.
   at 
 org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:145)
   at 
 org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:124)
   at 
 org.apache.spark.mllib.classification.SVMWithSGD$.train(SVM.scala:154)
   at 
 org.apache.spark.mllib.classification.SVMWithSGD$.train(SVM.scala:188)
   at org.apache.spark.mllib.classification.SVMWithSGD.train(SVM.scala)
   at com.xurmo.ai.hades.classification.algo.Svm.train(Svm.java:143)
   at 
 com.xurmo.ai.hades.classification.algo.SimpleSVMTest.generateModelFile(SimpleSVMTest.java:172)
   at 
 com.xurmo.ai.hades.classification.algo.SimpleSVMTest.trainSampleDataTest(SimpleSVMTest.java:65)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
   at 
 org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
   at 
 org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
   at 
 org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
   at 
 org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
   at 
 org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
   at 
 org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
   at 
 org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
   at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
   at 
 org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
   at 
 org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
   at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
   at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
   at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
   at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2666) when task is FetchFailed cancel running tasks of failedStage

2014-07-24 Thread Lianhui Wang (JIRA)
Lianhui Wang created SPARK-2666:
---

 Summary: when task is FetchFailed cancel running tasks of 
failedStage
 Key: SPARK-2666
 URL: https://issues.apache.org/jira/browse/SPARK-2666
 Project: Spark
  Issue Type: Bug
Reporter: Lianhui Wang


In DAGScheduler's handleTaskCompletion, when the reason for the failed task is 
FetchFailed, cancel the running tasks of failedStage before adding failedStage to 
the failedStages queue.




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file

2014-07-24 Thread Prashant Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073114#comment-14073114
 ] 

Prashant Sharma commented on SPARK-2576:


Looking at it.

 slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark 
 QL query on HDFS CSV file
 --

 Key: SPARK-2576
 URL: https://issues.apache.org/jira/browse/SPARK-2576
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.0.1
 Environment: One Mesos 0.19 master without zookeeper and 4 mesos 
 slaves. 
 JDK 1.7.51 and Scala 2.10.4 on all nodes. 
 HDFS from CDH5.0.3
 Spark version: I tried both with the pre-built CDH5 spark package available 
 from http://spark.apache.org/downloads.html and by packaging spark with sbt 
 0.13.2, JDK 1.7.51 and scala 2.10.4 as explained here 
 http://mesosphere.io/learn/run-spark-on-mesos/
 All nodes are running Debian 3.2.51-1 x86_64 GNU/Linux and have 
Reporter: Svend Vanderveken
Assignee: Yin Huai
Priority: Blocker

 Execution of a SQL query against HDFS systematically throws a class-not-found 
 exception on the slave nodes when executing the query.
 (this was originally reported on the user list: 
 http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html)
 Sample code (ran from spark-shell): 
 {code}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.createSchemaRDD
 case class Car(timestamp: Long, objectid: String, isGreen: Boolean)
 // I get the same error when pointing to the folder hdfs://vm28:8020/test/cardata
 val data = sc.textFile("hdfs://vm28:8020/test/cardata/part-0")
 val cars = data.map(_.split(",")).map(ar => Car(ar(0).toLong, ar(1), ar(2).toBoolean))
 cars.registerAsTable("mcars")
 val allgreens = sqlContext.sql("SELECT objectid from mcars where isGreen = true")
 allgreens.collect.take(10).foreach(println)
 {code}
 Stack trace on the slave nodes: 
 {code}
 I0716 13:01:16.215158 13631 exec.cpp:131] Version: 0.19.0
 I0716 13:01:16.219285 13656 exec.cpp:205] Executor registered on slave 
 20140714-142853-485682442-5050-25487-2
 14/07/16 13:01:16 INFO MesosExecutorBackend: Registered with Mesos as 
 executor ID 20140714-142853-485682442-5050-25487-2
 14/07/16 13:01:16 INFO SecurityManager: Changing view acls to: 
 mesos,mnubohadoop
 14/07/16 13:01:16 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(mesos, 
 mnubohadoop)
 14/07/16 13:01:17 INFO Slf4jLogger: Slf4jLogger started
 14/07/16 13:01:17 INFO Remoting: Starting remoting
 14/07/16 13:01:17 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://spark@vm23:38230]
 14/07/16 13:01:17 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://spark@vm23:38230]
 14/07/16 13:01:17 INFO SparkEnv: Connecting to MapOutputTracker: 
 akka.tcp://spark@vm28:41632/user/MapOutputTracker
 14/07/16 13:01:17 INFO SparkEnv: Connecting to BlockManagerMaster: 
 akka.tcp://spark@vm28:41632/user/BlockManagerMaster
 14/07/16 13:01:17 INFO DiskBlockManager: Created local directory at 
 /tmp/spark-local-20140716130117-8ea0
 14/07/16 13:01:17 INFO MemoryStore: MemoryStore started with capacity 294.9 
 MB.
 14/07/16 13:01:17 INFO ConnectionManager: Bound socket to port 44501 with id 
 = ConnectionManagerId(vm23-hulk-priv.mtl.mnubo.com,44501)
 14/07/16 13:01:17 INFO BlockManagerMaster: Trying to register BlockManager
 14/07/16 13:01:17 INFO BlockManagerMaster: Registered BlockManager
 14/07/16 13:01:17 INFO HttpFileServer: HTTP File server directory is 
 /tmp/spark-ccf6f36c-2541-4a25-8fe4-bb4ba00ee633
 14/07/16 13:01:17 INFO HttpServer: Starting HTTP Server
 14/07/16 13:01:18 INFO Executor: Using REPL class URI: http://vm28:33973
 14/07/16 13:01:18 INFO Executor: Running task ID 2
 14/07/16 13:01:18 INFO HttpBroadcast: Started reading broadcast variable 0
 14/07/16 13:01:18 INFO MemoryStore: ensureFreeSpace(125590) called with 
 curMem=0, maxMem=309225062
 14/07/16 13:01:18 INFO MemoryStore: Block broadcast_0 stored as values to 
 memory (estimated size 122.6 KB, free 294.8 MB)
 14/07/16 13:01:18 INFO HttpBroadcast: Reading broadcast variable 0 took 
 0.294602722 s
 14/07/16 13:01:19 INFO HadoopRDD: Input split: 
 hdfs://vm28:8020/test/cardata/part-0:23960450+23960451
 I0716 13:01:19.905113 13657 exec.cpp:378] Executor asked to shutdown
 14/07/16 13:01:20 ERROR Executor: Exception in task ID 2
 java.lang.NoClassDefFoundError: $line11/$read$
 at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19)
 at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19)
 at 

[jira] [Commented] (SPARK-2666) when task is FetchFailed cancel running tasks of failedStage

2014-07-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073116#comment-14073116
 ] 

Apache Spark commented on SPARK-2666:
-

User 'lianhuiwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/1572

 when task is FetchFailed cancel running tasks of failedStage
 

 Key: SPARK-2666
 URL: https://issues.apache.org/jira/browse/SPARK-2666
 Project: Spark
  Issue Type: Bug
Reporter: Lianhui Wang

 In DAGScheduler's handleTaskCompletion, when the reason for the failed task is 
 FetchFailed, cancel the running tasks of failedStage before adding failedStage to 
 the failedStages queue.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2456) Scheduler refactoring

2014-07-24 Thread Nan Zhu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2456?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073122#comment-14073122
 ] 

Nan Zhu commented on SPARK-2456:


maybe it's also related: https://github.com/apache/spark/pull/637

 Scheduler refactoring
 -

 Key: SPARK-2456
 URL: https://issues.apache.org/jira/browse/SPARK-2456
 Project: Spark
  Issue Type: Improvement
Reporter: Reynold Xin
Assignee: Reynold Xin

 This is an umbrella ticket to track scheduler refactoring. We want to clearly 
 define semantics and responsibilities of each component, and define explicit 
 public interfaces for them so it is easier to understand and to contribute 
 (also less buggy).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2667) getCallSiteInfo doesn't take into account that graphx is part of spark.

2014-07-24 Thread Adrian Budau (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adrian Budau updated SPARK-2667:


Description: getCallSiteInfo from org.apache.spark.util.Utils uses a regex 
pattern to determine whether a function is part of Spark or not. At the moment 
this pattern does not cover GraphX.
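
An illustrative sketch of the kind of change being described; the pattern below is an approximation, not the exact regex used by Utils.getCallSiteInfo:

{code}
// Approximation of the idea: treat org.apache.spark classes, including graphx,
// as "inside Spark" when walking the stack to find the user call site.
val SPARK_CLASS_REGEX = """^org\.apache\.spark(\.api\.java)?(\.graphx)?(\.rdd)?\.[A-Z]""".r

def isSparkClass(className: String): Boolean =
  SPARK_CLASS_REGEX.findFirstIn(className).isDefined

// isSparkClass("org.apache.spark.graphx.Graph") // true with the added graphx group
// isSparkClass("com.example.MyJob")             // false
{code}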

 getCallSiteInfo doesn't take into account that graphx is part of spark.
 ---

 Key: SPARK-2667
 URL: https://issues.apache.org/jira/browse/SPARK-2667
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.0
 Environment: Mac OS X, although it occurs on all platforms
Reporter: Adrian Budau
Priority: Trivial

 getCallSiteInfo from org.apache.spark.util.Utils uses a regex pattern to 
 determine whether a function is part of Spark or not. At the moment this pattern 
 does not cover GraphX.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2667) getCallSiteInfo doesn't take into account that graphx is part of spark.

2014-07-24 Thread Adrian Budau (JIRA)
Adrian Budau created SPARK-2667:
---

 Summary: getCallSiteInfo doesn't take into account that graphx is 
part of spark.
 Key: SPARK-2667
 URL: https://issues.apache.org/jira/browse/SPARK-2667
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 1.0.0
 Environment: Mac OS X, although it occurs on all platforms
Reporter: Adrian Budau
Priority: Trivial






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2668) Support log4j log to yarn container log directory

2014-07-24 Thread Peng Zhang (JIRA)
Peng Zhang created SPARK-2668:
-

 Summary: Support log4j log to yarn container log directory
 Key: SPARK-2668
 URL: https://issues.apache.org/jira/browse/SPARK-2668
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Peng Zhang
 Fix For: 1.0.0


Assign the value of the YARN container log directory to the Java option 
spark.yarn.log.dir, so a user-defined log4j.properties can reference this value 
and write logs to the YARN container directory.
Otherwise, a user-defined file appender will log to the CWD, and the files will 
neither be displayed in the YARN UI nor be aggregated to the HDFS log directory 
after the job finishes.

User defined log4j.properties reference example:
{code}
log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log
{code}
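
A slightly fuller sketch of such a log4j.properties using the proposed spark.yarn.log.dir value; the appender name, sizes, and layout below are arbitrary choices, not part of the proposal:

{code}
# Hypothetical user-defined log4j.properties referencing the proposed value
log4j.rootLogger=INFO, rolling_file
log4j.appender.rolling_file=org.apache.log4j.RollingFileAppender
log4j.appender.rolling_file.File=${spark.yarn.log.dir}/spark.log
log4j.appender.rolling_file.MaxFileSize=50MB
log4j.appender.rolling_file.MaxBackupIndex=5
log4j.appender.rolling_file.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling_file.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
{code}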





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2668) Support log4j log to yarn container log directory

2014-07-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073199#comment-14073199
 ] 

Apache Spark commented on SPARK-2668:
-

User 'renozhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/1573

 Support log4j log to yarn container log directory
 -

 Key: SPARK-2668
 URL: https://issues.apache.org/jira/browse/SPARK-2668
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Peng Zhang
 Fix For: 1.0.0


 Assign the value of the YARN container log directory to the Java option 
 spark.yarn.log.dir, so a user-defined log4j.properties can reference this 
 value and write logs to the YARN container directory.
 Otherwise, a user-defined file appender will log to the CWD, and the files will 
 neither be displayed in the YARN UI nor be aggregated to the HDFS log directory 
 after the job finishes.
 User defined log4j.properties reference example:
 {code}
 log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2150) Provide direct link to finished application UI in yarn resource manager UI

2014-07-24 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-2150:
-

Assignee: Rahul Singhal

 Provide direct link to finished application UI in yarn resource manager UI
 --

 Key: SPARK-2150
 URL: https://issues.apache.org/jira/browse/SPARK-2150
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.0
Reporter: Rahul Singhal
Assignee: Rahul Singhal
Priority: Minor
 Fix For: 1.1.0


 Currently the link that is provide as the tracking URL for a finished 
 application in yarn resource manager UI is of the Spark history server home 
 page. We should provide a direct link to the application UI so that the user 
 does not have to figure out the correspondence between yarn application ID 
 and the link on the Spark history server home page.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1112) When spark.akka.frameSize > 10, task results bigger than 10MiB block execution

2014-07-24 Thread DjvuLee (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073244#comment-14073244
 ] 

DjvuLee commented on SPARK-1112:


Has anyone tested version 0.9.2? I found it also fails, while v1.0.1 and 
v1.1.0 are OK. 

 When spark.akka.frameSize > 10, task results bigger than 10MiB block execution
 --

 Key: SPARK-1112
 URL: https://issues.apache.org/jira/browse/SPARK-1112
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.0, 1.0.0
Reporter: Guillaume Pitel
Assignee: Xiangrui Meng
Priority: Blocker
 Fix For: 0.9.2


 When I set the spark.akka.frameSize to something over 10, the messages sent 
 from the executors to the driver completely block the execution if the 
 message is bigger than 10MiB and smaller than the frameSize (if it's above 
 the frameSize, it's ok)
 The workaround is to set spark.akka.frameSize to 10. In this case, since 
 0.8.1, the blockManager deals with the data to be sent. It seems slower than 
 Akka direct messages, though.
 The configuration seems to be correctly read (see actorSystemConfig.txt), so 
 I don't see where the 10MiB could come from 
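
For context, the frame size is an ordinary Spark setting whose value is in MB, so pinning it looks like this (a minimal sketch):

{code}
// Set the Akka frame size explicitly (value in MB); 10 was the default at the time.
import org.apache.spark.SparkConf
val conf = new SparkConf().set("spark.akka.frameSize", "10")
{code}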



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2668) Support log4j log to yarn container log directory

2014-07-24 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073246#comment-14073246
 ] 

Thomas Graves commented on SPARK-2668:
--

Sorry, I don't follow what you are saying here. Spark on YARN uses the 
YARN-approved logging directories, and aggregation works fine. 

Perhaps YARN is misconfigured?

 Support log4j log to yarn container log directory
 -

 Key: SPARK-2668
 URL: https://issues.apache.org/jira/browse/SPARK-2668
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Peng Zhang
 Fix For: 1.0.0


 Assign the value of the YARN container log directory to the Java option 
 spark.yarn.log.dir, so a user-defined log4j.properties can reference this 
 value and write logs to the YARN container directory.
 Otherwise, a user-defined file appender will log to the CWD, and the files will 
 neither be displayed in the YARN UI nor be aggregated to the HDFS log directory 
 after the job finishes.
 User defined log4j.properties reference example:
 {code}
 log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2668) Support log4j log to yarn container log directory

2014-07-24 Thread Peng Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073278#comment-14073278
 ] 

Peng Zhang commented on SPARK-2668:
---

[~tgraves] The original logging works fine, and the log is written to the YARN 
container log directory under the name stderr.
But when I want to define my own log4j configuration, for example using a 
RollingAppender to keep the log file from growing too big, especially for Spark 
Streaming (7 x 24 hours), I should specify the base directory for the log.

So adding spark.yarn.log.dir will help, since it can be referenced in 
log4j.properties like the example in the description.
Otherwise, log files will be located in the container's working directory.

 Support log4j log to yarn container log directory
 -

 Key: SPARK-2668
 URL: https://issues.apache.org/jira/browse/SPARK-2668
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Peng Zhang
 Fix For: 1.0.0


 Assign the value of the YARN container log directory to the Java option 
 spark.yarn.log.dir, so a user-defined log4j.properties can reference this 
 value and write logs to the YARN container directory.
 Otherwise, a user-defined file appender will log to the CWD, and the files will 
 neither be displayed in the YARN UI nor be aggregated to the HDFS log directory 
 after the job finishes.
 User defined log4j.properties reference example:
 {code}
 log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2668) Support log4j log to yarn container log directory

2014-07-24 Thread Peng Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073278#comment-14073278
 ] 

Peng Zhang edited comment on SPARK-2668 at 7/24/14 3:12 PM:


[~tgraves] The original logging works fine, and the log is written to the YARN 
container log directory under the name stderr.
But when I want to define my own log4j configuration, for example using a 
RollingAppender to keep the log file from growing too big, especially for Spark 
Streaming (7 x 24 hours), I should specify the base directory for the log.

So adding spark.yarn.log.dir will help, since it can be referenced in 
log4j.properties like the example in the description.
Otherwise, log files will be located in the container's working directory.


was (Author: peng.zhang):
[~tgraves] Original log works fine, and log will be written to yarn container 
log directory and named as stderr.
But when I want to define my own log4j configuration, for example using 
RollingAppender to avoid log file too big, especially for spark Streaming(7 x 
24 hours), I should can't specify the base directory for log.

So adding spark.yarn.log.dir will help for reference in log4j.properties, 
like the example in description.
Otherwise, log files will be located in container's working directory.

 Support log4j log to yarn container log directory
 -

 Key: SPARK-2668
 URL: https://issues.apache.org/jira/browse/SPARK-2668
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Peng Zhang
 Fix For: 1.0.0


 Assign the value of the YARN container log directory to the Java option 
 spark.yarn.log.dir, so a user-defined log4j.properties can reference this 
 value and write logs to the YARN container directory.
 Otherwise, a user-defined file appender will log to the CWD, and the files will 
 neither be displayed in the YARN UI nor be aggregated to the HDFS log directory 
 after the job finishes.
 User defined log4j.properties reference example:
 {code}
 log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2668) Support log4j log to yarn container log directory

2014-07-24 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073310#comment-14073310
 ] 

Thomas Graves commented on SPARK-2668:
--

Oh, I see, you just want a variable to reference from the log4j config. I 
understand the use case, and really YARN should solve this for you. There are 
JIRAs out there to support long-running tasks on YARN; the one for logs is: 
https://issues.apache.org/jira/browse/YARN-1104

This might be OK as a short-term workaround for that, since it is just reading the 
value and not allowing the user to set it. I need to look at it a bit closer.


 Support log4j log to yarn container log directory
 -

 Key: SPARK-2668
 URL: https://issues.apache.org/jira/browse/SPARK-2668
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Reporter: Peng Zhang
 Fix For: 1.0.0


 Assign the value of the YARN container log directory to the Java option 
 spark.yarn.log.dir, so a user-defined log4j.properties can reference this 
 value and write logs to the YARN container directory.
 Otherwise, a user-defined file appender will log to the CWD, and the files will 
 neither be displayed in the YARN UI nor be aggregated to the HDFS log directory 
 after the job finishes.
 User defined log4j.properties reference example:
 {code}
 log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2575) SVMWithSGD throwing Input Validation failed

2014-07-24 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073329#comment-14073329
 ] 

Xiangrui Meng commented on SPARK-2575:
--

[~dbtsai] sent a PR for multinomial logistic regression: 
https://github.com/apache/spark/pull/1379

Btw, is your problem solved?

 SVMWithSGD throwing Input Validation failed 
 

 Key: SPARK-2575
 URL: https://issues.apache.org/jira/browse/SPARK-2575
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.0.1
Reporter: navanee

 SVMWithSGD throws "Input validation failed" while using a sparse array as 
 input, even though SVMWithSGD accepts the LibSVM format.
 Exception trace:
 org.apache.spark.SparkException: Input validation failed.
   at 
 org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:145)
   at 
 org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:124)
   at 
 org.apache.spark.mllib.classification.SVMWithSGD$.train(SVM.scala:154)
   at 
 org.apache.spark.mllib.classification.SVMWithSGD$.train(SVM.scala:188)
   at org.apache.spark.mllib.classification.SVMWithSGD.train(SVM.scala)
   at com.xurmo.ai.hades.classification.algo.Svm.train(Svm.java:143)
   at 
 com.xurmo.ai.hades.classification.algo.SimpleSVMTest.generateModelFile(SimpleSVMTest.java:172)
   at 
 com.xurmo.ai.hades.classification.algo.SimpleSVMTest.trainSampleDataTest(SimpleSVMTest.java:65)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   at java.lang.reflect.Method.invoke(Method.java:606)
   at 
 org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:44)
   at 
 org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
   at 
 org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:41)
   at 
 org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
   at 
 org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
   at 
 org.junit.runners.BlockJUnit4ClassRunner.runNotIgnored(BlockJUnit4ClassRunner.java:79)
   at 
 org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:71)
   at 
 org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:49)
   at org.junit.runners.ParentRunner$3.run(ParentRunner.java:193)
   at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:52)
   at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:191)
   at org.junit.runners.ParentRunner.access$000(ParentRunner.java:42)
   at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:184)
   at org.junit.runners.ParentRunner.run(ParentRunner.java:236)
   at 
 org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
   at 
 org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
   at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
   at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
   at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
   at 
 org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2669) Hadoop configuration is not localised when submitting job in yarn-cluster mode

2014-07-24 Thread Maxim Ivanov (JIRA)
Maxim Ivanov created SPARK-2669:
---

 Summary: Hadoop configuration is not localised when submitting job 
in yarn-cluster mode
 Key: SPARK-2669
 URL: https://issues.apache.org/jira/browse/SPARK-2669
 Project: Spark
  Issue Type: Bug
Reporter: Maxim Ivanov


I'd like to propose a fix for a problem when Hadoop configuration is not 
localized when job is submitted in yarn-cluster mode. Here is a description 
from github pull request https://github.com/apache/spark/pull/1574

This patch fixes a problem where, when the Spark driver is run in a container
managed by the YARN ResourceManager, it inherits its configuration from the
NodeManager process, which can be different from the Hadoop
configuration present on the client (submitting machine). The problem is
most visible when the fs.defaultFS property differs between the two.

Hadoop MR solves it by serializing client's Hadoop configuration into
job.xml in application staging directory and then making Application
Master to use it. That guarantees that regardless of execution nodes
configurations all application containers use same config identical to
one on the client side.

This patch uses a similar approach. YARN ClientBase serializes the
configuration and adds it to ClientDistributedCacheManager under the
job.xml link name. ClientDistributedCacheManager then utilizes the
Hadoop localizer to deliver it to whatever container is started by this
application, including the one running the Spark driver.

YARN ClientBase also adds SPARK_LOCAL_HADOOPCONF env variable to AM
container request which is then used by SparkHadoopUtil.newConfiguration
to trigger new behavior when machine-wide hadoop configuration is merged
with application specific job.xml (exactly how it is done in Hadoop MR).

SparkContext then follows the same approach, adding the
SPARK_LOCAL_HADOOPCONF env variable to all spawned containers to make them use
the client-side Hadoop configuration.

Also all the references to new Configuration() which might be executed
on YARN cluster side are changed to use SparkHadoopUtil.get.conf

Please note that it fixes only core Spark, the part which I am
comfortable testing and verifying. I didn't descend into the
streaming/shark directories, so things might need to be changed there too.




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2669) Hadoop configuration is not localised when submitting job in yarn-cluster mode

2014-07-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073338#comment-14073338
 ] 

Apache Spark commented on SPARK-2669:
-

User 'redbaron' has created a pull request for this issue:
https://github.com/apache/spark/pull/1574

 Hadoop configuration is not localised when submitting job in yarn-cluster mode
 --

 Key: SPARK-2669
 URL: https://issues.apache.org/jira/browse/SPARK-2669
 Project: Spark
  Issue Type: Bug
Reporter: Maxim Ivanov

 I'd like to propose a fix for a problem when Hadoop configuration is not 
 localized when job is submitted in yarn-cluster mode. Here is a description 
 from github pull request https://github.com/apache/spark/pull/1574
 This patch fixes a problem where, when the Spark driver is run in a container
 managed by the YARN ResourceManager, it inherits its configuration from the
 NodeManager process, which can be different from the Hadoop
 configuration present on the client (submitting machine). The problem is
 most visible when the fs.defaultFS property differs between the two.
 Hadoop MR solves it by serializing client's Hadoop configuration into
 job.xml in application staging directory and then making Application
 Master to use it. That guarantees that regardless of execution nodes
 configurations all application containers use same config identical to
 one on the client side.
 This patch uses a similar approach. YARN ClientBase serializes the
 configuration and adds it to ClientDistributedCacheManager under the
 job.xml link name. ClientDistributedCacheManager then utilizes the
 Hadoop localizer to deliver it to whatever container is started by this
 application, including the one running the Spark driver.
 YARN ClientBase also adds SPARK_LOCAL_HADOOPCONF env variable to AM
 container request which is then used by SparkHadoopUtil.newConfiguration
 to trigger new behavior when machine-wide hadoop configuration is merged
 with application specific job.xml (exactly how it is done in Hadoop MR).
 SparkContext then follows the same approach, adding the
 SPARK_LOCAL_HADOOPCONF env variable to all spawned containers to make them use
 the client-side Hadoop configuration.
 Also all the references to new Configuration() which might be executed
 on YARN cluster side are changed to use SparkHadoopUtil.get.conf
 Please note that it fixes only core Spark, the part which I am
 comfortable testing and verifying. I didn't descend into the
 streaming/shark directories, so things might need to be changed there too.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1264) Documentation for setting heap sizes across all configurations

2014-07-24 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson updated SPARK-1264:
--

Assignee: (was: Aaron Davidson)

 Documentation for setting heap sizes across all configurations
 --

 Key: SPARK-1264
 URL: https://issues.apache.org/jira/browse/SPARK-1264
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 1.0.0
Reporter: Andrew Ash

 As a user, there are lots of places to configure heap sizes, and it takes a 
 bit of trial and error to figure out how to configure what you want.
 We need some clearer documentation on how to set these for the cross product 
 of Spark components (master, worker, executor, driver, shell) and deployment 
 modes (Standalone, YARN, Mesos, EC2?).
 I'm happy to do the authoring if someone can help pull together the relevant 
 details.
 Here's the best I've got so far:
 {noformat}
 # Standalone cluster
 Master - SPARK_DAEMON_MEMORY - default: 512mb
 Worker - SPARK_DAEMON_MEMORY vs SPARK_WORKER_MEMORY? - default: ?  See 
 WorkerArguments.inferDefaultMemory()
 Executor - spark.executor.memory
 Driver - SPARK_DRIVER_MEMORY - default: 512mb
 Shell - A pre-built driver so SPARK_DRIVER_MEMORY - default: 512mb
 # EC2 cluster
 Master - ?
 Worker - ?
 Executor - ?
 Driver - ?
 Shell - ?
 # Mesos cluster
 Master - SPARK_DAEMON_MEMORY
 Worker - SPARK_DAEMON_MEMORY
 Executor - SPARK_EXECUTOR_MEMORY
 Driver - SPARK_DRIVER_MEMORY
 Shell - A pre-built driver so SPARK_DRIVER_MEMORY
 # YARN cluster
 Master - SPARK_MASTER_MEMORY ?
 Worker - SPARK_WORKER_MEMORY ?
 Executor - SPARK_EXECUTOR_MEMORY
 Driver - SPARK_DRIVER_MEMORY
 Shell - A pre-built driver so SPARK_DRIVER_MEMORY
 {noformat}
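 As a strawman for the eventual docs (all values below are placeholders, and the 
 defaults listed above still need to be verified), a standalone cluster might be 
 configured along these lines:
 {noformat}
 # conf/spark-env.sh on the master and worker machines (standalone mode)
 SPARK_DAEMON_MEMORY=1g       # heap for the Master and Worker daemon processes
 SPARK_WORKER_MEMORY=16g      # total memory a Worker can grant to executors

 # per-application settings (spark-defaults.conf / SparkConf / environment)
 spark.executor.memory   4g   # heap of each executor
 SPARK_DRIVER_MEMORY=2g       # heap of the driver (and therefore the shell)
 {noformat}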



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2583) ConnectionManager cannot distinguish whether error occurred or not

2014-07-24 Thread Kousuke Saruta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073419#comment-14073419
 ] 

Kousuke Saruta commented on SPARK-2583:
---

I have added some test cases to my PR for this issue.

 ConnectionManager cannot distinguish whether error occurred or not
 --

 Key: SPARK-2583
 URL: https://issues.apache.org/jira/browse/SPARK-2583
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta
Priority: Critical

 ConnectionManager#handleMessage sends an empty message to the peer regardless 
 of whether an error occurred in onReceiveCallback.
 {code}
 val ackMessage = if (onReceiveCallback != null) {
   logDebug("Calling back")
   onReceiveCallback(bufferMessage, connectionManagerId)
 } else {
   logDebug("Not calling back as callback is null")
   None
 }
 if (ackMessage.isDefined) {
   if (!ackMessage.get.isInstanceOf[BufferMessage]) {
     logDebug("Response to " + bufferMessage + " is not a buffer message, it is of type "
       + ackMessage.get.getClass)
   } else if (!ackMessage.get.asInstanceOf[BufferMessage].hasAckId) {
     logDebug("Response to " + bufferMessage + " does not have ack id set")
     ackMessage.get.asInstanceOf[BufferMessage].ackId = bufferMessage.id
   }
 }
 // We have no way to tell peer whether error occurred or not
 sendMessage(connectionManagerId, ackMessage.getOrElse {
   Message.createBufferMessage(bufferMessage.id)
 })
 }
 {code}
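 One possible direction, sketched here with made-up names rather than Spark's 
 actual Message/BufferMessage classes: carry an explicit error flag in the ack 
 so the receiver can tell an empty reply from a failure.
 {code}
 // Minimal self-contained sketch of the idea; Ack and buildAck are illustrative
 // names only, not Spark's real ConnectionManager types.
 case class Ack(ackId: Long, hasError: Boolean, payload: Array[Byte])

 def buildAck(requestId: Long, result: Either[Throwable, Array[Byte]]): Ack =
   result match {
     case Right(bytes) => Ack(requestId, hasError = false, bytes)
     case Left(err)    => Ack(requestId, hasError = true, Array.empty[Byte])
   }
 {code}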



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2479) Comparing floating-point numbers using relative error in UnitTests

2014-07-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073430#comment-14073430
 ] 

Apache Spark commented on SPARK-2479:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/1576

 Comparing floating-point numbers using relative error in UnitTests
 --

 Key: SPARK-2479
 URL: https://issues.apache.org/jira/browse/SPARK-2479
 Project: Spark
  Issue Type: Improvement
Reporter: DB Tsai
Assignee: DB Tsai

 Floating point math is not exact, and most floating-point numbers end up 
 being slightly imprecise due to rounding errors. Simple values like 0.1 
 cannot be precisely represented using binary floating point numbers, and the 
 limited precision of floating point numbers means that slight changes in the 
 order of operations or the precision of intermediates can change the result. 
 That means that comparing two floats to see if they are equal is usually not 
 what we want. As long as this imprecision stays small, it can usually be 
 ignored.
 See the following famous article for detail.
 http://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/
 For example:
   float a = 0.15 + 0.15
   float b = 0.1 + 0.2
   if (a == b)  // can be false!
   if (a >= b)  // can also be false!
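 A minimal sketch of the relative-error idea (not the helper the pull request 
 actually adds):
 {code}
 // Compare two doubles by relative error instead of exact equality.
 def approxEqual(a: Double, b: Double, relTol: Double = 1e-8): Boolean = {
   val diff = math.abs(a - b)
   val norm = math.max(math.abs(a), math.abs(b))
   norm == 0.0 || diff / norm <= relTol
 }

 assert(approxEqual(0.15 + 0.15, 0.1 + 0.2))  // passes even though == may not
 {code}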



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2538) External aggregation in Python

2014-07-24 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia updated SPARK-2538:
-

Priority: Critical  (was: Major)

 External aggregation in Python
 --

 Key: SPARK-2538
 URL: https://issues.apache.org/jira/browse/SPARK-2538
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.0.0, 1.0.1
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical
  Labels: pyspark
 Fix For: 1.0.0, 1.0.1

   Original Estimate: 72h
  Remaining Estimate: 72h

 For huge reduce tasks, users will get an out-of-memory exception when all the 
 data cannot fit in memory.
 It should spill some of the data to disk and then merge it back, just like 
 what we do in Scala. 
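 For illustration, a minimal Scala sketch of the spill-and-merge idea (word 
 counting; deliberately simplified, and not PySpark's eventual implementation 
 nor Spark's ExternalAppendOnlyMap):
 {code}
 import java.io._
 import scala.collection.mutable

 // Count words while spilling partial maps to disk once a size threshold is hit,
 // then merge the in-memory map with all spilled partial maps.
 def externalCount(words: Iterator[String], maxInMemory: Int): Map[String, Long] = {
   val spills = mutable.ArrayBuffer[File]()
   var current = mutable.Map[String, Long]()

   def spill(): Unit = {
     val f = File.createTempFile("agg-spill", ".tmp")
     val out = new ObjectOutputStream(new FileOutputStream(f))
     try out.writeObject(current.toMap) finally out.close()
     spills += f
     current = mutable.Map[String, Long]()
   }

   for (w <- words) {
     current(w) = current.getOrElse(w, 0L) + 1L
     if (current.size >= maxInMemory) spill()
   }

   // Simplified merge: each spill is read back whole; a real implementation
   // would stream sorted spills to keep memory bounded.
   val merged = mutable.Map[String, Long]() ++ current
   for (f <- spills) {
     val in = new ObjectInputStream(new FileInputStream(f))
     val partial = try in.readObject().asInstanceOf[Map[String, Long]] finally in.close()
     for ((k, v) <- partial) merged(k) = merged.getOrElse(k, 0L) + v
     f.delete()
   }
   merged.toMap
 }
 {code}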



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2670) FetchFailedException should be thrown when local fetch has failed

2014-07-24 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-2670:
-

 Summary: FetchFailedException should be thrown when local fetch 
has failed
 Key: SPARK-2670
 URL: https://issues.apache.org/jira/browse/SPARK-2670
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Kousuke Saruta


In BasicBlockFetchIterator, when a remote fetch has failed, a FetchResult 
whose size is -1 is put into results.

{code}
case None => {
  logError("Could not get block(s) from " + cmId)
  for ((blockId, size) <- req.blocks) {
    results.put(new FetchResult(blockId, -1, null))
  }
{code}

The size -1 means the fetch failed, and BlockStoreShuffleFetcher#unpackBlock 
throws FetchFailedException so that we can retry.

But when a local fetch has failed, no failed FetchResult is put into results, 
so we cannot retry that fetch.
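A sketch of the direction, reusing the names from the snippet above where 
possible; localBlocks and fetchLocal are stand-ins for the actual local-fetch 
path, not real Spark identifiers:

{code}
// Treat a failed local fetch the same way as a failed remote fetch so that
// unpackBlock can throw FetchFailedException and the task can be retried.
for (blockId <- localBlocks) {
  try {
    results.put(fetchLocal(blockId))  // success: put the result as today
  } catch {
    case e: Exception =>
      logError("Could not get local block " + blockId, e)
      results.put(new FetchResult(blockId, -1, null))  // size -1 signals failure
  }
}
{code}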



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2619) Configurable file-mode for spark/bin folder in the .deb package.

2014-07-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2619:
---

Assignee: Christian Tzolov

 Configurable file-mode for spark/bin folder in the .deb package. 
 -

 Key: SPARK-2619
 URL: https://issues.apache.org/jira/browse/SPARK-2619
 Project: Spark
  Issue Type: Improvement
  Components: Build, Deploy
Reporter: Christian Tzolov
Assignee: Christian Tzolov

 Currently the /bin folder in the .deb package is hardcoded to mode 744, so only 
 the root user (deb.user defaults to root) can run Spark jobs. 
 If we make the /bin file mode a configurable Maven property, we can easily 
 generate a package with less restrictive execution rights. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2619) Configurable file-mode for spark/bin folder in the .deb package.

2014-07-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2619.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1531
[https://github.com/apache/spark/pull/1531]

 Configurable file-mode for spark/bin folder in the .deb package. 
 -

 Key: SPARK-2619
 URL: https://issues.apache.org/jira/browse/SPARK-2619
 Project: Spark
  Issue Type: Improvement
  Components: Build, Deploy
Reporter: Christian Tzolov
Assignee: Christian Tzolov
 Fix For: 1.1.0


 Currently the /bin folder in the .deb package is hardcoded to mode 744, so only 
 the root user (deb.user defaults to root) can run Spark jobs. 
 If we make the /bin file mode a configurable Maven property, we can easily 
 generate a package with less restrictive execution rights. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2671) BlockObjectWriter should create parent directory when the directory doesn't exist

2014-07-24 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-2671:
-

 Summary: BlockObjectWriter should create parent directory when the 
directory doesn't exist
 Key: SPARK-2671
 URL: https://issues.apache.org/jira/browse/SPARK-2671
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Kousuke Saruta
Priority: Minor


BlockObjectWriter#open expects the parent directory to be present. 

{code}

  override def open(): BlockObjectWriter = {
fos = new FileOutputStream(file, true)
ts = new TimeTrackingOutputStream(fos)
channel = fos.getChannel()
lastValidPosition = initialPosition
bs = compressStream(new BufferedOutputStream(ts, bufferSize))
objOut = serializer.newInstance().serializeStream(bs)
initialized = true
this
  }
{code}

Normally, the parent directory is created by DiskBlockManager#createLocalDirs, 
but just in case, BlockObjectWriter#open should check whether the directory 
exists and create it if it does not.

I think a recoverable error like this should be recovered from.
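A minimal sketch of the proposed check, using only standard java.io calls 
(openWithParent is a made-up helper name, not the actual patch):

{code}
import java.io.{File, FileOutputStream, IOException}

// Create the parent directory if it is missing before opening the stream.
def openWithParent(file: File): FileOutputStream = {
  val parent = file.getParentFile
  if (parent != null && !parent.exists() && !parent.mkdirs()) {
    throw new IOException("Could not create directory " + parent)
  }
  new FileOutputStream(file, true)
}
{code}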



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2603) Remove unnecessary toMap and toList in converting Java collections to Scala collections JsonRDD.scala

2014-07-24 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2603:


Fix Version/s: 1.0.2
   1.1.0

 Remove unnecessary toMap and toList in converting Java collections to Scala 
 collections JsonRDD.scala
 -

 Key: SPARK-2603
 URL: https://issues.apache.org/jira/browse/SPARK-2603
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Minor
 Fix For: 1.1.0, 1.0.2


 In JsonRDD.scalafy, we are using toMap/toList to convert a Java Map/List to a 
 Scala one. These two operations are pretty expensive because they read every 
 element from the Java Map/List and then load them into a Scala Map/List. We can 
 use Scala wrappers to wrap those Java collections instead of using toMap/toList.
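 For example, a sketch using scala.collection.JavaConverters, which wraps the 
 Java collection in a live view instead of copying its elements (the helper 
 names are illustrative):
 {code}
 import java.util.{List => JList, Map => JMap}
 import scala.collection.JavaConverters._

 // asScala returns a wrapper over the Java collection; no up-front copy is made.
 def wrapMap[K, V](jmap: JMap[K, V]): scala.collection.Map[K, V] = jmap.asScala
 def wrapList[T](jlist: JList[T]): Seq[T] = jlist.asScala
 {code}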



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2603) Remove unnecessary toMap and toList in converting Java collections to Scala collections JsonRDD.scala

2014-07-24 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-2603.
-

Resolution: Fixed

 Remove unnecessary toMap and toList in converting Java collections to Scala 
 collections JsonRDD.scala
 -

 Key: SPARK-2603
 URL: https://issues.apache.org/jira/browse/SPARK-2603
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Minor

 In JsonRDD.scalafy, we are using toMap/toList to convert a Java Map/List to a 
 Scala one. These two operations are pretty expensive because they read every 
 element from the Java Map/List and then load them into a Scala Map/List. We can 
 use Scala wrappers to wrap those Java collections instead of using toMap/toList.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2672) support compressed file in wholeFile()

2014-07-24 Thread Davies Liu (JIRA)
Davies Liu created SPARK-2672:
-

 Summary: support compressed file in wholeFile()
 Key: SPARK-2672
 URL: https://issues.apache.org/jira/browse/SPARK-2672
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Affects Versions: 1.0.0, 1.0.1
Reporter: Davies Liu
 Fix For: 1.1.0


wholeFile() cannot read compressed files; it should be able to, just like 
textFile().



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2673) Improve Spark so that we can attach Debugger to Executors easily

2014-07-24 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-2673:
-

 Summary: Improve Spark so that we can attach Debugger to Executors 
easily
 Key: SPARK-2673
 URL: https://issues.apache.org/jira/browse/SPARK-2673
 Project: Spark
  Issue Type: Improvement
Reporter: Kousuke Saruta


In the current implementation, it is difficult to attach a debugger to each 
Executor in the cluster, for the following reasons.

1) It's difficult for Executors running on the same machine to open a debug port, 
because we can only pass the same JVM options to all executors.

2) Even if each Executor running on the same machine could open a unique debug 
port, it's a bother to look up the debug port of each executor.

To solve those problems, I think the following two improvements are needed.

1) Enable each executor to open a unique debug port on its machine (a possible 
configuration is sketched below).
2) Extend the WebUI to show the debug ports open in each executor.
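For item 1), a hedged sketch of what can be done today: set the JDWP agent via 
spark.executor.extraJavaOptions and let the JVM pick a free port (the chosen 
port is printed in each executor's stdout log, which is exactly the lookup pain 
point item 2) is about). The class and jar names below are placeholders.

{code}
./bin/spark-submit \
  --conf "spark.executor.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n" \
  --class com.example.MyApp myapp.jar
{code}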



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2464) Twitter Receiver does not stop correctly when streamingContext.stop is called

2014-07-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073518#comment-14073518
 ] 

Apache Spark commented on SPARK-2464:
-

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/1577

 Twitter Receiver does not stop correctly when streamingContext.stop is called
 -

 Key: SPARK-2464
 URL: https://issues.apache.org/jira/browse/SPARK-2464
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.0, 1.0.1
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1154) Spark fills up disk with app-* folders

2014-07-24 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073521#comment-14073521
 ] 

Andrew Ash commented on SPARK-1154:
---

For the record, this is Evan's PR that closed this ticket: 
https://github.com/apache/spark/pull/288

 Spark fills up disk with app-* folders
 --

 Key: SPARK-1154
 URL: https://issues.apache.org/jira/browse/SPARK-1154
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Reporter: Evan Chan
Assignee: Mingyu Kim
Priority: Critical
  Labels: starter
 Fix For: 1.0.0


 Current version of Spark fills up the disk with many app-* folders:
 $ ls /var/lib/spark
 app-20140210022347-0597  app-20140212173327-0627  app-20140218154110-0657  
 app-20140225232537-0017  app-20140225233548-0047
 app-20140210022407-0598  app-20140212173347-0628  app-20140218154130-0658  
 app-20140225232551-0018  app-20140225233556-0048
 app-20140210022427-0599  app-20140212173754-0629  app-20140218164232-0659  
 app-20140225232611-0019  app-20140225233603-0049
 app-20140210022447-0600  app-20140212182235-0630  app-20140218165133-0660  
 app-20140225232802-0020  app-20140225233610-0050
 app-20140210022508-0601  app-20140212182256-0631  app-20140218165148-0661  
 app-20140225232822-0021  app-20140225233617-0051
 app-20140210022528-0602  app-2014021314-0632  app-20140218165225-0662  
 app-20140225232940-0022  app-20140225233624-0052
 app-20140211024356-0603  app-20140213002026-0633  app-20140218165249-0663  
 app-20140225233002-0023  app-20140225233631-0053
 app-20140211024417-0604  app-20140213154948-0634  app-20140218172030-0664  
 app-20140225233056-0024  app-20140225233725-0054
 app-20140211024437-0605  app-20140213171810-0635  app-20140218193853-0665  
 app-20140225233108-0025  app-20140225233731-0055
 app-20140211024457-0606  app-20140213193637-0636  app-20140218194442-0666  
 app-20140225233124-0026  app-20140225233733-0056
 app-20140211024517-0607  app-20140214011513-0637  app-20140218194746-0667  
 app-20140225233133-0027  app-20140225233734-0057
 app-20140211024538-0608  app-20140214012151-0638  app-20140218194822-0668  
 app-20140225233147-0028  app-20140225233749-0058
 app-20140211193443-0609  app-20140214013134-0639  app-20140218212317-0669  
 app-20140225233208-0029  app-20140225233759-0059
 app-20140211195210-0610  app-20140214013332-0640  app-20140225180142-  
 app-20140225233215-0030  app-20140225233809-0060
 app-20140211213935-0611  app-20140214013642-0641  app-20140225180411-0001  
 app-20140225233224-0031  app-20140225233828-0061
 app-20140211214227-0612  app-20140214014246-0642  app-20140225180431-0002  
 app-20140225233232-0032  app-20140225234719-0062
 app-20140211215317-0613  app-20140214014607-0643  app-20140225180452-0003  
 app-20140225233239-0033  app-20140226032845-0063
 app-20140211224601-0614  app-20140214184943-0644  app-20140225180512-0004  
 app-20140225233320-0034  app-20140226033004-0064
 app-20140212022206-0615  app-20140214185118-0645  app-20140225180533-0005  
 app-20140225233328-0035  app-20140226033119-0065
 app-2014021206-0616  app-20140214185851-0646  app-20140225180553-0006  
 app-20140225233354-0036  app-2014022604-0066
 app-20140212022246-0617  app-20140214222856-0647  app-20140225181115-0007  
 app-20140225233402-0037  app-20140226033354-0067
 app-20140212043704-0618  app-20140214231312-0648  app-20140225181244-0008  
 app-20140225233409-0038  app-20140226033538-0068
 app-20140212043724-0619  app-20140214231434-0649  app-20140225182051-0009  
 app-20140225233416-0039  app-20140226033826-0069
 app-20140212043745-0620  app-20140214231542-0650  app-20140225183009-0010  
 app-20140225233426-0040  app-20140226034002-0070
 app-20140212044016-0621  app-20140214231616-0651  app-20140225184133-0011  
 app-20140225233432-0041  app-20140226034053-0071
 app-20140212044203-0622  app-20140214233016-0652  app-20140225184318-0012  
 app-20140225233439-0042  app-20140226034234-0072
 app-20140212044224-0623  app-20140214233037-0653  app-20140225184709-0013  
 app-20140225233447-0043  app-20140226034426-0073
 app-20140212045034-0624  app-20140218153242-0654  app-20140225184844-0014  
 app-20140225233526-0044  app-20140226034447-0074
 app-20140212045119-0625  app-20140218153341-0655  app-20140225190051-0015  
 app-20140225233534-0045
 app-20140212173310-0626  app-20140218153442-0656  app-20140225232516-0016  
 app-20140225233540-0046
 This problem is particularly bad if you have a whole bunch of fast jobs.
 What makes the problem worse is that any jars for jobs are downloaded 
 into the app-* folders, so the disk fills up particularly fast.
 I would like to propose two things:
 1) Spark should have a cleanup thread (or actor) which periodically removes 
 old app-* folders;   This 

[jira] [Commented] (SPARK-2670) FetchFailedException should be thrown when local fetch has failed

2014-07-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073539#comment-14073539
 ] 

Apache Spark commented on SPARK-2670:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/1578

 FetchFailedException should be thrown when local fetch has failed
 -

 Key: SPARK-2670
 URL: https://issues.apache.org/jira/browse/SPARK-2670
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Kousuke Saruta

 In BasicBlockFetchIterator, when a remote fetch has failed, a FetchResult 
 whose size is -1 is put into results.
 {code}
 case None => {
   logError("Could not get block(s) from " + cmId)
   for ((blockId, size) <- req.blocks) {
     results.put(new FetchResult(blockId, -1, null))
   }
 {code}
 The size -1 means the fetch failed, and BlockStoreShuffleFetcher#unpackBlock 
 throws FetchFailedException so that we can retry.
 But when a local fetch has failed, no failed FetchResult is put into results, 
 so we cannot retry that fetch.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2674) Add date and time types to inferSchema

2014-07-24 Thread Hossein Falaki (JIRA)
Hossein Falaki created SPARK-2674:
-

 Summary: Add date and time types to inferSchema
 Key: SPARK-2674
 URL: https://issues.apache.org/jira/browse/SPARK-2674
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.0.0
Reporter: Hossein Falaki


When I try inferSchema in PySpark on an RDD of dictionaries that contain a 
datetime.datetime object, I get the following exception:

{code}
Object of type 
java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id=Etc/UTC,offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=?,YEAR=2014,MONTH=3,WEEK_OF_YEAR=?,WEEK_OF_MONTH=?,DAY_OF_MONTH=22,DAY_OF_YEAR=?,DAY_OF_WEEK=?,DAY_OF_WEEK_IN_MONTH=?,AM_PM=0,HOUR=0,HOUR_OF_DAY=0,MINUTE=0,SECOND=0,MILLISECOND=4,ZONE_OFFSET=?,DST_OFFSET=?]
 cannot be used
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-2676) CLONE - LiveListenerBus should set higher capacity for its event queue

2014-07-24 Thread Zongheng Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zongheng Yang closed SPARK-2676.


Resolution: Duplicate

 CLONE - LiveListenerBus should set higher capacity for its event queue 
 ---

 Key: SPARK-2676
 URL: https://issues.apache.org/jira/browse/SPARK-2676
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.1
Reporter: Zongheng Yang
Assignee: Zongheng Yang





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2675) LiveListenerBus should set higher capacity for its event queue

2014-07-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073546#comment-14073546
 ] 

Apache Spark commented on SPARK-2675:
-

User 'concretevitamin' has created a pull request for this issue:
https://github.com/apache/spark/pull/1579

 LiveListenerBus should set higher capacity for its event queue 
 ---

 Key: SPARK-2675
 URL: https://issues.apache.org/jira/browse/SPARK-2675
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0, 1.0.1
Reporter: Zongheng Yang
Assignee: Zongheng Yang





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2671) BlockObjectWriter should create parent directory when the directory doesn't exist

2014-07-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073548#comment-14073548
 ] 

Apache Spark commented on SPARK-2671:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/1580

 BlockObjectWriter should create parent directory when the directory doesn't 
 exist
 -

 Key: SPARK-2671
 URL: https://issues.apache.org/jira/browse/SPARK-2671
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Kousuke Saruta
Priority: Minor

 BlockObjectWriter#open expects parent directory is present. 
 {code}
   override def open(): BlockObjectWriter = {
 fos = new FileOutputStream(file, true)
 ts = new TimeTrackingOutputStream(fos)
 channel = fos.getChannel()
 lastValidPosition = initialPosition
 bs = compressStream(new BufferedOutputStream(ts, bufferSize))
 objOut = serializer.newInstance().serializeStream(bs)
 initialized = true
 this
   }
 {code}
 Normally, the parent directory is created by DiskBlockManager#createLocalDirs 
 but, just in case, BlockObjectWriter#open should check the existence of the 
 directory and create the directory if the directory does not exist.
 I think, recoverable error should be recovered.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2037) yarn client mode doesn't support spark.yarn.max.executor.failures

2014-07-24 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-2037.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

 yarn client mode doesn't support spark.yarn.max.executor.failures
 -

 Key: SPARK-2037
 URL: https://issues.apache.org/jira/browse/SPARK-2037
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.0
Reporter: Thomas Graves
Assignee: Guoqiang Li
 Fix For: 1.1.0


 yarn client mode doesn't support the config spark.yarn.max.executor.failures. 
  We should investigate if we need it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2674) Add date and time types to inferSchema

2014-07-24 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2674:


Assignee: Davies Liu  (was: Michael Armbrust)

 Add date and time types to inferSchema
 --

 Key: SPARK-2674
 URL: https://issues.apache.org/jira/browse/SPARK-2674
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.0.0
Reporter: Hossein Falaki
Assignee: Davies Liu

 When I try inferSchema in PySpark on an RDD of dictionaries that contain a 
 datetime.datetime object, I get the following exception:
 {code}
 Object of type 
 java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id=Etc/UTC,offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=?,YEAR=2014,MONTH=3,WEEK_OF_YEAR=?,WEEK_OF_MONTH=?,DAY_OF_MONTH=22,DAY_OF_YEAR=?,DAY_OF_WEEK=?,DAY_OF_WEEK_IN_MONTH=?,AM_PM=0,HOUR=0,HOUR_OF_DAY=0,MINUTE=0,SECOND=0,MILLISECOND=4,ZONE_OFFSET=?,DST_OFFSET=?]
  cannot be used
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Assigned] (SPARK-2674) Add date and time types to inferSchema

2014-07-24 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust reassigned SPARK-2674:
---

Assignee: Michael Armbrust

 Add date and time types to inferSchema
 --

 Key: SPARK-2674
 URL: https://issues.apache.org/jira/browse/SPARK-2674
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.0.0
Reporter: Hossein Falaki
Assignee: Michael Armbrust

 When I try inferSchema in PySpark on an RDD of dictionaries that contain a 
 datetime.datetime object, I get the following exception:
 {code}
 Object of type 
 java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id=Etc/UTC,offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=?,YEAR=2014,MONTH=3,WEEK_OF_YEAR=?,WEEK_OF_MONTH=?,DAY_OF_MONTH=22,DAY_OF_YEAR=?,DAY_OF_WEEK=?,DAY_OF_WEEK_IN_MONTH=?,AM_PM=0,HOUR=0,HOUR_OF_DAY=0,MINUTE=0,SECOND=0,MILLISECOND=4,ZONE_OFFSET=?,DST_OFFSET=?]
  cannot be used
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2674) Add date and time types to inferSchema

2014-07-24 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2674:


Target Version/s: 1.1.0

 Add date and time types to inferSchema
 --

 Key: SPARK-2674
 URL: https://issues.apache.org/jira/browse/SPARK-2674
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 1.0.0
Reporter: Hossein Falaki
Assignee: Davies Liu

 When I try inferSchema in PySpark on an RDD of dictionaries that contain a 
 datetime.datetime object, I get the following exception:
 {code}
 Object of type 
 java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id=Etc/UTC,offset=0,dstSavings=0,useDaylight=false,transitions=0,lastRule=null],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=?,YEAR=2014,MONTH=3,WEEK_OF_YEAR=?,WEEK_OF_MONTH=?,DAY_OF_MONTH=22,DAY_OF_YEAR=?,DAY_OF_WEEK=?,DAY_OF_WEEK_IN_MONTH=?,AM_PM=0,HOUR=0,HOUR_OF_DAY=0,MINUTE=0,SECOND=0,MILLISECOND=4,ZONE_OFFSET=?,DST_OFFSET=?]
  cannot be used
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2387) Remove the stage barrier for better resource utilization

2014-07-24 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073597#comment-14073597
 ] 

Kay Ousterhout commented on SPARK-2387:
---

Have you done experiments to understand how much this improves performance?

With Hadoop MapReduce, I've seen this behavior significantly worsen performance 
for a few reasons.  Ultimately, the problem is that the reduce stage (the one 
that depends on the shuffle map stage) can't finish until all of the map tasks 
finish.  So, if there is a long map straggler, the reduce tasks can't finish 
anyway -- and now many more slots are hogged by the early reducers, preventing 
other jobs from making progress.  Even worse, if reduce tasks are launched 
before all map tasks have been launched, the early reducers keep map tasks from 
being launched, but can end up stopped waiting for input from mappers that 
haven't completed yet. (Although I didn't look closely at PR1328 so I'm not 
sure if the latter issue was explicitly prevented in your pull request.)

As a result of the above issues, I've heard that many places (I think Facebook, 
for example) disable this behavior in Hadoop.  So, we should make sure this 
will not hurt performance (and will significantly help!) before adding a lot of 
complexity to Spark in order to implement it.

 Remove the stage barrier for better resource utilization
 

 Key: SPARK-2387
 URL: https://issues.apache.org/jira/browse/SPARK-2387
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Rui Li

 DAGScheduler divides a Spark job into multiple stages according to RDD 
 dependencies. Whenever there’s a shuffle dependency, DAGScheduler creates a 
 shuffle map stage on the map side, and another stage depending on that stage.
 Currently, a downstream stage cannot start until all the stages it depends on 
 have finished. This barrier between stages leads to idle slots while waiting 
 for the last few upstream tasks to finish, and thus wastes cluster resources.
 Therefore we propose to remove the barrier and pre-start the reduce stage 
 once there're free slots. This can achieve better resource utilization and 
 improve the overall job performance, especially when there're lots of 
 executors granted to the application.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2387) Remove the stage barrier for better resource utilization

2014-07-24 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073597#comment-14073597
 ] 

Kay Ousterhout edited comment on SPARK-2387 at 7/24/14 8:23 PM:


Have you done experiments to understand how much this improves performance?

With Hadoop MapReduce, I've seen this behavior significantly worsen performance 
for a few reasons.  Ultimately, the problem is that the reduce stage (the one 
that depends on the shuffle map stage) can't finish until all of the map tasks 
finish.  So, if there is a long map straggler, the reduce tasks can't finish 
anyway -- and now many more slots are hogged by the early reducers, preventing 
other jobs from making progress.  Even worse, if reduce tasks are launched 
before all map tasks have been launched, the early reducers keep map tasks from 
being launched, but can end up stopped waiting for input from mappers that 
haven't completed yet. (Although it looks like your pull request is done in a 
way that tries to avoid the latter problem.)

As a result of the above issues, I've heard that many places (I think Facebook, 
for example) disable this behavior in Hadoop.  So, we should make sure this 
will not hurt performance (and will significantly help!) before adding a lot of 
complexity to Spark in order to implement it.


was (Author: kayousterhout):
Have you done experiments to understand how much this improves performance?

With Hadoop MapReduce, I've seen this behavior significantly worsen performance 
for a few reasons.  Ultimately, the problem is that the reduce stage (the one 
that depends on the shuffle map stage) can't finish until all of the map tasks 
finish.  So, if there is a long map straggler, the reduce tasks can't finish 
anyway -- and now many more slots are hogged by the early reducers, preventing 
other jobs from making progress.  Even worse, if reduce tasks are launched 
before all map tasks have been launched, the early reducers keep map tasks from 
being launched, but can end up stopped waiting for input from mappers that 
haven't completed yet. (Although I didn't look closely at PR1328 so I'm not 
sure if the latter issue was explicitly prevented in your pull request.)

As a result of the above issues, I've heard that many places (I think Facebook, 
for example) disable this behavior in Hadoop.  So, we should make sure this 
will not hurt performance (and will significantly help!) before adding a lot of 
complexity to Spark in order to implement it.

 Remove the stage barrier for better resource utilization
 

 Key: SPARK-2387
 URL: https://issues.apache.org/jira/browse/SPARK-2387
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Reporter: Rui Li

 DAGScheduler divides a Spark job into multiple stages according to RDD 
 dependencies. Whenever there’s a shuffle dependency, DAGScheduler creates a 
 shuffle map stage on the map side, and another stage depending on that stage.
 Currently, a downstream stage cannot start until all the stages it depends on 
 have finished. This barrier between stages leads to idle slots while waiting 
 for the last few upstream tasks to finish, and thus wastes cluster resources.
 Therefore we propose to remove the barrier and pre-start the reduce stage 
 once there're free slots. This can achieve better resource utilization and 
 improve the overall job performance, especially when there're lots of 
 executors granted to the application.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2250) show stage RDDs in UI

2014-07-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2250:
---

Assignee: Neville Li

 show stage RDDs in UI
 -

 Key: SPARK-2250
 URL: https://issues.apache.org/jira/browse/SPARK-2250
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Reporter: Neville Li
Assignee: Neville Li
Priority: Minor
 Fix For: 1.1.0


 RDDs of each stage can be accessed from StageInfo#rddInfos. It'd be nice to 
 show them in the UI.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2250) show stage RDDs in UI

2014-07-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2250.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1188
[https://github.com/apache/spark/pull/1188]

 show stage RDDs in UI
 -

 Key: SPARK-2250
 URL: https://issues.apache.org/jira/browse/SPARK-2250
 Project: Spark
  Issue Type: New Feature
  Components: Web UI
Reporter: Neville Li
Priority: Minor
 Fix For: 1.1.0


 RDDs of each stage can be accessed from StageInfo#rddInfos. It'd be nice to 
 show them in the UI.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2677) BasicBlockFetchIterator#next can be wait forever

2014-07-24 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-2677:
-

 Summary: BasicBlockFetchIterator#next can be wait forever
 Key: SPARK-2677
 URL: https://issues.apache.org/jira/browse/SPARK-2677
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Kousuke Saruta
Priority: Critical


In BasicBlockFetchIterator#next, it waits for a fetch result on results.take().

{code}
override def next(): (BlockId, Option[Iterator[Any]]) = {
  resultsGotten += 1
  val startFetchWait = System.currentTimeMillis()
  val result = results.take()
  val stopFetchWait = System.currentTimeMillis()
  _fetchWaitTime += (stopFetchWait - startFetchWait)
  if (!result.failed) bytesInFlight -= result.size
  while (!fetchRequests.isEmpty &&
    (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
    sendRequest(fetchRequests.dequeue())
  }
  (result.blockId, if (result.failed) None else Some(result.deserialize()))
}
{code}

But results is implemented as a LinkedBlockingQueue, so if the remote executor 
hangs up, the fetching Executor waits forever.
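One hedged sketch of a possible mitigation (not necessarily the eventual fix): 
replace the unbounded take() with a bounded poll(); fetchTimeoutSeconds is an 
illustrative name and value, not an existing setting.

{code}
import java.util.concurrent.{TimeUnit, TimeoutException}

// A bounded wait turns a hung remote executor into a visible failure instead
// of blocking next() forever.
val fetchTimeoutSeconds = 120L
val result = results.poll(fetchTimeoutSeconds, TimeUnit.SECONDS)
if (result == null) {
  throw new TimeoutException(
    "Timed out after " + fetchTimeoutSeconds + "s waiting for a fetch result")
}
{code}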




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2677) BasicBlockFetchIterator#next can wait forever

2014-07-24 Thread Kousuke Saruta (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-2677:
--

Summary: BasicBlockFetchIterator#next can wait forever  (was: 
BasicBlockFetchIterator#next can be wait forever)

 BasicBlockFetchIterator#next can wait forever
 -

 Key: SPARK-2677
 URL: https://issues.apache.org/jira/browse/SPARK-2677
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Kousuke Saruta
Priority: Critical

 In BasicBlockFetchIterator#next, it waits for a fetch result on results.take().
 {code}
 override def next(): (BlockId, Option[Iterator[Any]]) = {
   resultsGotten += 1
   val startFetchWait = System.currentTimeMillis()
   val result = results.take()
   val stopFetchWait = System.currentTimeMillis()
   _fetchWaitTime += (stopFetchWait - startFetchWait)
   if (!result.failed) bytesInFlight -= result.size
   while (!fetchRequests.isEmpty &&
     (bytesInFlight == 0 || bytesInFlight + fetchRequests.front.size <= maxBytesInFlight)) {
     sendRequest(fetchRequests.dequeue())
   }
   (result.blockId, if (result.failed) None else Some(result.deserialize()))
 }
 {code}
 But results is implemented as a LinkedBlockingQueue, so if the remote executor 
 hangs up, the fetching Executor waits forever.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-07-24 Thread koert kuipers (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073784#comment-14073784
 ] 

koert kuipers commented on SPARK-1855:
--

I think this makes sense. We have iterative queries that should be very quick. 
In case of machine failure I am OK if the query fails; we will simply repeat it. 
So I do not care about checkpointing to disk in this situation, but I do care 
about checkpointing to memory to cut my dependencies, which means they get 
garbage collected and cached RDDs get cleaned up.

 Provide memory-and-local-disk RDD checkpointing
 ---

 Key: SPARK-1855
 URL: https://issues.apache.org/jira/browse/SPARK-1855
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, Spark Core
Affects Versions: 1.0.0
Reporter: Xiangrui Meng

 Checkpointing is used to cut long lineage while maintaining fault tolerance. 
 The current implementation is HDFS-based. Using the BlockRDD we can create 
 in-memory-and-local-disk (with replication) checkpoints that are not as 
 reliable as HDFS-based solution but faster.
 It can help applications that require many iterations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2678) `Spark-submit` overrides user application options

2014-07-24 Thread Cheng Lian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Lian updated SPARK-2678:
--

Priority: Major  (was: Minor)

 `Spark-submit` overrides user application options
 -

 Key: SPARK-2678
 URL: https://issues.apache.org/jira/browse/SPARK-2678
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.0.1, 1.0.2
Reporter: Cheng Lian

 Here is an example:
 {code}
 ./bin/spark-submit --class Foo some.jar --help
 {code}
 Since {{--help}} appears after the primary resource (i.e. {{some.jar}}), it 
 should be recognized as a user application option. But it's actually 
 overridden by {{spark-submit}} and shows the {{spark-submit}} help message.
 When directly invoking {{spark-submit}}, the constraints here are:
 # Options before primary resource should be recognized as {{spark-submit}} 
 options
 # Options after primary resource should be recognized as user application 
 options
 The tricky part is how to handle scripts like {{spark-shell}} that delegate to 
 {{spark-submit}}. These scripts allow users to specify both {{spark-submit}} 
 options like {{--master}} and user-defined application options together. For 
 example, say we'd like to write a new script {{start-thriftserver.sh}} to 
 start the Hive Thrift server, basically we may do this:
 {code}
 $SPARK_HOME/bin/spark-submit --class 
 org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark-internal $@
 {code}
 Then users may call this script like:
 {code}
 ./sbin/start-thriftserver.sh --master spark://some-host:7077 --hiveconf 
 key=value
 {code}
 Notice that all options are captured by {{$@}}. If we put it before 
 {{spark-internal}}, they are all recognized as {{spark-submit}} options, thus 
 {{--hiveconf}} won't be passed to {{HiveThriftServer2}}; if we put it after 
 {{spark-internal}}, they *should* all be recognized as options of 
 {{HiveThriftServer2}}, but because of this bug, {{--master}} is still 
 recognized as a {{spark-submit}} option, which happens to give the right behavior.
 Although currently all scripts using {{spark-submit}} work correctly, we 
 should still fix this bug, because it causes option name collisions between 
 {{spark-submit}} and user applications, and every time we add a new option to 
 {{spark-submit}}, some existing user applications may break. However, solving 
 this bug may cause some incompatible changes.
 The suggested solution here is using {{--}} as separator of {{spark-submit}} 
 options and user application options. For the Hive Thrift server example 
 above, user should call it in this way:
 {code}
 ./sbin/start-thriftserver.sh --master spark://some-host:7077 -- --hiveconf 
 key=value
 {code}
 And {{SparkSubmitArguments}} should be responsible for splitting two sets of 
 options and pass them correctly.
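 A minimal sketch of the proposed splitting (illustrative only, not the actual 
 SparkSubmitArguments change):
 {code}
 // Everything before "--" goes to spark-submit; everything after it is passed
 // to the user application untouched.
 def splitArgs(args: Array[String]): (Seq[String], Seq[String]) = {
   val (submitArgs, rest) = args.span(_ != "--")
   (submitArgs.toSeq, rest.drop(1).toSeq)  // drop the "--" itself if present
 }

 // splitArgs(Array("--master", "spark://some-host:7077", "--", "--hiveconf", "key=value"))
 //   == (Seq("--master", "spark://some-host:7077"), Seq("--hiveconf", "key=value"))
 {code}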



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2679) Ser/De for Double to enable calling Java API from python in MLlib

2014-07-24 Thread Doris Xin (JIRA)
Doris Xin created SPARK-2679:


 Summary: Ser/De for Double to enable calling Java API from python 
in MLlib
 Key: SPARK-2679
 URL: https://issues.apache.org/jira/browse/SPARK-2679
 Project: Spark
  Issue Type: Sub-task
Reporter: Doris Xin


In order to enable Java/Scala APIs to be reused in the Python implementation of 
RandomRDD and Correlations, we need a set of ser/de for the type Double in 
_common.py and PythonMLLibAPI.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2679) Ser/De for Double to enable calling Java API from python in MLlib

2014-07-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073833#comment-14073833
 ] 

Apache Spark commented on SPARK-2679:
-

User 'dorx' has created a pull request for this issue:
https://github.com/apache/spark/pull/1581

 Ser/De for Double to enable calling Java API from python in MLlib
 -

 Key: SPARK-2679
 URL: https://issues.apache.org/jira/browse/SPARK-2679
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Doris Xin

 In order to enable Java/Scala APIs to be reused in the Python implementation 
 of RandomRDD and Correlations, we need a set of ser/de for the type Double in 
 _common.py and PythonMLLibAPI.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2298) Show stage attempt in UI

2014-07-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2298:
---

Priority: Critical  (was: Major)

 Show stage attempt in UI
 

 Key: SPARK-2298
 URL: https://issues.apache.org/jira/browse/SPARK-2298
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Reporter: Reynold Xin
Assignee: Masayoshi TSUZUKI
Priority: Critical
 Attachments: Screen Shot 2014-06-25 at 4.54.46 PM.png


 We should add a column to the web ui to show stage attempt id. Then tasks 
 should be grouped by (stageId, stageAttempt) tuple.
 When a stage is resubmitted (e.g. due to fetch failures), we should get a 
 different entry in the web ui and tasks for the resubmission go there.
 See the attached screenshot for the confusing status quo. We currently show 
 the same stage entry twice, and then tasks appear in both. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2515) Hypothesis testing

2014-07-24 Thread Doris Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073879#comment-14073879
 ] 

Doris Xin commented on SPARK-2515:
--

Here's the proposed API for chi-squared tests (lives in 
org.apache.spark.mllib.stat.Statistics):

{code}
def chiSquare(X: RDD[Vector], method: String = "pearson"): ChiSquareTestResult
def chiSquare(x: RDD[Double], y: RDD[Double], method: String = "pearson"): ChiSquareTestResult
{code}

where ChiSquareTestResult <: TestResult looks like:

{code}
pValue: Double
df: Array[Int] // normally a single value, but needs to be more than one for ANOVA
statistic: Double
ChiSquareSummary <: Summary
{code}

So a couple of points for discussion:
1. Of the many variants of the chi-squared test, what methods in addition to 
pearson do we want to support (hopefully based on popular demand)? 
http://en.wikipedia.org/wiki/Chi-squared_test
2. What special fields should ChiSquareSummary have?
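Assuming the proposed (not yet existing) signatures above, usage might look 
roughly like this; sc is assumed to be an existing SparkContext, and the 
Statistics import location follows the proposal:

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics  // hypothetical per the proposal

val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 2.0, 3.0),
  Vectors.dense(4.0, 5.0, 6.0)))

val result = Statistics.chiSquare(observations)  // method defaults to "pearson"
println("p-value: " + result.pValue + ", statistic: " + result.statistic)
{code}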

 Hypothesis testing
 --

 Key: SPARK-2515
 URL: https://issues.apache.org/jira/browse/SPARK-2515
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Xiangrui Meng
Assignee: Doris Xin

 Support common statistical tests in Spark MLlib.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2464) Twitter Receiver does not stop correctly when streamingContext.stop is called

2014-07-24 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-2464.
--

Resolution: Fixed

 Twitter Receiver does not stop correctly when streamingContext.stop is called
 -

 Key: SPARK-2464
 URL: https://issues.apache.org/jira/browse/SPARK-2464
 Project: Spark
  Issue Type: Bug
  Components: Streaming
Affects Versions: 1.0.0, 1.0.1
Reporter: Tathagata Das
Assignee: Tathagata Das
Priority: Critical





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2014) Make PySpark store RDDs in MEMORY_ONLY_SER with compression by default

2014-07-24 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-2014.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

 Make PySpark store RDDs in MEMORY_ONLY_SER with compression by default
 --

 Key: SPARK-2014
 URL: https://issues.apache.org/jira/browse/SPARK-2014
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Reporter: Matei Zaharia
Assignee: Prashant Sharma
 Fix For: 1.1.0


 Since the data is serialized on the Python side, there's not much point in 
 keeping it as byte arrays in Java, or even in skipping compression. We should 
 make cache() in PySpark use MEMORY_ONLY_SER and turn on spark.rdd.compress 
 for it.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1044) Default spark logs location in EC2 AMI leads to out-of-disk space pretty soon

2014-07-24 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073930#comment-14073930
 ] 

Andrew Ash commented on SPARK-1044:
---

Filling up the work dir could be alleviated by fixing 
https://issues.apache.org/jira/browse/SPARK-1860 so we could enable worker dir 
cleanup automatically.  If we had automatic worker dir cleanup, would you still 
want to move the work directory to somewhere else?

 Default spark logs location in EC2 AMI leads to out-of-disk space pretty soon
 -

 Key: SPARK-1044
 URL: https://issues.apache.org/jira/browse/SPARK-1044
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Tathagata Das
Priority: Minor

 The default log location is SPARK_HOME/work/ and this leads to disk space 
 running out pretty quickly. The spark-ec2 scripts should configure the 
 cluster to automatically set the logging directory to /mnt/spark-work/ or 
 something like that on the mounted disks. The SPARK_HOME/work may also be 
 symlinked to that directory to maintain the existing setup.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-786) Clean up old work directories in standalone worker

2014-07-24 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073932#comment-14073932
 ] 

Andrew Ash commented on SPARK-786:
--

Agreed.  With SPARK-1860 we could re-enable the features from that PR by 
default and be good here (it was disabled after it had negative effects with 
long-running transactions).  I think this ticket can be closed as a dupe of 
that one.

 Clean up old work directories in standalone worker
 --

 Key: SPARK-786
 URL: https://issues.apache.org/jira/browse/SPARK-786
 Project: Spark
  Issue Type: New Feature
  Components: Deploy
Affects Versions: 0.7.2
Reporter: Matei Zaharia

 We should add a setting to clean old work directories after X days. 
 Otherwise, the directory gets filled forever with shuffle files and logs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1044) Default spark logs location in EC2 AMI leads to out-of-disk space pretty soon

2014-07-24 Thread Allan Douglas R. de Oliveira (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073938#comment-14073938
 ] 

Allan Douglas R. de Oliveira commented on SPARK-1044:
-

I think it is still a good idea even if the automatic cleanup is implemented. 
One large job or many small jobs can fill many gigabytes before the cleanup can 
kick in.

 Default spark logs location in EC2 AMI leads to out-of-disk space pretty soon
 -

 Key: SPARK-1044
 URL: https://issues.apache.org/jira/browse/SPARK-1044
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Tathagata Das
Priority: Minor

 The default log location is SPARK_HOME/work/ and this leads to disk space 
 running out pretty quickly. The spark-ec2 scripts should configure the 
 cluster to automatically set the logging directory to /mnt/spark-work/ or 
 something like that on the mounted disks. The SPARK_HOME/work may also be 
 symlinked to that directory to maintain the existing setup.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1030) unneeded file required when running pyspark program using yarn-client

2014-07-24 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-1030.
---

   Resolution: Fixed
Fix Version/s: 1.0.0

Closing this now, since it was addressed as part of Spark 1.0's PySpark on YARN 
patches (including SPARK-1004).

 unneeded file required when running pyspark program using yarn-client
 -

 Key: SPARK-1030
 URL: https://issues.apache.org/jira/browse/SPARK-1030
 Project: Spark
  Issue Type: Bug
  Components: Deploy, PySpark, YARN
Affects Versions: 0.8.1
Reporter: Diana Carroll
Assignee: Josh Rosen
 Fix For: 1.0.0


 I can successfully run a pyspark program using the yarn-client master using 
 the following command:
 {code}
 SPARK_JAR=$SPARK_HOME/assembly/target/scala-2.9.3/spark-assembly_2.9.3-0.8.1-incubating-hadoop2.2.0.jar
  \
 SPARK_YARN_APP_JAR=~/testdata.txt pyspark \
 test1.py
 {code}
 However, the SPARK_YARN_APP_JAR doesn't make any sense; it's a Python 
 program, and therefore there's no JAR.  If I don't set the value, or if I set 
 the value to a non-existent file, Spark gives me an error message.  
 {code}
 py4j.protocol.Py4JJavaError: An error occurred while calling 
 None.org.apache.spark.api.java.JavaSparkContext.
 : org.apache.spark.SparkException: env SPARK_YARN_APP_JAR is not set
   at 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:46)
 {code}
 or
 {code}
 py4j.protocol.Py4JJavaError: An error occurred while calling 
 None.org.apache.spark.api.java.JavaSparkContext.
 : java.io.FileNotFoundException: File file:dummy.txt does not exist
   at 
 org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:520)
 {code}
 My program is very simple:
 {code}
 from pyspark import SparkContext

 def main():
     sc = SparkContext("yarn-client", "Simple App")
     logData = sc.textFile("hdfs://localhost/user/training/weblogs/2013-09-15.log")
     numjpgs = logData.filter(lambda s: '.jpg' in s).count()
     print "Number of JPG requests: " + str(numjpgs)
 {code}
 Although it reads the SPARK_YARN_APP_JAR file, it doesn't use the file at 
 all; I can point it at anything, as long as it's a valid, accessible file, 
 and it works the same.
 Although there's an obvious workaround for this bug, it's high priority from 
 my perspective because I'm working on a course to teach people how to do 
 this, and it's really hard to explain why this variable is needed!



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2680) Lower spark.shuffle.memoryFraction to 0.2 by default

2014-07-24 Thread Matei Zaharia (JIRA)
Matei Zaharia created SPARK-2680:


 Summary: Lower spark.shuffle.memoryFraction to 0.2 by default
 Key: SPARK-2680
 URL: https://issues.apache.org/jira/browse/SPARK-2680
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Matei Zaharia
Priority: Minor
 Fix For: 1.1.0


Seems like it's good to be more conservative on this, in particular to try to 
fit all the data in the young gen. People can always increase it for 
performance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2529) Clean the closure in foreach and foreachPartition

2014-07-24 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2529:
-

Target Version/s: 1.1.0, 1.0.3  (was: 1.1.0, 1.0.2)

 Clean the closure in foreach and foreachPartition
 -

 Key: SPARK-2529
 URL: https://issues.apache.org/jira/browse/SPARK-2529
 Project: Spark
  Issue Type: Bug
Reporter: Reynold Xin

 Somehow we didn't clean the closure for foreach and foreachPartition. Should 
 do that.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2531) Make BroadcastNestedLoopJoin take into account a BuildSide

2014-07-24 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2531:
-

Target Version/s: 1.1.0, 1.0.3  (was: 1.1.0, 1.0.2)

 Make BroadcastNestedLoopJoin take into account a BuildSide
 --

 Key: SPARK-2531
 URL: https://issues.apache.org/jira/browse/SPARK-2531
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.1
Reporter: Zongheng Yang
Assignee: Zongheng Yang
Priority: Minor





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2548) JavaRecoverableWordCount is missing

2014-07-24 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2548:
-

Target Version/s: 1.1.0, 0.9.3, 1.0.3  (was: 1.1.0, 1.0.2, 0.9.3)

 JavaRecoverableWordCount is missing
 ---

 Key: SPARK-2548
 URL: https://issues.apache.org/jira/browse/SPARK-2548
 Project: Spark
  Issue Type: Bug
  Components: Documentation, Streaming
Affects Versions: 0.9.2, 1.0.1
Reporter: Xiangrui Meng
Priority: Minor

 JavaRecoverableWordCount is mentioned in the docs but is not in the codebase. We 
 need to rewrite the example because the code was lost during the migration 
 from spark/spark-incubating to apache/spark.
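
 Until the Java example is restored, here is a rough Scala sketch of the recoverable word count pattern it describes; the checkpoint path, host, and port are placeholders, not values from the lost example:
 {code}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object RecoverableWordCountSketch {
  val checkpointDir = "hdfs:///tmp/recoverable-wordcount"   // placeholder path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("RecoverableWordCount")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint(checkpointDir)
    val lines = ssc.socketTextStream("localhost", 9999)     // placeholder source
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Rebuild the context from the checkpoint after a driver restart, or create it fresh.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
 {code}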



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2506) In yarn-cluster mode, ApplicationMaster does not clean up correctly at the end of the job if users call sc.stop manually

2014-07-24 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2506:
-

Target Version/s: 1.0.3  (was: 1.0.2)

 In yarn-cluster mode, ApplicationMaster does not clean up correctly at the 
 end of the job if users call sc.stop manually
 

 Key: SPARK-2506
 URL: https://issues.apache.org/jira/browse/SPARK-2506
 Project: Spark
  Issue Type: Bug
  Components: Block Manager, Spark Core, YARN
Affects Versions: 1.0.1
Reporter: uncleGen
Priority: Minor

 When I call sc.stop manually, some strange errors appear:
 1. In the driver log:
  INFO [Thread-116] YarnAllocationHandler: Completed container 
 container_1400565786114_79510_01_41 (state: COMPLETE, exit status: 0)
 WARN [Thread-4] BlockManagerMaster: Error sending message to 
 BlockManagerMaster in 3 attempts
 akka.pattern.AskTimeoutException: 
 Recipient[Actor[akka://spark/user/BlockManagerMaster#1994513092]] had already 
 been terminated.
   at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:134)
   at 
 org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:236)
   at 
 org.apache.spark.storage.BlockManagerMaster.tell(BlockManagerMaster.scala:216)
   at 
 org.apache.spark.storage.BlockManagerMaster.stop(BlockManagerMaster.scala:208)
   at org.apache.spark.SparkEnv.stop(SparkEnv.scala:86)
   at org.apache.spark.SparkContext.stop(SparkContext.scala:993)
   at TestWeibo$.main(TestWeibo.scala:46)
   at TestWeibo.main(TestWeibo.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at 
 org.apache.spark.deploy.yarn.ApplicationMaster$$anon$1.run(ApplicationMaster.scala:192)
  INFO [Thread-116] ApplicationMaster: Allocating 1 containers to make up for 
 (potentially) lost containers
 INFO [Thread-116] YarnAllocationHandler: Will Allocate 1 executor containers, 
 each with 9600 memory
 2. In the executor log:
 WARN [Connection manager future execution context-13] BlockManagerMaster: 
 Error sending message to BlockManagerMaster in 1 attempts
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
   at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
   at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
   at scala.concurrent.Await$.result(package.scala:107)
   at 
 org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:237)
   at 
 org.apache.spark.storage.BlockManagerMaster.sendHeartBeat(BlockManagerMaster.scala:51)
   at 
 org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$heartBeat(BlockManager.scala:113)
   at 
 org.apache.spark.storage.BlockManager$$anonfun$initialize$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(BlockManager.scala:158)
   at org.apache.spark.util.Utils$.tryOrExit(Utils.scala:790)
   at 
 org.apache.spark.storage.BlockManager$$anonfun$initialize$1.apply$mcV$sp(BlockManager.scala:158)
   at akka.actor.Scheduler$$anon$9.run(Scheduler.scala:80)
   at 
 akka.actor.LightArrayRevolverScheduler$$anon$3$$anon$2.run(Scheduler.scala:241)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)
 WARN [Connection manager future execution context-13] BlockManagerMaster: 
 Error sending message to BlockManagerMaster in 2 attempts
 java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
   at 
 scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
   at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
   at 
 scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
   at scala.concurrent.Await$.result(package.scala:107)
   at 
 org.apache.spark.storage.BlockManagerMaster.askDriverWithReply(BlockManagerMaster.scala:237)
   at 
 org.apache.spark.storage.BlockManagerMaster.sendHeartBeat(BlockManagerMaster.scala:51)
   at 
 

[jira] [Updated] (SPARK-1667) Jobs never finish successfully once bucket file missing occurred

2014-07-24 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-1667:
-

Target Version/s: 1.1.0, 1.0.3  (was: 1.1.0, 1.0.2)

 Jobs never finish successfully once bucket file missing occurred
 

 Key: SPARK-1667
 URL: https://issues.apache.org/jira/browse/SPARK-1667
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 1.0.0
Reporter: Kousuke Saruta

 When a job executes a shuffle, bucket files are created in a temporary directory 
 (named like spark-local-*).
 When those bucket files go missing, caused by disk failure or any other reason, jobs 
 can no longer execute a shuffle that has the same shuffle id as the missing bucket files.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2558) Mention --queue argument in YARN documentation

2014-07-24 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2558:
-

Target Version/s: 1.1.0, 1.0.3  (was: 1.1.0, 1.0.2)

 Mention --queue argument in YARN documentation 
 ---

 Key: SPARK-2558
 URL: https://issues.apache.org/jira/browse/SPARK-2558
 Project: Spark
  Issue Type: Documentation
  Components: YARN
Reporter: Matei Zaharia
Priority: Trivial
  Labels: Starter

 The docs about it went away when we updated the page to spark-submit.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2576) slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark QL query on HDFS CSV file

2014-07-24 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2576:
-

Target Version/s: 1.1.0, 1.0.3  (was: 1.1.0, 1.0.2)

 slave node throws NoClassDefFoundError $line11.$read$ when executing a Spark 
 QL query on HDFS CSV file
 --

 Key: SPARK-2576
 URL: https://issues.apache.org/jira/browse/SPARK-2576
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 1.0.1
 Environment: One Mesos 0.19 master without zookeeper and 4 mesos 
 slaves. 
 JDK 1.7.51 and Scala 2.10.4 on all nodes. 
 HDFS from CDH5.0.3
 Spark version: I tried both with the pre-built CDH5 spark package available 
 from http://spark.apache.org/downloads.html and by packaging spark with sbt 
 0.13.2, JDK 1.7.51 and scala 2.10.4 as explained here 
 http://mesosphere.io/learn/run-spark-on-mesos/
 All nodes are running Debian 3.2.51-1 x86_64 GNU/Linux and have 
Reporter: Svend Vanderveken
Assignee: Yin Huai
Priority: Blocker

 Execution of a SQL query against HDFS systematically throws a class-not-found 
 exception on the slave nodes.
 (this was originally reported on the user list: 
 http://apache-spark-user-list.1001560.n3.nabble.com/spark1-0-1-spark-sql-error-java-lang-NoClassDefFoundError-Could-not-initialize-class-line11-read-tc10135.html)
 Sample code (ran from spark-shell): 
 {code}
 val sqlContext = new org.apache.spark.sql.SQLContext(sc)
 import sqlContext.createSchemaRDD
 case class Car(timestamp: Long, objectid: String, isGreen: Boolean)
 // I get the same error when pointing to the folder hdfs://vm28:8020/test/cardata
 val data = sc.textFile("hdfs://vm28:8020/test/cardata/part-0")
 val cars = data.map(_.split(",")).map(ar => Car(ar(0).toLong, ar(1), ar(2).toBoolean))
 cars.registerAsTable("mcars")
 val allgreens = sqlContext.sql("SELECT objectid from mcars where isGreen = true")
 allgreens.collect.take(10).foreach(println)
 {code}
 Stack trace on the slave nodes: 
 {code}
 I0716 13:01:16.215158 13631 exec.cpp:131] Version: 0.19.0
 I0716 13:01:16.219285 13656 exec.cpp:205] Executor registered on slave 
 20140714-142853-485682442-5050-25487-2
 14/07/16 13:01:16 INFO MesosExecutorBackend: Registered with Mesos as 
 executor ID 20140714-142853-485682442-5050-25487-2
 14/07/16 13:01:16 INFO SecurityManager: Changing view acls to: 
 mesos,mnubohadoop
 14/07/16 13:01:16 INFO SecurityManager: SecurityManager: authentication 
 disabled; ui acls disabled; users with view permissions: Set(mesos, 
 mnubohadoop)
 14/07/16 13:01:17 INFO Slf4jLogger: Slf4jLogger started
 14/07/16 13:01:17 INFO Remoting: Starting remoting
 14/07/16 13:01:17 INFO Remoting: Remoting started; listening on addresses 
 :[akka.tcp://spark@vm23:38230]
 14/07/16 13:01:17 INFO Remoting: Remoting now listens on addresses: 
 [akka.tcp://spark@vm23:38230]
 14/07/16 13:01:17 INFO SparkEnv: Connecting to MapOutputTracker: 
 akka.tcp://spark@vm28:41632/user/MapOutputTracker
 14/07/16 13:01:17 INFO SparkEnv: Connecting to BlockManagerMaster: 
 akka.tcp://spark@vm28:41632/user/BlockManagerMaster
 14/07/16 13:01:17 INFO DiskBlockManager: Created local directory at 
 /tmp/spark-local-20140716130117-8ea0
 14/07/16 13:01:17 INFO MemoryStore: MemoryStore started with capacity 294.9 
 MB.
 14/07/16 13:01:17 INFO ConnectionManager: Bound socket to port 44501 with id 
 = ConnectionManagerId(vm23-hulk-priv.mtl.mnubo.com,44501)
 14/07/16 13:01:17 INFO BlockManagerMaster: Trying to register BlockManager
 14/07/16 13:01:17 INFO BlockManagerMaster: Registered BlockManager
 14/07/16 13:01:17 INFO HttpFileServer: HTTP File server directory is 
 /tmp/spark-ccf6f36c-2541-4a25-8fe4-bb4ba00ee633
 14/07/16 13:01:17 INFO HttpServer: Starting HTTP Server
 14/07/16 13:01:18 INFO Executor: Using REPL class URI: http://vm28:33973
 14/07/16 13:01:18 INFO Executor: Running task ID 2
 14/07/16 13:01:18 INFO HttpBroadcast: Started reading broadcast variable 0
 14/07/16 13:01:18 INFO MemoryStore: ensureFreeSpace(125590) called with 
 curMem=0, maxMem=309225062
 14/07/16 13:01:18 INFO MemoryStore: Block broadcast_0 stored as values to 
 memory (estimated size 122.6 KB, free 294.8 MB)
 14/07/16 13:01:18 INFO HttpBroadcast: Reading broadcast variable 0 took 
 0.294602722 s
 14/07/16 13:01:19 INFO HadoopRDD: Input split: 
 hdfs://vm28:8020/test/cardata/part-0:23960450+23960451
 I0716 13:01:19.905113 13657 exec.cpp:378] Executor asked to shutdown
 14/07/16 13:01:20 ERROR Executor: Exception in task ID 2
 java.lang.NoClassDefFoundError: $line11/$read$
 at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19)
 at $line12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(console:19)
 at 

[jira] [Updated] (SPARK-2425) Standalone Master is too aggressive in removing Applications

2014-07-24 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2425:
-

Target Version/s: 1.0.3  (was: 1.0.2)

 Standalone Master is too aggressive in removing Applications
 

 Key: SPARK-2425
 URL: https://issues.apache.org/jira/browse/SPARK-2425
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Mark Hamstra
Assignee: Mark Hamstra

 When standalone Executors trying to run a particular Application fail a 
 cumulative ApplicationState.MAX_NUM_RETRY times, the Master will remove the 
 Application.  This will be true even if there actually are a number of 
 Executors that are successfully running the Application.  This makes 
 long-running standalone-mode Applications in particular unnecessarily 
 vulnerable to limited failures in the cluster -- e.g., a single bad node on 
 which Executors repeatedly fail for any reason can prevent an Application 
 from starting or can result in a running Application being removed even 
 though it could continue to run successfully (just not making use of all 
 potential Workers and Executors.) 
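
 A self-contained sketch of the policy difference being described; this is illustrative only, not the Master's actual code, and the names and retry limit below are assumptions:
 {code}
// Illustrative sketch of the removal policy described above; not the actual Master code.
case class AppStatus(retryCount: Int, runningExecutors: Int)

val MaxNumRetry = 10   // assumed stand-in for ApplicationState.MAX_NUM_RETRY

// Current behavior (per this ticket): a cumulative retry counter alone triggers removal.
def removeToday(app: AppStatus): Boolean =
  app.retryCount >= MaxNumRetry

// A less aggressive policy would also require that no executors are running successfully.
def removeOnlyIfDead(app: AppStatus): Boolean =
  app.retryCount >= MaxNumRetry && app.runningExecutors == 0

// One flaky node accumulates failures while 20 executors keep running the app.
val app = AppStatus(retryCount = 12, runningExecutors = 20)
assert(removeToday(app))        // application gets removed today
assert(!removeOnlyIfDead(app))  // the safer policy would keep it alive
 {code}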



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2541) Standalone mode can't access secure HDFS anymore

2014-07-24 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das updated SPARK-2541:
-

Target Version/s: 1.1.0, 1.0.3  (was: 1.1.0, 1.0.2)

 Standalone mode can't access secure HDFS anymore
 

 Key: SPARK-2541
 URL: https://issues.apache.org/jira/browse/SPARK-2541
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.0.0, 1.0.1
Reporter: Thomas Graves

 In Spark 0.9.x you could access secure HDFS from the standalone deploy mode; that 
 no longer works in 1.x.
 It looks like the issue is in SparkHadoopUtil.runAsSparkUser.  Previously it 
 wouldn't do the doAs if the currentUser == user.  Not sure how it behaves 
 when the daemons run as a super user but SPARK_USER is set to someone else.
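
 For reference, a rough sketch of the 0.9.x-style guard described above, written against the standard Hadoop UserGroupInformation API; the helper name is made up, and this is not the actual SparkHadoopUtil code:
 {code}
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

// Illustrative helper, not SparkHadoopUtil itself: skip the doAs when the current
// UGI user already matches, so Kerberos credentials remain usable for HDFS access.
def runAsUser(user: String)(body: => Unit): Unit = {
  val current = UserGroupInformation.getCurrentUser
  if (current.getShortUserName == user) {
    body
  } else {
    val ugi = UserGroupInformation.createRemoteUser(user)
    ugi.doAs(new PrivilegedExceptionAction[Unit] {
      override def run(): Unit = body
    })
  }
}
 {code}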



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2529) Clean the closure in foreach and foreachPartition

2014-07-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14073964#comment-14073964
 ] 

Apache Spark commented on SPARK-2529:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/1583

 Clean the closure in foreach and foreachPartition
 -

 Key: SPARK-2529
 URL: https://issues.apache.org/jira/browse/SPARK-2529
 Project: Spark
  Issue Type: Bug
Reporter: Reynold Xin

 Somehow we didn't clean the closure for foreach and foreachPartition. Should 
 do that.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2668) Support log4j log to yarn container log directory

2014-07-24 Thread Peng Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Zhang updated SPARK-2668:
--

Affects Version/s: 1.0.0

 Support log4j log to yarn container log directory
 -

 Key: SPARK-2668
 URL: https://issues.apache.org/jira/browse/SPARK-2668
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.0
Reporter: Peng Zhang

 Assign the value of the YARN container log directory to the Java option 
 spark.yarn.log.dir, so that a user-defined log4j.properties can reference this 
 value and write logs to the YARN container's log directory.
 Otherwise, a user-defined file appender will log to the CWD, and those files will 
 neither be displayed on the YARN UI nor aggregated to the HDFS log directory 
 after the job finishes.
 User-defined log4j.properties reference example:
 {code}
 log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2668) Add variable of yarn log directory for reference from the log4j configuration

2014-07-24 Thread Peng Zhang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2668?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Peng Zhang updated SPARK-2668:
--

Description: 
Assign value of yarn container log directory to java opts spark.yarn.log.dir, 
So user defined log4j.properties can reference this value and write log to YARN 
container's log directory.
Otherwise, user defined file appender will only write to container's CWD, and 
log files in CWD will not be displayed on YARN UI,and either cannot be 
aggregated to HDFS log directory after job finished.

User defined log4j.properties reference example:
{code}
log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log
{code}



  was:
Assign value of yarn container log directory to java opts spark.yarn.log.dir, 
So user defined log4j.properties can reference this value and write log to YARN 
container directory.
Otherwise, user defined file appender will only write to container's CWD, and 
log files in CWD will not be displayed on YARN UI,and either cannot be 
aggregated to HDFS log directory after job finished.

User defined log4j.properties reference example:
{code}
log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log
{code}




 Add variable of yarn log directory for reference from the log4j configuration
 -

 Key: SPARK-2668
 URL: https://issues.apache.org/jira/browse/SPARK-2668
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 1.0.0
Reporter: Peng Zhang

 Assign the value of the YARN container log directory to the Java option 
 spark.yarn.log.dir, so that a user-defined log4j.properties can reference this 
 value and write logs to the YARN container's log directory.
 Otherwise, a user-defined file appender will only write to the container's CWD, and 
 log files in the CWD will neither be displayed on the YARN UI nor aggregated to the 
 HDFS log directory after the job finishes.
 User-defined log4j.properties reference example:
 {code}
 log4j.appender.rolling_file.File = ${spark.yarn.log.dir}/spark.log
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2681) With low probability, the Spark inexplicable hang

2014-07-24 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14074031#comment-14074031
 ] 

Patrick Wendell commented on SPARK-2681:


Can you do a jstack of the executor when this happens?

 With low probability, the Spark inexplicable hang
 -

 Key: SPARK-2681
 URL: https://issues.apache.org/jira/browse/SPARK-2681
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Guoqiang Li
Priority: Blocker

 executor log :
 {noformat}
 14/07/24 22:56:52 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 53628
 14/07/24 22:56:52 INFO executor.Executor: Running task ID 53628
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_3 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_18 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_16 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_19 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_20 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_21 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_22 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_3 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_18 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_16 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_19 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_20 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_21 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_22 locally
 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Updating epoch to 236 
 and clearing cache
 14/07/24 22:56:52 INFO spark.CacheManager: Partition rdd_51_83 not found, 
 computing it
 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Don't have map outputs 
 for shuffle 9, fetching them
 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker 
 actor = 
 Actor[akka.tcp://spark@tuan202:49488/user/MapOutputTracker#-1031481395]
 14/07/24 22:56:53 INFO spark.MapOutputTrackerWorker: Got the output locations
 14/07/24 22:56:53 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 
 50331648, targetRequestSize: 10066329
 14/07/24 22:56:53 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1024 
 non-empty blocks out of 1024 blocks
 14/07/24 22:56:53 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 58 remote 
 fetches in 8 ms
 14/07/24 22:56:55 INFO storage.MemoryStore: ensureFreeSpace(28728) called 
 with curMem=920109320, maxMem=4322230272
 14/07/24 22:56:55 INFO storage.MemoryStore: Block rdd_51_83 stored as values 
 to memory (estimated size 28.1 KB, free 3.2 GB)
 14/07/24 22:56:55 INFO storage.BlockManagerMaster: Updated info of block 
 rdd_51_83
 14/07/24 22:56:55 INFO spark.CacheManager: Partition rdd_189_83 not found, 
 computing it
 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Don't have map outputs 
 for shuffle 28, fetching them
 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker 
 actor = 
 Actor[akka.tcp://spark@tuan202:49488/user/MapOutputTracker#-1031481395]
 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Got the output locations
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 
 50331648, targetRequestSize: 10066329
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty 
 blocks out of 1024 blocks
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 1 remote 
 fetches in 0 ms
 14/07/24 22:56:55 INFO spark.CacheManager: Partition rdd_50_83 not found, 
 computing it
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 
 50331648, targetRequestSize: 10066329
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1024 
 non-empty blocks out of 1024 blocks
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 58 remote 
 fetches in 4 ms
 14/07/24 22:57:09 INFO network.ConnectionManager: Removing 
 ReceivingConnection to ConnectionManagerId(tuan221,51153)
 14/07/24 22:57:09 INFO network.ConnectionManager: Removing SendingConnection 
 to ConnectionManagerId(tuan221,51153)
 14/07/24 22:57:09 INFO network.ConnectionManager: Removing SendingConnection 
 to ConnectionManagerId(tuan221,51153)
 14/07/24 23:05:07 INFO network.ConnectionManager: Key not valid ? 
 

[jira] [Commented] (SPARK-2681) Spark can hang when fetching shuffle blocks

2014-07-24 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14074044#comment-14074044
 ] 

Guoqiang Li commented on SPARK-2681:


OK, but have some time.

 Spark can hang when fetching shuffle blocks
 ---

 Key: SPARK-2681
 URL: https://issues.apache.org/jira/browse/SPARK-2681
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Guoqiang Li
Priority: Blocker

 executor log :
 {noformat}
 14/07/24 22:56:52 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 53628
 14/07/24 22:56:52 INFO executor.Executor: Running task ID 53628
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_3 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_18 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_16 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_19 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_20 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_21 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_22 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_3 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_18 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_16 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_19 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_20 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_21 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_22 locally
 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Updating epoch to 236 
 and clearing cache
 14/07/24 22:56:52 INFO spark.CacheManager: Partition rdd_51_83 not found, 
 computing it
 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Don't have map outputs 
 for shuffle 9, fetching them
 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker 
 actor = 
 Actor[akka.tcp://spark@tuan202:49488/user/MapOutputTracker#-1031481395]
 14/07/24 22:56:53 INFO spark.MapOutputTrackerWorker: Got the output locations
 14/07/24 22:56:53 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 
 50331648, targetRequestSize: 10066329
 14/07/24 22:56:53 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1024 
 non-empty blocks out of 1024 blocks
 14/07/24 22:56:53 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 58 remote 
 fetches in 8 ms
 14/07/24 22:56:55 INFO storage.MemoryStore: ensureFreeSpace(28728) called 
 with curMem=920109320, maxMem=4322230272
 14/07/24 22:56:55 INFO storage.MemoryStore: Block rdd_51_83 stored as values 
 to memory (estimated size 28.1 KB, free 3.2 GB)
 14/07/24 22:56:55 INFO storage.BlockManagerMaster: Updated info of block 
 rdd_51_83
 14/07/24 22:56:55 INFO spark.CacheManager: Partition rdd_189_83 not found, 
 computing it
 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Don't have map outputs 
 for shuffle 28, fetching them
 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker 
 actor = 
 Actor[akka.tcp://spark@tuan202:49488/user/MapOutputTracker#-1031481395]
 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Got the output locations
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 
 50331648, targetRequestSize: 10066329
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty 
 blocks out of 1024 blocks
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 1 remote 
 fetches in 0 ms
 14/07/24 22:56:55 INFO spark.CacheManager: Partition rdd_50_83 not found, 
 computing it
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 
 50331648, targetRequestSize: 10066329
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1024 
 non-empty blocks out of 1024 blocks
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 58 remote 
 fetches in 4 ms
 14/07/24 22:57:09 INFO network.ConnectionManager: Removing 
 ReceivingConnection to ConnectionManagerId(tuan221,51153)
 14/07/24 22:57:09 INFO network.ConnectionManager: Removing SendingConnection 
 to ConnectionManagerId(tuan221,51153)
 14/07/24 22:57:09 INFO network.ConnectionManager: Removing SendingConnection 
 to ConnectionManagerId(tuan221,51153)
 14/07/24 23:05:07 INFO network.ConnectionManager: Key not valid ? 
 sun.nio.ch.SelectionKeyImpl@3dcc1da1
 14/07/24 23:05:07 INFO 

[jira] [Commented] (SPARK-2618) use config spark.scheduler.priority for specifying TaskSet's priority on DAGScheduler

2014-07-24 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14074045#comment-14074045
 ] 

Patrick Wendell commented on SPARK-2618:


We shouldn't expose these types of hooks into the scheduler internals. 
The TaskSet, for instance, is an implementation detail that we don't want to be part 
of a public API, and the priority is an internal concept.

The public API of Spark for scheduling policies is the Fair Scheduler. Many 
different types of policies can be achieved within fair scheduling, including 
having a high-priority pool to which tasks are submitted.
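
A minimal sketch of that Fair Scheduler route, assuming an allocation file that defines a pool named highPriority (the file path and pool name here are illustrative):
{code}
import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch: route urgent jobs to a high-priority fair-scheduler pool.
val conf = new SparkConf()
  .setAppName("PriorityDemo")
  .set("spark.scheduler.mode", "FAIR")
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")  // assumed path
val sc = new SparkContext(conf)

// Jobs launched from this thread go to the highPriority pool.
sc.setLocalProperty("spark.scheduler.pool", "highPriority")
sc.parallelize(1 to 100).count()

// Clear the pool so later jobs from this thread use the default pool again.
sc.setLocalProperty("spark.scheduler.pool", null)
{code}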

 use config spark.scheduler.priority for specifying TaskSet's priority on 
 DAGScheduler
 -

 Key: SPARK-2618
 URL: https://issues.apache.org/jira/browse/SPARK-2618
 Project: Spark
  Issue Type: Improvement
Reporter: Lianhui Wang

 We use the Shark server for interactive queries; every SQL statement runs as a job. 
 Sometimes we want a query submitted to the Shark server later to run immediately, 
 so we need to let users define a job's priority and ensure that a high-priority 
 job can be launched first.
 I have created a pull request: https://github.com/apache/spark/pull/1528



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-2681) Spark can hang when fetching shuffle blocks

2014-07-24 Thread Guoqiang Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14074044#comment-14074044
 ] 

Guoqiang Li edited comment on SPARK-2681 at 7/25/14 4:39 AM:
-

OK, but may take some time.


was (Author: gq):
OK, but have some time.

 Spark can hang when fetching shuffle blocks
 ---

 Key: SPARK-2681
 URL: https://issues.apache.org/jira/browse/SPARK-2681
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Guoqiang Li
Priority: Blocker

 executor log :
 {noformat}
 14/07/24 22:56:52 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
 task 53628
 14/07/24 22:56:52 INFO executor.Executor: Running task ID 53628
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_3 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_18 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_16 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_19 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_20 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_21 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_22 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_3 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_18 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_16 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_19 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_20 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_21 locally
 14/07/24 22:56:52 INFO storage.BlockManager: Found block broadcast_22 locally
 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Updating epoch to 236 
 and clearing cache
 14/07/24 22:56:52 INFO spark.CacheManager: Partition rdd_51_83 not found, 
 computing it
 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Don't have map outputs 
 for shuffle 9, fetching them
 14/07/24 22:56:52 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker 
 actor = 
 Actor[akka.tcp://spark@tuan202:49488/user/MapOutputTracker#-1031481395]
 14/07/24 22:56:53 INFO spark.MapOutputTrackerWorker: Got the output locations
 14/07/24 22:56:53 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 
 50331648, targetRequestSize: 10066329
 14/07/24 22:56:53 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1024 
 non-empty blocks out of 1024 blocks
 14/07/24 22:56:53 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 58 remote 
 fetches in 8 ms
 14/07/24 22:56:55 INFO storage.MemoryStore: ensureFreeSpace(28728) called 
 with curMem=920109320, maxMem=4322230272
 14/07/24 22:56:55 INFO storage.MemoryStore: Block rdd_51_83 stored as values 
 to memory (estimated size 28.1 KB, free 3.2 GB)
 14/07/24 22:56:55 INFO storage.BlockManagerMaster: Updated info of block 
 rdd_51_83
 14/07/24 22:56:55 INFO spark.CacheManager: Partition rdd_189_83 not found, 
 computing it
 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Don't have map outputs 
 for shuffle 28, fetching them
 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Doing the fetch; tracker 
 actor = 
 Actor[akka.tcp://spark@tuan202:49488/user/MapOutputTracker#-1031481395]
 14/07/24 22:56:55 INFO spark.MapOutputTrackerWorker: Got the output locations
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 
 50331648, targetRequestSize: 10066329
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1 non-empty 
 blocks out of 1024 blocks
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 1 remote 
 fetches in 0 ms
 14/07/24 22:56:55 INFO spark.CacheManager: Partition rdd_50_83 not found, 
 computing it
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: maxBytesInFlight: 
 50331648, targetRequestSize: 10066329
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Getting 1024 
 non-empty blocks out of 1024 blocks
 14/07/24 22:56:55 INFO 
 storage.BlockFetcherIterator$BasicBlockFetcherIterator: Started 58 remote 
 fetches in 4 ms
 14/07/24 22:57:09 INFO network.ConnectionManager: Removing 
 ReceivingConnection to ConnectionManagerId(tuan221,51153)
 14/07/24 22:57:09 INFO network.ConnectionManager: Removing SendingConnection 
 to ConnectionManagerId(tuan221,51153)
 14/07/24 22:57:09 INFO network.ConnectionManager: Removing SendingConnection 
 to ConnectionManagerId(tuan221,51153)
 14/07/24 23:05:07 INFO 

[jira] [Updated] (SPARK-2670) FetchFailedException should be thrown when local fetch has failed

2014-07-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2670:
---

Priority: Critical  (was: Major)

 FetchFailedException should be thrown when local fetch has failed
 -

 Key: SPARK-2670
 URL: https://issues.apache.org/jira/browse/SPARK-2670
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Kousuke Saruta
Priority: Critical

 In BasicBlockFetcherIterator, when a remote fetch fails, a FetchResult whose 
 size is -1 is put into results:
 {code}
case None => {
  logError("Could not get block(s) from " + cmId)
  for ((blockId, size) <- req.blocks) {
    results.put(new FetchResult(blockId, -1, null))
  }
}
 {code}
 The size of -1 marks the fetch as failed, and BlockStoreShuffleFetcher#unpackBlock 
 then throws FetchFailedException so that the fetch can be retried.
 But when a local fetch fails, no such failed FetchResult is put into results, 
 so that fetch cannot be retried.
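
 A self-contained sketch of the symmetric handling this asks for; it is illustrative only, not the actual BlockFetcherIterator code, and the types and helper below are simplified stand-ins:
 {code}
import java.util.concurrent.LinkedBlockingQueue

// Simplified stand-ins for the real classes; a size of -1 marks a failed fetch.
case class FetchResult(blockId: String, size: Long, data: Option[Iterator[Any]])
val results = new LinkedBlockingQueue[FetchResult]()

def fetchLocal(blockId: String, readLocal: String => Option[Iterator[Any]]): Unit = {
  try {
    readLocal(blockId) match {
      case Some(iter) => results.put(FetchResult(blockId, 0, Some(iter)))
      case None       => results.put(FetchResult(blockId, -1, None))  // mark as failed
    }
  } catch {
    case e: Exception =>
      // Failed local reads get the same -1 marker as failed remote fetches,
      // so the consumer can throw FetchFailedException and the stage can retry.
      results.put(FetchResult(blockId, -1, None))
  }
}
 {code}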



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2670) FetchFailedException should be thrown when local fetch has failed

2014-07-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2670:
---

Component/s: Spark Core

 FetchFailedException should be thrown when local fetch has failed
 -

 Key: SPARK-2670
 URL: https://issues.apache.org/jira/browse/SPARK-2670
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Kousuke Saruta
Priority: Critical

 In BasicBlockFetcherIterator, when a remote fetch fails, a FetchResult whose 
 size is -1 is put into results:
 {code}
case None => {
  logError("Could not get block(s) from " + cmId)
  for ((blockId, size) <- req.blocks) {
    results.put(new FetchResult(blockId, -1, null))
  }
}
 {code}
 The size of -1 marks the fetch as failed, and BlockStoreShuffleFetcher#unpackBlock 
 then throws FetchFailedException so that the fetch can be retried.
 But when a local fetch fails, no such failed FetchResult is put into results, 
 so that fetch cannot be retried.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2670) FetchFailedException should be thrown when local fetch has failed

2014-07-24 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2670:
---

Target Version/s: 1.1.0

 FetchFailedException should be thrown when local fetch has failed
 -

 Key: SPARK-2670
 URL: https://issues.apache.org/jira/browse/SPARK-2670
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Kousuke Saruta

 In BasicBlockFetcherIterator, when a remote fetch fails, a FetchResult whose 
 size is -1 is put into results:
 {code}
case None => {
  logError("Could not get block(s) from " + cmId)
  for ((blockId, size) <- req.blocks) {
    results.put(new FetchResult(blockId, -1, null))
  }
}
 {code}
 The size of -1 marks the fetch as failed, and BlockStoreShuffleFetcher#unpackBlock 
 then throws FetchFailedException so that the fetch can be retried.
 But when a local fetch fails, no such failed FetchResult is put into results, 
 so that fetch cannot be retried.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2682) Javadoc generated from Scala source code is not in javadoc's index

2014-07-24 Thread Yin Huai (JIRA)
Yin Huai created SPARK-2682:
---

 Summary: Javadoc generated from Scala source code is not in 
javadoc's index
 Key: SPARK-2682
 URL: https://issues.apache.org/jira/browse/SPARK-2682
 Project: Spark
  Issue Type: Bug
Reporter: Yin Huai


It seems genjavadocSettings was deleted from SparkBuild. We need to add it back so that 
unidoc generates javadoc for our Scala code.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-2682) Javadoc generated from Scala source code is not in javadoc's index

2014-07-24 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-2682:


Component/s: Documentation

 Javadoc generated from Scala source code is not in javadoc's index
 --

 Key: SPARK-2682
 URL: https://issues.apache.org/jira/browse/SPARK-2682
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: Yin Huai
Assignee: Yin Huai

 It seems genjavadocSettings was deleted from SparkBuild. We need to add it back 
 so that unidoc generates javadoc for our Scala code.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2682) Javadoc generated from Scala source code is not in javadoc's index

2014-07-24 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14074109#comment-14074109
 ] 

Apache Spark commented on SPARK-2682:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/1584

 Javadoc generated from Scala source code is not in javadoc's index
 --

 Key: SPARK-2682
 URL: https://issues.apache.org/jira/browse/SPARK-2682
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Reporter: Yin Huai
Assignee: Yin Huai

 It seems genjavadocSettings was deleted from SparkBuild. We need to add it back 
 so that unidoc generates javadoc for our Scala code.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2664) Deal with `--conf` options in spark-submit that relate to flags

2014-07-24 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14074108#comment-14074108
 ] 

Patrick Wendell commented on SPARK-2664:


Hey Sandy,

The reason we originally allowed spark-defaults.conf to support Spark 
options that have corresponding flags (which is option 1 here) is that there was no 
other way for users to set things like the master in a configuration file. This 
seems like something we need to support (at least, IIRC it was one reason some 
packagers wanted a configuration file in the first place). So my proposal here 
was to treat --conf the same way. I'd also be okay with throwing an 
exception if users try to set one of these via --conf when it corresponds to a 
flag, but then it deviates a bit from the behavior of the config file, which 
could be confusing.

As for adding new configs for flags that don't currently have a 
corresponding conf: personally, I'm also open to doing that. It would certainly 
simplify the fact that we have two different concepts at the moment with no 
1:1 mapping between them. In my experience users have been most confused 
by the fact that those flags and Spark conf properties partially, but not 
completely, overlap. They haven't been as confused about 
precedence order, since we state it clearly.

In the shorter term, given that we can't revert the behavior of 
spark-defaults.conf, I'd prefer to either use that same behavior for flags (the 
original proposal) or to throw an exception if any of the reserved properties 
are set via the --conf flag instead of their dedicated flag (we can always make 
it accept more liberally later). Otherwise it's super confusing what happens if 
the user sets both --conf spark.master and --master.
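
To make the precedence question concrete, here is a small self-contained sketch of one way the conflict could be resolved or rejected for a reserved property like spark.master; it is illustrative only, not SparkSubmit's actual code:
{code}
// Illustrative precedence sketch for a reserved property: dedicated flag, then --conf,
// then spark-defaults.conf; a conflicting --conf and flag is rejected outright.
def resolveMaster(flagMaster: Option[String],
                  confOptions: Map[String, String],
                  defaultsFile: Map[String, String]): String = {
  val confMaster = confOptions.get("spark.master")
  (flagMaster, confMaster) match {
    case (Some(m), Some(c)) if m != c =>
      throw new IllegalArgumentException(
        s"Conflicting masters: --master $m vs --conf spark.master=$c")
    case (Some(m), _)    => m
    case (None, Some(c)) => c
    case (None, None)    => defaultsFile.getOrElse("spark.master", "local[*]")
  }
}
{code}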

 Deal with `--conf` options in spark-submit that relate to flags
 ---

 Key: SPARK-2664
 URL: https://issues.apache.org/jira/browse/SPARK-2664
 Project: Spark
  Issue Type: Bug
Reporter: Patrick Wendell
Assignee: Sandy Ryza
Priority: Blocker

 If someone sets a spark conf that relates to an existing flag `--master`, we 
 should set it correctly like we do with the defaults file. Otherwise it can 
 have confusing semantics. I noticed this after merging it, otherwise I would 
 have mentioned it in the review.
 I think it's as simple as modifying loadDefaults to check the user-supplied 
 options also. We might change it to loadUserProperties since it's no longer 
 just the defaults file.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2538) External aggregation in Python

2014-07-24 Thread Matei Zaharia (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matei Zaharia resolved SPARK-2538.
--

   Resolution: Fixed
Fix Version/s: (was: 1.0.1)
   (was: 1.0.0)
   1.1.0

 External aggregation in Python
 --

 Key: SPARK-2538
 URL: https://issues.apache.org/jira/browse/SPARK-2538
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 1.0.0, 1.0.1
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Critical
  Labels: pyspark
 Fix For: 1.1.0

   Original Estimate: 72h
  Remaining Estimate: 72h

 For huge reduce tasks, users will get an out-of-memory exception when all the 
 data cannot fit in memory.
 It should spill some of the data to disk and then merge it back together, just 
 like what we do in Scala. 
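
 A compact sketch of the spill-and-merge idea, written in Scala rather than Python and counting words rather than running a general combiner; everything here is illustrative, not PySpark's implementation:
 {code}
import java.io.{File, PrintWriter}
import scala.collection.mutable
import scala.io.Source

// Hedged sketch of external aggregation: keep a bounded in-memory map, spill it to
// disk when it grows too large, then merge the spills with the remaining map.
def externalCountByKey(input: Iterator[String], maxInMemory: Int): Map[String, Long] = {
  val current = mutable.HashMap.empty[String, Long]
  val spills = mutable.ArrayBuffer.empty[File]

  def spill(): Unit = {
    val f = File.createTempFile("agg-spill", ".txt")
    val out = new PrintWriter(f)
    current.foreach { case (k, v) => out.println(s"$k\t$v") }
    out.close()
    spills += f
    current.clear()
  }

  input.foreach { key =>
    current(key) = current.getOrElse(key, 0L) + 1L
    if (current.size >= maxInMemory) spill()   // memory bound hit: spill to disk
  }

  val merged = mutable.HashMap.empty[String, Long] ++= current
  spills.foreach { f =>
    Source.fromFile(f).getLines().foreach { line =>
      val Array(k, v) = line.split("\t")
      merged(k) = merged.getOrElse(k, 0L) + v.toLong
    }
    f.delete()
  }
  merged.toMap
}
 {code}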



--
This message was sent by Atlassian JIRA
(v6.2#6252)

