[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278381#comment-14278381 ] Pedro Rodriguez commented on SPARK-1405: Worked on some preliminary testing results, but ran into a snag which greatly limited what I could test. Used 4 EC2 r3.2xlarge instances, which totals 32 executors with 215GB memory, so fairly close to the tests run by [~gq]. I have been using the data generator I wrote, but had not previously stress tested it for a large number of topics, and it is failing for some number of topics between 100 and 1000. I made some optimizations with mapPartitions, but still run into GC overhead problems. I am unsure if this is my fault or if breeze is generating lots of garbage somewhere when sampling from the Dirichlet/Multinomial distributions. Extra eyes would be great to see if it looks reasonable from a GC perspective: https://github.com/EntilZha/spark/blob/LDA/mllib/src/main/scala/org/apache/spark/mllib/util/LDADataGenerator.scala Alternatively, [~gq], would it be possible to get the data set you used for testing? Or are you using a different dataset for testing, [~josephkb]? It would be useful to have a common way to compare implementation performance. Here are the test results I did get:
Docs: 250,000
Words: 75,000
Tokens: 30,000,000
Iterations: 15
Topics: 100
Alpha=Beta=.01
Setup Time: 20s
Resampling Time: 80s
Updating Counts Time: 53s
Global Counts Time: 4s
Total Time: 170s (2.83m)
This looks like a good improvement over the original numbers (red/blue plot): extrapolating for the fact that this run used 10x fewer iterations, it would be 28m vs ~45m. I am fairly confident that this time will scale well with the number of topics, based on the previous test results I posted. I would be more than happy to run more benchmark tests if I can get access to the data set used for the other tests, or whatever Joseph is using to test his PR.
I am also going to start working on refactoring into Joseph's API, and will open a PR once that is done, probably later this week. It would be great to have both Gibbs and EM for the next release. > parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib > - > > Key: SPARK-1405 > URL: https://issues.apache.org/jira/browse/SPARK-1405 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xusen Yin >Assignee: Guoqiang Li >Priority: Critical > Labels: features > Attachments: performance_comparison.png > > Original Estimate: 336h > Remaining Estimate: 336h > > Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts > topics from a text corpus. Unlike current machine learning algorithms > in MLlib, which use optimization algorithms such as gradient descent, > LDA uses expectation algorithms such as Gibbs sampling. > In this PR, I prepare an LDA implementation based on Gibbs sampling, with a > wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), > and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
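For comparing implementations on common data, the LDA generative process itself is straightforward to sketch. The following is a minimal, hypothetical generator and is NOT the LDADataGenerator linked above: for simplicity it fixes alpha = beta = 1, where a symmetric Dirichlet draw reduces to normalized Exp(1) samples (a real generator would draw Gamma(alpha) variates), and all names and sizes are illustrative.

```scala
import scala.util.Random

// Dirichlet(1, ..., 1) draw: normalized Exp(1) samples (uniform on the simplex).
def dirichlet1(k: Int, rng: Random): Array[Double] = {
  val g = Array.fill(k)(-math.log(rng.nextDouble())) // Exp(1) draws
  val s = g.sum
  g.map(_ / s)
}

// Inverse-CDF draw from a discrete distribution p (assumed to sum to 1).
def sampleIndex(p: Array[Double], rng: Random): Int = {
  var u = rng.nextDouble()
  var i = 0
  while (i < p.length - 1 && u >= p(i)) { u -= p(i); i += 1 }
  i
}

val rng = new Random(42L)
val (numTopics, vocabSize, docLength, numDocs) = (10, 1000, 50, 100)
// phi_k: one word distribution per topic
val topics = Array.fill(numTopics)(dirichlet1(vocabSize, rng))
val docs = Array.fill(numDocs) {
  val theta = dirichlet1(numTopics, rng) // per-document topic mixture
  Array.fill(docLength) {
    val z = sampleIndex(theta, rng) // topic assignment for this token
    sampleIndex(topics(z), rng)     // word id drawn from that topic
  }
}
```

Note the generator allocates only small per-document arrays, which is one place to look when hunting the GC overhead mentioned above.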
[jira] [Updated] (SPARK-5186) Vector.equals and Vector.hashCode are very inefficient and fail on SparseVectors with large size
[ https://issues.apache.org/jira/browse/SPARK-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Derrick Burns updated SPARK-5186: - Summary: Vector.equals and Vector.hashCode are very inefficient and fail on SparseVectors with large size (was: Vector.equals and Vector.hashCode are very inefficient) > Vector.equals and Vector.hashCode are very inefficient and fail on > SparseVectors with large size > - > > Key: SPARK-5186 > URL: https://issues.apache.org/jira/browse/SPARK-5186 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Derrick Burns > Original Estimate: 0.25h > Remaining Estimate: 0.25h > > The implementations of Vector.equals and Vector.hashCode are correct but slow > for SparseVectors that are truly sparse.
[jira] [Commented] (SPARK-5186) Vector.equals and Vector.hashCode are very inefficient
[ https://issues.apache.org/jira/browse/SPARK-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278374#comment-14278374 ] Derrick Burns commented on SPARK-5186: -- The aforementioned pull request does fix part of the problem; however, hashCode is broken, not just inefficient. If one calls hashCode on a SparseVector with a large size, one will get an out-of-memory error. > Vector.equals and Vector.hashCode are very inefficient > --- > > Key: SPARK-5186 > URL: https://issues.apache.org/jira/browse/SPARK-5186 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Derrick Burns > Original Estimate: 0.25h > Remaining Estimate: 0.25h > > The implementations of Vector.equals and Vector.hashCode are correct but slow > for SparseVectors that are truly sparse.
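One direction for such a fix, sketched here with hypothetical names (this is a simplified stand-in, not the MLlib Vector API or the actual PR): restrict both operations to the stored non-zero entries, so the cost is O(nnz) and no dense structure of length `size` is ever materialized.

```scala
// Hypothetical sparse vector whose equals/hashCode touch only non-zeros.
// Assumes `indices` is strictly increasing and `values` holds no explicit zeros.
class SV(val size: Int, val indices: Array[Int], val values: Array[Double]) {
  override def equals(other: Any): Boolean = other match {
    case o: SV =>
      size == o.size &&
        java.util.Arrays.equals(indices, o.indices) &&
        java.util.Arrays.equals(values, o.values)
    case _ => false
  }
  // O(nnz) and allocation-free: never expands to a dense array of `size` doubles,
  // so a huge `size` cannot trigger an out-of-memory error here.
  override def hashCode(): Int = {
    var h = size
    var i = 0
    while (i < indices.length) {
      h = 31 * h + indices(i)
      val bits = java.lang.Double.doubleToLongBits(values(i))
      h = 31 * h + (bits ^ (bits >>> 32)).toInt
      i += 1
    }
    h
  }
}
```

A real Vector.equals has the extra wrinkle that a SparseVector and a DenseVector with identical contents must compare equal (and explicitly stored zeros must not break the contract); this sketch ignores that.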
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278305#comment-14278305 ] Muhammad-Ali A'rabi commented on SPARK-5226: Yeah, of course. It will take me a day or two, so I commented to say that I'm working on it. > Add DBSCAN Clustering Algorithm to MLlib > > > Key: SPARK-5226 > URL: https://issues.apache.org/jira/browse/SPARK-5226 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Muhammad-Ali A'rabi >Priority: Minor > Labels: DBSCAN > > MLlib is all k-means now, and I think we should add some new clustering > algorithms to it. The first candidate, I think, is DBSCAN.
[jira] [Created] (SPARK-5261) In some cases, the value of a word's vector representation is too big
Guoqiang Li created SPARK-5261: -- Summary: In some cases, the value of a word's vector representation is too big Key: SPARK-5261 URL: https://issues.apache.org/jira/browse/SPARK-5261 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Guoqiang Li
{code}
val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36)
{code}
The average absolute value of the word's vector representation is 60731.8.
{code}
val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(1)
{code}
The average absolute value of the word's vector representation is 0.13889.
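The statistic quoted above can be reproduced from a trained model's word-to-vector map (in MLlib, Word2VecModel.getVectors returns the word-to-embedding map). The helper below is a sketch operating on a plain Map so it stands alone; the name `avgAbs` is hypothetical.

```scala
// Average absolute component value across all word vectors.
// `vectors` stands in for Word2VecModel.getVectors (word -> embedding).
def avgAbs(vectors: Map[String, Array[Float]]): Double = {
  var sum = 0.0
  var n = 0L
  for (vec <- vectors.valuesIterator; v <- vec) {
    sum += math.abs(v)
    n += 1
  }
  if (n == 0) 0.0 else sum / n
}
```

Comparing this value across runs that differ only in setNumPartitions is what exposes the blow-up reported here.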
[jira] [Commented] (SPARK-5193) Make Spark SQL API usable in Java and remove the Java-specific API
[ https://issues.apache.org/jira/browse/SPARK-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278249#comment-14278249 ] Apache Spark commented on SPARK-5193: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4056 > Make Spark SQL API usable in Java and remove the Java-specific API > -- > > Key: SPARK-5193 > URL: https://issues.apache.org/jira/browse/SPARK-5193 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > The Java version of the SchemaRDD API causes a high maintenance burden for Spark > SQL itself and downstream libraries (e.g. the MLlib pipeline API needs to support > both JavaSchemaRDD and SchemaRDD). We can audit the Scala API and make it > usable for Java, and then we can remove the Java-specific version. > Things to remove include (Java versions of): > - data type > - Row > - SQLContext > - HiveContext > Things to consider: > - Scala and Java have different collection libraries. > - Scala and Java (8) have different closure interfaces. > - Scala and Java can have duplicate definitions of common classes, such as > BigDecimal.
[jira] [Updated] (SPARK-5247) Enable javadoc/scaladoc for public classes in catalyst project
[ https://issues.apache.org/jira/browse/SPARK-5247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5247: --- Assignee: Michael Armbrust > Enable javadoc/scaladoc for public classes in catalyst project > -- > > Key: SPARK-5247 > URL: https://issues.apache.org/jira/browse/SPARK-5247 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Michael Armbrust > > We previously did not generate any docs for the entire catalyst project. > Since we are now defining public APIs in that project (under org.apache.spark.sql > but outside of org.apache.spark.sql.catalyst, such as Row and types._), we should > start generating javadoc/scaladoc for those.
[jira] [Updated] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
[ https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corey J. Nolet updated SPARK-5260: -- Description: I have found this method extremely useful when implementing my own strategy for inferring a schema from parsed JSON. For now, I've actually copied the method right out of the JsonRDD class into my own project, but I think it would be immensely useful to keep the code in Spark and expose it publicly somewhere else, like an object called JsonSchema. (was: I have found this method extremely useful when implementing my own method for inferring a schema from parsed JSON. For now, I've actually copied the method right out of the JsonRDD class into my own project, but I think it would be immensely useful to keep the code in Spark and expose it publicly somewhere else, like an object called JsonSchema.) > Expose JsonRDD.allKeysWithValueTypes() in a utility class > -- > > Key: SPARK-5260 > URL: https://issues.apache.org/jira/browse/SPARK-5260 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Corey J. Nolet > Fix For: 1.3.0 > > > I have found this method extremely useful when implementing my own strategy > for inferring a schema from parsed JSON. For now, I've actually copied the > method right out of the JsonRDD class into my own project, but I think it > would be immensely useful to keep the code in Spark and expose it publicly > somewhere else, like an object called JsonSchema.
[jira] [Created] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
Corey J. Nolet created SPARK-5260: - Summary: Expose JsonRDD.allKeysWithValueTypes() in a utility class Key: SPARK-5260 URL: https://issues.apache.org/jira/browse/SPARK-5260 Project: Spark Issue Type: Improvement Components: SQL Reporter: Corey J. Nolet Fix For: 1.3.0 I have found this method extremely useful when implementing my own method for inferring a schema from parsed JSON. For now, I've actually copied the method right out of the JsonRDD class into my own project, but I think it would be immensely useful to keep the code in Spark and expose it publicly somewhere else, like an object called JsonSchema.
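The kind of inference being described can be sketched independently of JsonRDD. The helper below is hypothetical and heavily simplified: it handles only flat records, and reports plain JVM class names rather than the Spark SQL DataTypes the real allKeysWithValueTypes() produces.

```scala
// Collect, for each key appearing in any record, the set of value types
// observed for it -- the raw material for inferring a schema from JSON.
def keyTypes(records: Seq[Map[String, Any]]): Map[String, Set[String]] =
  records
    .flatMap(_.toSeq) // all (key, value) pairs across records
    .groupBy(_._1)    // group observations by key
    .map { case (k, kvs) =>
      k -> kvs.map {
        case (_, null) => "Null"
        case (_, v)    => v.getClass.getSimpleName
      }.toSet
    }
```

A key mapped to more than one type (e.g. Integer in one record, Double in another) is exactly the case a schema-inference strategy has to resolve, typically by widening to the least common type.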
[jira] [Comment Edited] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278228#comment-14278228 ] RJ Nowling edited comment on SPARK-4894 at 1/15/15 4:21 AM: [~josephkb], after some thought, I've come around and think your idea of 1 NB class with a Factor type parameter may be the more maintainable choice as well as offering some novel functionality. But there seems to be a lot to figure out (we should be checking the decision tree implementation, for example) and I don't want to hold up what should be a relatively simple change to support Bernoulli NB. Can we create a new JIRA to discuss the NB refactoring? Comments about refactoring: (1) how often is NB used with continuous values? I see that sklearn supports Gaussian NB, but is this used in practice? My understanding is that NB is generally used for text classification with counts or binary values, possibly weighted by TF-IDF. We should probably email the users and dev lists to get user feedback. If no one is asking for it, we should shelve it and focus on other things. (2) after some more reflection, I can see a few more benefits to your suggestion of feature types (e.g., categorical, discrete counts, continuous, binary, etc.). If we created corresponding FeatureLikelihood types (e.g., Bernoulli, Multinomial, Gaussian, etc.), it would promote composition, which would be easier to test, debug, and maintain versus multiple NB subclasses like sklearn. Additionally, if the user can define a type for each feature, then users can mix and match likelihood types as well. Most NB implementations treat all features the same -- what if we had a model that allowed heterogeneous features? If it works well in NB, it could be extended to other parts of MLlib. (There is likely some overlap with decision trees since they support multiple feature types, so we might want to see if there is anything there we can reuse.) 
At the API level, we could provide a basic API which takes {noformat}RDD[Vector[Double]]{noformat} like the current API so that simplicity isn't compromised and provide a more advanced API for power users. was (Author: rnowling): [~josephkb], after some thought, I've come around and think your idea of 1 NB class with a Factor type parameter may be the more maintainable choice as well as offering some novel functionality. But, there seems to be a lot to figure out (we should be checking the decision tree implementation for example) and I don't want to hold up what should be a relatively simple change to support Bernoulli NB. What do you think? Comments about refactoring: (1) how often is NB used with continuous values? I see that sklearn supports Gaussian NB but is this used in practice? My understanding is that NB is generally used for text classification with counts or binary values, possibly weighted by TF-IDF. We should probably email the users and dev lists to get user feedback. If no one is asking for it, we should shelve it and focus on other things. (2) after some more reflection, I can see a few more benefits to your suggestions of feature types (e.g., categorical, discrete counts, continuous, binary, etc.). If we created corresponding FeatureLikelihood types (e.g., Bernoulli, Multinomial, Gaussian, etc.), it would promote composition which would be easier to test, debug, and maintain versus multiple NB subclasses like sklearn. Additionally, if the user can define a type for each feature, then users can mix and match likelihood types as well. Most NB implementations treat all features the same -- what if we had a model that allowed heterogeneous features? If it works well in NB, it could be extended to other parts of MLlib. (There is likely some overlap with decision trees since they support multiple feature types, so we might want to see if there is anything there we can reuse.) 
At the API level, we could provide a basic API which takes {noformat}RDD[Vector[Double]]{noformat} like the current API so that simplicity isn't compromised and provide a more advanced API for power users. > Add Bernoulli-variant of Naive Bayes > > > Key: SPARK-4894 > URL: https://issues.apache.org/jira/browse/SPARK-4894 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.2.0 >Reporter: RJ Nowling >Assignee: RJ Nowling > > MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli > version of Naive Bayes is more useful for situations where the features are > binary values.
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278228#comment-14278228 ] RJ Nowling commented on SPARK-4894: --- [~josephkb], after some thought, I've come around and think your idea of 1 NB class with a Factor type parameter may be the more maintainable choice as well as offering some novel functionality. But there seems to be a lot to figure out (we should be checking the decision tree implementation, for example) and I don't want to hold up what should be a relatively simple change to support Bernoulli NB. What do you think? Comments about refactoring: (1) how often is NB used with continuous values? I see that sklearn supports Gaussian NB, but is this used in practice? My understanding is that NB is generally used for text classification with counts or binary values, possibly weighted by TF-IDF. We should probably email the users and dev lists to get user feedback. If no one is asking for it, we should shelve it and focus on other things. (2) after some more reflection, I can see a few more benefits to your suggestion of feature types (e.g., categorical, discrete counts, continuous, binary, etc.). If we created corresponding FeatureLikelihood types (e.g., Bernoulli, Multinomial, Gaussian, etc.), it would promote composition, which would be easier to test, debug, and maintain versus multiple NB subclasses like sklearn. Additionally, if the user can define a type for each feature, then users can mix and match likelihood types as well. Most NB implementations treat all features the same -- what if we had a model that allowed heterogeneous features? If it works well in NB, it could be extended to other parts of MLlib. (There is likely some overlap with decision trees since they support multiple feature types, so we might want to see if there is anything there we can reuse.) 
At the API level, we could provide a basic API which takes {noformat}RDD[Vector[Double]]{noformat} like the current API so that simplicity isn't compromised and provide a more advanced API for power users. > Add Bernoulli-variant of Naive Bayes > > > Key: SPARK-4894 > URL: https://issues.apache.org/jira/browse/SPARK-4894 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.2.0 >Reporter: RJ Nowling >Assignee: RJ Nowling > > MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli > version of Naive Bayes is more useful for situations where the features are > binary values.
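The two variants discussed above differ only in the per-class log-likelihood term. A minimal sketch (hypothetical helper names; smoothing and class priors omitted), where `logTheta(j)` is log P(feature j | class) and, for Bernoulli, `logThetaNeg(j)` is log(1 - P(feature j | class)):

```scala
// Multinomial NB: counts weight the per-word log-probabilities.
def multinomialLL(x: Array[Double], logTheta: Array[Double]): Double =
  x.indices.map(j => x(j) * logTheta(j)).sum

// Bernoulli NB: every feature contributes, present or not.
def bernoulliLL(x: Array[Double], logTheta: Array[Double],
                logThetaNeg: Array[Double]): Double =
  x.indices.map { j =>
    if (x(j) > 0) logTheta(j) // presence contributes log p_j
    else logThetaNeg(j)       // absence contributes log(1 - p_j)
  }.sum
```

The modeling point behind the issue: in the Bernoulli variant, absent features contribute log(1 - p), so zero entries carry information, which is why it suits binary features better than the multinomial variant does.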
[jira] [Updated] (SPARK-5259) Add task equal() and hashcode() to avoid stage.pendingTasks not accurate while stage was retry
[ https://issues.apache.org/jira/browse/SPARK-5259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SuYan updated SPARK-5259: - Description: 1. When a shuffle stage is retried, there may be 2 TaskSets running; call them taskSet0.0 and taskSet0.1. taskSet0.1 will re-run taskSet0.0's incomplete tasks. If taskSet0.0 has run all the tasks that taskSet0.1 has not yet completed, but which cover the partitions, then the stage's isAvailable is true.
{code}
def isAvailable: Boolean = {
  if (!isShuffleMap) {
    true
  } else {
    numAvailableOutputs == numPartitions
  }
}
{code}
But stage.pendingTasks is not empty, which prevents registering the mapStatus in mapOutputTracker: when a task completes successfully, pendingTasks removes the Task by reference (because Task does not override hashCode() and equals()), i.e. pendingTasks -= task, while numAvailableOutputs is counted by partition ID. Here is the test case to prove it:
{code}
test("Make sure mapStage.pendingtasks is set() " +
  "while MapStage.isAvailable is true while stage was retry ") {
  val firstRDD = new MyRDD(sc, 6, Nil)
  val firstShuffleDep = new ShuffleDependency(firstRDD, null)
  val firstShuyffleId = firstShuffleDep.shuffleId
  val shuffleMapRdd = new MyRDD(sc, 6, List(firstShuffleDep))
  val shuffleDep = new ShuffleDependency(shuffleMapRdd, null)
  val shuffleId = shuffleDep.shuffleId
  val reduceRdd = new MyRDD(sc, 2, List(shuffleDep))
  submit(reduceRdd, Array(0, 1))
  complete(taskSets(0), Seq(
    (Success, makeMapStatus("hostB", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostC", 3)),
    (Success, makeMapStatus("hostB", 4)),
    (Success, makeMapStatus("hostB", 5)),
    (Success, makeMapStatus("hostC", 6))))
  complete(taskSets(1), Seq(
    (Success, makeMapStatus("hostA", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostA", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostA", 1))))
  runEvent(ExecutorLost("exec-hostA"))
  runEvent(CompletionEvent(taskSets(1).tasks(0), Resubmitted, null, null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(2), Resubmitted, null, null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(0),
    FetchFailed(null, firstShuyffleId, -1, 0, "Fetch Mata data failed"),
    null, null, null, null))
  scheduler.resubmitFailedStages()
  runEvent(CompletionEvent(taskSets(1).tasks(0), Success, makeMapStatus("hostC", 1), null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(2), Success, makeMapStatus("hostC", 1), null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(4), Success, makeMapStatus("hostC", 1), null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(5), Success, makeMapStatus("hostB", 2), null, null, null))
  val stage = scheduler.stageIdToStage(taskSets(1).stageId)
  assert(stage.attemptId == 2)
  assert(stage.isAvailable)
  assert(stage.pendingTasks.size == 0)
}
{code}
was: 1. When a shuffle stage is retried, there may be 2 TaskSets running; call them taskSet0.0 and taskSet0.1. taskSet0.1 will re-run taskSet0.0's incomplete tasks. If taskSet0.0 has run all the tasks that taskSet0.1 has not yet completed, then the stage's isAvailable is true.
{code}
def isAvailable: Boolean = {
  if (!isShuffleMap) {
    true
  } else {
    numAvailableOutputs == numPartitions
  }
}
{code}
But stage.pendingTasks is not empty, which prevents registering the mapStatus in mapOutputTracker: when a task completes successfully, pendingTasks removes the Task by reference (because Task does not override hashCode() and equals()), i.e. pendingTasks -= task, while numAvailableOutputs is counted by partition ID. Here is the test case to prove it:
{code}
test("Make sure mapStage.pendingtasks is set() " +
  "while MapStage.isAvailable is true while stage was retry ") {
  val firstRDD = new MyRDD(sc, 6, Nil)
  val firstShuffleDep = new ShuffleDependency(firstRDD, null)
  val firstShuyffleId = firstShuffleDep.shuffleId
  val shuffleMapRdd = new MyRDD(sc, 6, List(firstShuffleDep))
  val shuffleDep = new ShuffleDependency(shuffleMapRdd, null)
  val shuffleId = shuffleDep.shuffleId
  val reduceRdd = new MyRDD(sc, 2, List(shuffleDep))
  submit(reduceRdd, Array(0, 1))
  complete(taskSets(0), Seq(
    (Success, makeMapStatus("hostB", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostC", 3)),
    (Success, makeMapStatus("hostB", 4)),
    (Success, makeMapStatus("hostB", 5)),
    (Success, makeMapStatus("hostC", 6))))
  complete(taskSets(1), Seq(
    (Success, makeMapStatus("hostA", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostA", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success,
{code}
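The fix named in the issue title, value equality keyed on the partition rather than on object identity, can be sketched as follows. This is a hypothetical stand-in, not Spark's actual Task class.

```scala
// A task identified by (stageId, partitionId) rather than by reference, so
// that any attempt for the same partition compares equal.
class MyTask(val stageId: Int, val partitionId: Int) {
  override def equals(other: Any): Boolean = other match {
    case t: MyTask => stageId == t.stageId && partitionId == t.partitionId
    case _         => false
  }
  override def hashCode(): Int = 31 * stageId + partitionId
}

// A resubmitted attempt is a different object for the same partition; with
// value equality, `pending -= attempt` still removes the original entry,
// keeping pendingTasks consistent with the partition-based numAvailableOutputs.
val pending = scala.collection.mutable.Set(new MyTask(1, 0), new MyTask(1, 1))
pending -= new MyTask(1, 0)
```

With reference equality (the behavior described in the issue), the `-=` above would remove nothing and the set would stay at size 2.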
[jira] [Updated] (SPARK-5259) Add task equal() and hashcode() to avoid stage.pendingTasks not accurate while stage was retry
[ https://issues.apache.org/jira/browse/SPARK-5259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SuYan updated SPARK-5259: - Description: 1. while shuffle stage was retry, there may have 2 taskSet running. we call the 2 taskSet:taskSet0.0, taskSet0.1, and we know, taskSet0.1 will re-run taskSet0.0's un-complete task if taskSet0.0 was run all the task that the taskSet0.1 not complete yet. then stage is Available is true. {code} def isAvailable: Boolean = { if (!isShuffleMap) { true } else { numAvailableOutputs == numPartitions } } {code} but stage.pending task is not empty, to protect register mapStatus in mapOutputTracker. because if task is complete success, pendingTasks is minus Task in reference-level because the task is not override hashcode() and equals() pendingTask -= task but numAvailableOutputs is according to partitionID. here is the testcase to prove: {code} test("Make sure mapStage.pendingtasks is set() " + "while MapStage.isAvailable is true while stage was retry ") { val firstRDD = new MyRDD(sc, 6, Nil) val firstShuffleDep = new ShuffleDependency(firstRDD, null) val firstShuyffleId = firstShuffleDep.shuffleId val shuffleMapRdd = new MyRDD(sc, 6, List(firstShuffleDep)) val shuffleDep = new ShuffleDependency(shuffleMapRdd, null) val shuffleId = shuffleDep.shuffleId val reduceRdd = new MyRDD(sc, 2, List(shuffleDep)) submit(reduceRdd, Array(0, 1)) complete(taskSets(0), Seq( (Success, makeMapStatus("hostB", 1)), (Success, makeMapStatus("hostB", 2)), (Success, makeMapStatus("hostC", 3)), (Success, makeMapStatus("hostB", 4)), (Success, makeMapStatus("hostB", 5)), (Success, makeMapStatus("hostC", 6)) )) complete(taskSets(1), Seq( (Success, makeMapStatus("hostA", 1)), (Success, makeMapStatus("hostB", 2)), (Success, makeMapStatus("hostA", 1)), (Success, makeMapStatus("hostB", 2)), (Success, makeMapStatus("hostA", 1)) )) runEvent(ExecutorLost("exec-hostA")) runEvent(CompletionEvent(taskSets(1).tasks(0), Resubmitted, null, null, null, null)) 
runEvent(CompletionEvent(taskSets(1).tasks(2), Resubmitted, null, null, null, null)) runEvent(CompletionEvent(taskSets(1).tasks(0), FetchFailed(null, firstShuyffleId, -1, 0, "Fetch Mata data failed"), null, null, null, null)) scheduler.resubmitFailedStages() runEvent(CompletionEvent(taskSets(1).tasks(0), Success, makeMapStatus("hostC", 1), null, null, null)) runEvent(CompletionEvent(taskSets(1).tasks(2), Success, makeMapStatus("hostC", 1), null, null, null)) runEvent(CompletionEvent(taskSets(1).tasks(4), Success, makeMapStatus("hostC", 1), null, null, null)) runEvent(CompletionEvent(taskSets(1).tasks(5), Success, makeMapStatus("hostB", 2), null, null, null)) val stage = scheduler.stageIdToStage(taskSets(1).stageId) assert(stage.attemptId == 2) assert(stage.isAvailable) assert(stage.pendingTasks.size == 0) } {code} was: 1. while shuffle stage was retry, there may have 2 taskSet running. we call the 2 taskSet:taskSet0.0, taskSet0.1, and we know, taskSet0.1 will re-run taskSet0.0's un-complete task if taskSet0.0 was run all the task that the taskSet0.1 not complete yet. then stage is Available is true. def isAvailable: Boolean = { if (!isShuffleMap) { true } else { numAvailableOutputs == numPartitions } } but stage.pending task is not empty, to protect register mapStatus in mapOutputTracker. because if task is complete success, pendingTasks is minus Task in reference-level because the task is not override hashcode() and equals() pendingTask -= task but numAvailableOutputs is according to partitionID. 
here is the testcase to prove: {code} test("Make sure mapStage.pendingtasks is set() " + "while MapStage.isAvailable is true while stage was retry ") { val firstRDD = new MyRDD(sc, 6, Nil) val firstShuffleDep = new ShuffleDependency(firstRDD, null) val firstShuyffleId = firstShuffleDep.shuffleId val shuffleMapRdd = new MyRDD(sc, 6, List(firstShuffleDep)) val shuffleDep = new ShuffleDependency(shuffleMapRdd, null) val shuffleId = shuffleDep.shuffleId val reduceRdd = new MyRDD(sc, 2, List(shuffleDep)) submit(reduceRdd, Array(0, 1)) complete(taskSets(0), Seq( (Success, makeMapStatus("hostB", 1)), (Success, makeMapStatus("hostB", 2)), (Success, makeMapStatus("hostC", 3)), (Success, makeMapStatus("hostB", 4)), (Success, makeMapStatus("hostB", 5)), (Success, makeMapStatus("hostC", 6)) )) complete(taskSets(1), Seq( (Success, makeMapStatus("hostA", 1)), (Success, makeMapStatus("hostB", 2)), (Success, makeMapStatus("hostA", 1)), (Success, makeMapStatus("hostB", 2)), (Success, makeMapStatus("hostA", 1)) )) ru
[jira] [Updated] (SPARK-5259) Add task equal() and hashcode() to avoid stage.pendingTasks not accurate while stage was retry
[ https://issues.apache.org/jira/browse/SPARK-5259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SuYan updated SPARK-5259: - Description: 1. When a shuffle map stage is retried, two task sets may be running at the same time. Call them taskSet0.0 and taskSet0.1; taskSet0.1 re-runs the tasks of taskSet0.0 that have not completed. If taskSet0.0 finishes all the tasks that taskSet0.1 has not yet completed, the stage reports itself as available:
{code}
def isAvailable: Boolean = {
  if (!isShuffleMap) {
    true
  } else {
    numAvailableOutputs == numPartitions
  }
}
{code}
However, stage.pendingTasks is still non-empty, which prevents the map statuses from being registered in mapOutputTracker. The cause is that when a task completes successfully, pendingTasks removes the Task by object reference, because Task does not override hashCode() and equals():
{code}
pendingTasks -= task
{code}
whereas numAvailableOutputs is counted by partition ID, so the two sets of bookkeeping can disagree. Here is a test case that reproduces the problem:
{code}
test("Make sure mapStage.pendingTasks is empty " +
    "while MapStage.isAvailable is true after the stage was retried") {
  val firstRDD = new MyRDD(sc, 6, Nil)
  val firstShuffleDep = new ShuffleDependency(firstRDD, null)
  val firstShuffleId = firstShuffleDep.shuffleId
  val shuffleMapRdd = new MyRDD(sc, 6, List(firstShuffleDep))
  val shuffleDep = new ShuffleDependency(shuffleMapRdd, null)
  val shuffleId = shuffleDep.shuffleId
  val reduceRdd = new MyRDD(sc, 2, List(shuffleDep))
  submit(reduceRdd, Array(0, 1))
  complete(taskSets(0), Seq(
    (Success, makeMapStatus("hostB", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostC", 3)),
    (Success, makeMapStatus("hostB", 4)),
    (Success, makeMapStatus("hostB", 5)),
    (Success, makeMapStatus("hostC", 6))
  ))
  complete(taskSets(1), Seq(
    (Success, makeMapStatus("hostA", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostA", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostA", 1))
  ))
  runEvent(ExecutorLost("exec-hostA"))
  runEvent(CompletionEvent(taskSets(1).tasks(0), Resubmitted, null, null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(2), Resubmitted, null, null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(0),
    FetchFailed(null, firstShuffleId, -1, 0, "Fetch metadata failed"),
    null, null, null, null))
  scheduler.resubmitFailedStages()
  runEvent(CompletionEvent(taskSets(1).tasks(0), Success, makeMapStatus("hostC", 1), null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(2), Success, makeMapStatus("hostC", 1), null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(4), Success, makeMapStatus("hostC", 1), null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(5), Success, makeMapStatus("hostB", 2), null, null, null))
  val stage = scheduler.stageIdToStage(taskSets(1).stageId)
  assert(stage.attemptId == 2)
  assert(stage.isAvailable)
  assert(stage.pendingTasks.size == 0)
}
{code}
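The fix the issue title proposes, value-based equality on Task, can be sketched as follows. This is a hypothetical illustration keyed on (stageId, partitionId), not the actual patch in the pull request; the real Task class carries much more state.

```scala
// Hypothetical sketch of the proposed fix: key Task equality on
// (stageId, partitionId) so that `pendingTasks -= task` removes the
// logical task even when the retried attempt is a different object.
// This stub Task is illustrative only; Spark's Task class has more state.
class Task(val stageId: Int, val partitionId: Int) {
  override def equals(other: Any): Boolean = other match {
    case t: Task => t.stageId == stageId && t.partitionId == partitionId
    case _       => false
  }
  override def hashCode(): Int = 31 * stageId + partitionId
}

val pendingTasks = scala.collection.mutable.HashSet(new Task(0, 0), new Task(0, 1))
// A retried attempt of partition 1 is a distinct object but compares equal,
// so the remove succeeds where reference equality would have failed:
pendingTasks -= new Task(0, 1)
println(pendingTasks.size) // 1
```

With the default reference equality, the same `-=` would silently leave the stale task in pendingTasks, which is exactly the mismatch the test above exposes.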
[jira] [Commented] (SPARK-5259) Add Task equals() and hashCode() to keep stage.pendingTasks accurate when a stage is retried
[ https://issues.apache.org/jira/browse/SPARK-5259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278222#comment-14278222 ] Apache Spark commented on SPARK-5259: - User 'suyanNone' has created a pull request for this issue: https://github.com/apache/spark/pull/4055
> Add Task equals() and hashCode() to keep stage.pendingTasks accurate
> when a stage is retried
> ---
>
> Key: SPARK-5259
> URL: https://issues.apache.org/jira/browse/SPARK-5259
> Project: Spark
> Issue Type: Bug
>Affects Versions: 1.1.1, 1.2.0
>Reporter: SuYan
> Fix For: 1.2.0
>
>
> 1. When a shuffle map stage is retried, two task sets may be running at the
> same time. Call them taskSet0.0 and taskSet0.1; taskSet0.1 re-runs the tasks
> of taskSet0.0 that have not completed. If taskSet0.0 finishes all the tasks
> that taskSet0.1 has not yet completed, the stage reports itself as available:
> def isAvailable: Boolean = {
>   if (!isShuffleMap) {
>     true
>   } else {
>     numAvailableOutputs == numPartitions
>   }
> }
> However, stage.pendingTasks is still non-empty, which prevents the map
> statuses from being registered in mapOutputTracker. The cause is that when a
> task completes successfully, pendingTasks removes the Task by object
> reference, because Task does not override hashCode() and equals():
> pendingTasks -= task
> whereas numAvailableOutputs is counted by partition ID, so the two sets of
> bookkeeping can disagree.
> Here is a test case that reproduces the problem:
> test("Make sure mapStage.pendingTasks is empty " +
>     "while MapStage.isAvailable is true after the stage was retried") {
>   val firstRDD = new MyRDD(sc, 6, Nil)
>   val firstShuffleDep = new ShuffleDependency(firstRDD, null)
>   val firstShuffleId = firstShuffleDep.shuffleId
>   val shuffleMapRdd = new MyRDD(sc, 6, List(firstShuffleDep))
>   val shuffleDep = new ShuffleDependency(shuffleMapRdd, null)
>   val shuffleId = shuffleDep.shuffleId
>   val reduceRdd = new MyRDD(sc, 2, List(shuffleDep))
>   submit(reduceRdd, Array(0, 1))
>   complete(taskSets(0), Seq(
>     (Success, makeMapStatus("hostB", 1)),
>     (Success, makeMapStatus("hostB", 2)),
>     (Success, makeMapStatus("hostC", 3)),
>     (Success, makeMapStatus("hostB", 4)),
>     (Success, makeMapStatus("hostB", 5)),
>     (Success, makeMapStatus("hostC", 6))
>   ))
>   complete(taskSets(1), Seq(
>     (Success, makeMapStatus("hostA", 1)),
>     (Success, makeMapStatus("hostB", 2)),
>     (Success, makeMapStatus("hostA", 1)),
>     (Success, makeMapStatus("hostB", 2)),
>     (Success, makeMapStatus("hostA", 1))
>   ))
>   runEvent(ExecutorLost("exec-hostA"))
>   runEvent(CompletionEvent(taskSets(1).tasks(0), Resubmitted, null, null, null, null))
>   runEvent(CompletionEvent(taskSets(1).tasks(2), Resubmitted, null, null, null, null))
>   runEvent(CompletionEvent(taskSets(1).tasks(0),
>     FetchFailed(null, firstShuffleId, -1, 0, "Fetch metadata failed"),
>     null, null, null, null))
>   scheduler.resubmitFailedStages()
>   runEvent(CompletionEvent(taskSets(1).tasks(0), Success, makeMapStatus("hostC", 1), null, null, null))
>   runEvent(CompletionEvent(taskSets(1).tasks(2), Success, makeMapStatus("hostC", 1), null, null, null))
>   runEvent(CompletionEvent(taskSets(1).tasks(4), Success, makeMapStatus("hostC", 1), null, null, null))
>   runEvent(CompletionEvent(taskSets(1).tasks(5), Success, makeMapStatus("hostB", 2), null, null, null))
>   val stage = scheduler.stageIdToStage(taskSets(1).stageId)
>   assert(stage.attemptId == 2)
>   assert(stage.isAvailable)
>   assert(stage.pendingTasks.size == 0)
> }
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5259) Add Task equals() and hashCode() to keep stage.pendingTasks accurate when a stage is retried
SuYan created SPARK-5259: Summary: Add Task equals() and hashCode() to keep stage.pendingTasks accurate when a stage is retried Key: SPARK-5259 URL: https://issues.apache.org/jira/browse/SPARK-5259 Project: Spark Issue Type: Bug Affects Versions: 1.2.0, 1.1.1 Reporter: SuYan Fix For: 1.2.0
1. When a shuffle map stage is retried, two task sets may be running at the same time. Call them taskSet0.0 and taskSet0.1; taskSet0.1 re-runs the tasks of taskSet0.0 that have not completed. If taskSet0.0 finishes all the tasks that taskSet0.1 has not yet completed, the stage reports itself as available:
def isAvailable: Boolean = {
  if (!isShuffleMap) {
    true
  } else {
    numAvailableOutputs == numPartitions
  }
}
However, stage.pendingTasks is still non-empty, which prevents the map statuses from being registered in mapOutputTracker. The cause is that when a task completes successfully, pendingTasks removes the Task by object reference, because Task does not override hashCode() and equals():
pendingTasks -= task
whereas numAvailableOutputs is counted by partition ID, so the two sets of bookkeeping can disagree. Here is a test case that reproduces the problem:
test("Make sure mapStage.pendingTasks is empty " +
    "while MapStage.isAvailable is true after the stage was retried") {
  val firstRDD = new MyRDD(sc, 6, Nil)
  val firstShuffleDep = new ShuffleDependency(firstRDD, null)
  val firstShuffleId = firstShuffleDep.shuffleId
  val shuffleMapRdd = new MyRDD(sc, 6, List(firstShuffleDep))
  val shuffleDep = new ShuffleDependency(shuffleMapRdd, null)
  val shuffleId = shuffleDep.shuffleId
  val reduceRdd = new MyRDD(sc, 2, List(shuffleDep))
  submit(reduceRdd, Array(0, 1))
  complete(taskSets(0), Seq(
    (Success, makeMapStatus("hostB", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostC", 3)),
    (Success, makeMapStatus("hostB", 4)),
    (Success, makeMapStatus("hostB", 5)),
    (Success, makeMapStatus("hostC", 6))
  ))
  complete(taskSets(1), Seq(
    (Success, makeMapStatus("hostA", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostA", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostA", 1))
  ))
  runEvent(ExecutorLost("exec-hostA"))
  runEvent(CompletionEvent(taskSets(1).tasks(0), Resubmitted, null, null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(2), Resubmitted, null, null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(0),
    FetchFailed(null, firstShuffleId, -1, 0, "Fetch metadata failed"),
    null, null, null, null))
  scheduler.resubmitFailedStages()
  runEvent(CompletionEvent(taskSets(1).tasks(0), Success, makeMapStatus("hostC", 1), null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(2), Success, makeMapStatus("hostC", 1), null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(4), Success, makeMapStatus("hostC", 1), null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(5), Success, makeMapStatus("hostB", 2), null, null, null))
  val stage = scheduler.stageIdToStage(taskSets(1).stageId)
  assert(stage.attemptId == 2)
  assert(stage.isAvailable)
  assert(stage.pendingTasks.size == 0)
}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5193) Make Spark SQL API usable in Java and remove the Java-specific API
[ https://issues.apache.org/jira/browse/SPARK-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278163#comment-14278163 ] Apache Spark commented on SPARK-5193: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4054 > Make Spark SQL API usable in Java and remove the Java-specific API > -- > > Key: SPARK-5193 > URL: https://issues.apache.org/jira/browse/SPARK-5193 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > The Java version of the SchemaRDD API causes a high maintenance burden for Spark > SQL itself and downstream libraries (e.g. the MLlib pipeline API needs to support > both JavaSchemaRDD and SchemaRDD). We can audit the Scala API, make it > usable from Java, and then remove the Java-specific version. > Things to remove include (the Java version of): > - data type > - Row > - SQLContext > - HiveContext > Things to consider: > - Scala and Java have different collection libraries. > - Scala and Java (8) have different closure interfaces. > - Scala and Java can have duplicate definitions of common classes, such as > BigDecimal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
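The duplicate-definition point can be seen directly in a Scala REPL: Scala's BigDecimal is a wrapper over Java's, so a unified Scala/Java API has to pick one representation or convert at the boundary. A minimal illustration:

```scala
// Scala and Java each define a BigDecimal class. scala.math.BigDecimal
// wraps java.math.BigDecimal, so an API shared between the two languages
// must either standardize on one type or convert at the boundary.
val scalaBd = scala.math.BigDecimal("1.5")
val javaBd  = new java.math.BigDecimal("1.5")

// The Scala wrapper exposes the underlying Java value, which compares
// equal to a directly constructed java.math.BigDecimal:
println(scalaBd.underlying == javaBd) // true: same unscaled value and scale
```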
[jira] [Commented] (SPARK-5254) Update the user guide to make clear that spark.mllib is not being deprecated
[ https://issues.apache.org/jira/browse/SPARK-5254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278160#comment-14278160 ] Apache Spark commented on SPARK-5254: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4053 > Update the user guide to make clear that spark.mllib is not being deprecated > > > Key: SPARK-5254 > URL: https://issues.apache.org/jira/browse/SPARK-5254 > Project: Spark > Issue Type: Documentation >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.3.0, 1.2.1 > > > The current statement in the user guide may deliver a confusing message to > users. spark.ml contains high-level APIs for building ML pipelines, but that > doesn't mean spark.mllib is being deprecated. > First of all, the pipeline API is in its alpha stage and we need to see more > use cases from the community to stabilize it, which may take several > releases. Secondly, the components in spark.ml are simple wrappers over > spark.mllib implementations. Neither the APIs nor the implementations from > spark.mllib are being deprecated. We expect users to use the spark.ml pipeline > APIs to build their ML pipelines, but we will keep supporting and adding > features to spark.mllib. For example, there are many features in review at > https://spark-prs.appspot.com/#mllib. So users should be comfortable with > using spark.mllib features and expect more to come. The user guide needs to > be updated to make this message clear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5147) write ahead logs from streaming receiver are not purged because cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called
[ https://issues.apache.org/jira/browse/SPARK-5147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278161#comment-14278161 ] Saisai Shao commented on SPARK-5147: 1. Currently, the check for whether to delete the WAL happens in clearMetadata(), which runs at checkpoint time, after the batch has finished. 2. This is actually the point I was thinking of, and it can possibly be improved. By throughput I mean that 2 copies in the BlockManager plus 3 copies in HDFS occupy more network bandwidth than 1 copy in the BlockManager plus 3 copies in HDFS; I am not talking about response time. Yes, HDFS replication is much slower than BlockManager replication, but replicating to the BlockManager and HDFS concurrently increases network bandwidth contention and lowers the throughput of the whole system. > write ahead logs from streaming receiver are not purged because > cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called > -- > > Key: SPARK-5147 > URL: https://issues.apache.org/jira/browse/SPARK-5147 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Affects Versions: 1.2.0 >Reporter: Max Xu >Priority: Blocker > > Hi all, > We are running a Spark streaming application with ReliableKafkaReceiver. We > have "spark.streaming.receiver.writeAheadLog.enable" set to true, so write > ahead logs (WALs) for received data are created under the receivedData/streamId > folder in the checkpoint directory. > However, old WALs are never purged by time; receivedBlockMetadata and > checkpoint files are purged correctly, though. I went through the code: the > WriteAheadLogBasedBlockHandler class in ReceivedBlockHandler.scala is > responsible for cleaning up the old blocks. It has the method cleanupOldBlocks, > which is never called by any class. The ReceiverSupervisorImpl class holds a > WriteAheadLogBasedBlockHandler instance; however, it only calls the storeBlock > method to create WALs and never calls cleanupOldBlocks to purge old WALs. > The size of the WAL folder increases constantly on HDFS. This is preventing > us from running the ReliableKafkaReceiver 24x7. Can somebody please take a > look? > Thanks, > Max -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5258) Clean up exposed classes in sql.hive package
Reynold Xin created SPARK-5258: -- Summary: Clean up exposed classes in sql.hive package Key: SPARK-5258 URL: https://issues.apache.org/jira/browse/SPARK-5258 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5257) SparseVector indices must be non-negative
Derrick Burns created SPARK-5257: Summary: SparseVector indices must be non-negative Key: SPARK-5257 URL: https://issues.apache.org/jira/browse/SPARK-5257 Project: Spark Issue Type: Documentation Components: MLlib Affects Versions: 1.2.0 Reporter: Derrick Burns Priority: Minor The description of SparseVector suggests only that the indices must be distinct integers. However, the code for the constructor that takes an array of (index, value) tuples assumes that the indices are non-negative. Either the code or the description should be changed. This arose when I generated indices via hashing and converted the hash values to (signed) integers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
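A common workaround for the hashing scenario described above is to fold the signed hash code into the valid index range before constructing the vector. The helper below is an illustrative sketch, not part of MLlib's API:

```scala
// Hypothetical helper for feature hashing: map a (possibly negative)
// hash code into [0, numFeatures), the non-negative index range that
// SparseVector's constructor assumes. floorMod, unlike %, never
// returns a negative result for a positive modulus.
def hashToIndex(feature: String, numFeatures: Int): Int =
  java.lang.Math.floorMod(feature.hashCode, numFeatures)

val idx = hashToIndex("some-feature", 1 << 20)
println(idx >= 0 && idx < (1 << 20)) // true
```

Note that `hashCode % numFeatures` alone would not suffice, since `%` in Scala (as in Java) can return negative values for negative operands.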
[jira] [Resolved] (SPARK-5254) Update the user guide to make clear that spark.mllib is not being deprecated
[ https://issues.apache.org/jira/browse/SPARK-5254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5254. -- Resolution: Fixed Fix Version/s: 1.2.1 1.3.0 Issue resolved by pull request 4052 [https://github.com/apache/spark/pull/4052] > Update the user guide to make clear that spark.mllib is not being deprecated > > > Key: SPARK-5254 > URL: https://issues.apache.org/jira/browse/SPARK-5254 > Project: Spark > Issue Type: Documentation >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.3.0, 1.2.1 > > > The current statement in the user guide may deliver a confusing message to > users. spark.ml contains high-level APIs for building ML pipelines, but that > doesn't mean spark.mllib is being deprecated. > First of all, the pipeline API is in its alpha stage and we need to see more > use cases from the community to stabilize it, which may take several > releases. Secondly, the components in spark.ml are simple wrappers over > spark.mllib implementations. Neither the APIs nor the implementations from > spark.mllib are being deprecated. We expect users to use the spark.ml pipeline > APIs to build their ML pipelines, but we will keep supporting and adding > features to spark.mllib. For example, there are many features in review at > https://spark-prs.appspot.com/#mllib. So users should be comfortable with > using spark.mllib features and expect more to come. The user guide needs to > be updated to make this message clear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277988#comment-14277988 ] Alexander Ulanov commented on SPARK-5256: - Also, asynchronous gradient update might be a good thing to have. > Improving MLlib optimization APIs > - > > Key: SPARK-5256 > URL: https://issues.apache.org/jira/browse/SPARK-5256 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > *Goal*: Improve APIs for optimization > *Motivation*: There have been several disjoint mentions of improving the > optimization APIs to make them more pluggable, extensible, etc. This JIRA is > a place to discuss what API changes are necessary for the long term, and to > provide links to other relevant JIRAs. > Eventually, I hope this leads to a design doc outlining: > * current issues > * requirements such as supporting many types of objective functions, > optimization algorithms, and parameters to those algorithms > * ideal API > * breakdown of smaller JIRAs needed to achieve that API > I will soon create an initial design doc, and I will try to watch this JIRA > and include ideas from JIRA comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277986#comment-14277986 ] Alexander Ulanov commented on SPARK-5256: - I would like to improve the Gradient interface so that it can process something more general than `Label` (which is relevant only to classifiers, not to other machine learning methods) and also allow batch processing. The simplest way I see of doing this is to add another function to the `Gradient` interface: def compute(data: Vector, output: Vector, weights: Vector, cumGradient: Vector): Double In the `Gradient` trait, the existing label-based `compute` should delegate to this one, passing `label` as the output. Of course, one needs to make some adjustments to the LBFGS and GradientDescent optimizers, replacing label: Double with output: Vector. For batch processing, one can stack the data and output points into one long vector (matrices are stored this way in breeze) and pass them through the proposed interface. > Improving MLlib optimization APIs > - > > Key: SPARK-5256 > URL: https://issues.apache.org/jira/browse/SPARK-5256 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > *Goal*: Improve APIs for optimization > *Motivation*: There have been several disjoint mentions of improving the > optimization APIs to make them more pluggable, extensible, etc. This JIRA is > a place to discuss what API changes are necessary for the long term, and to > provide links to other relevant JIRAs. > Eventually, I hope this leads to a design doc outlining: > * current issues > * requirements such as supporting many types of objective functions, > optimization algorithms, and parameters to those algorithms > * ideal API > * breakdown of smaller JIRAs needed to achieve that API > I will soon create an initial design doc, and I will try to watch this JIRA > and include ideas from JIRA comments. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
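The proposal above can be sketched as a trait in which the existing label-based compute delegates to the more general vector-output form. Array[Double] stands in for MLlib's Vector type to keep the sketch self-contained; this illustrates the proposed signature only and is not MLlib code.

```scala
// Sketch of the proposed Gradient interface: a general compute() taking a
// vector-valued output, with the single-label form delegating to it.
// Array[Double] replaces MLlib's Vector so the example is self-contained.
trait Gradient {
  def compute(data: Array[Double], output: Array[Double],
              weights: Array[Double], cumGradient: Array[Double]): Double

  // Existing single-label form, expressed through the general one.
  def compute(data: Array[Double], label: Double,
              weights: Array[Double], cumGradient: Array[Double]): Double =
    compute(data, Array(label), weights, cumGradient)
}

// Example instance: squared loss for a linear prediction.
object SquaredLossGradient extends Gradient {
  def compute(data: Array[Double], output: Array[Double],
              weights: Array[Double], cumGradient: Array[Double]): Double = {
    val pred = data.zip(weights).map { case (x, w) => x * w }.sum
    val err  = pred - output(0)
    // Accumulate the gradient of 0.5 * err^2 with respect to the weights.
    for (i <- data.indices) cumGradient(i) += err * data(i)
    0.5 * err * err
  }
}

// With weights (1, 1) and data (1, 2), the prediction is 3, so the loss
// against label 3.0 is zero:
println(SquaredLossGradient.compute(Array(1.0, 2.0), 3.0, Array(1.0, 1.0), new Array[Double](2)))
```

The batch case would stack several (data, output) pairs into the two vector arguments, as the comment suggests for breeze's column-major storage.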
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277932#comment-14277932 ] Xiangrui Meng commented on SPARK-5226: -- [~angellandros] Before you start coding, could you describe the computation and communication cost of a parallel implementation of DBSCAN as well as the proposed APIs (parameters, methods)? It would help us decide whether to include it in MLlib. > Add DBSCAN Clustering Algorithm to MLlib > > > Key: SPARK-5226 > URL: https://issues.apache.org/jira/browse/SPARK-5226 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Muhammad-Ali A'rabi >Priority: Minor > Labels: DBSCAN > > MLlib is all k-means now, and I think we should add some new clustering > algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277933#comment-14277933 ] Joseph K. Bradley commented on SPARK-4894: -- +1 for small changes, but occasionally larger ones are needed. The jump to the spark.ml package might be a good time to make such a switch. We don't want to try to approach the complexity of Factorie, but it should not be too hard to support continuous variables. Gaussian NB would surely be the first generalization to add. I agree we should think about the API carefully to encourage people to use NB correctly. The items to decide are: * Should we only support discrete variables? If so, then there is nothing to discuss. * If we support continuous variables (and I would argue we should), then we must design a good API. No matter what API we design, people may misuse it by treating a categorical feature as continuous. Documentation is necessary and helpful, but a good API is important too. That API could take several forms: ** 1 NaiveBayes class, with a Factor type parameter ** multiple NaiveBayes classes for different variable types and/or distributions ** a better way of specifying types, such as enumerations for discrete variables Basically, to support continuous features, we will need to think about types, and I agree it should be thought out so it does not become too complex. > Add Bernoulli-variant of Naive Bayes > > > Key: SPARK-4894 > URL: https://issues.apache.org/jira/browse/SPARK-4894 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.2.0 >Reporter: RJ Nowling >Assignee: RJ Nowling > > MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli > version of Naive Bayes is more useful for situations where the features are > binary values. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
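One way to picture the "enumerations for discrete variables" option discussed above is a sealed feature-type ADT that a NaiveBayes-style API could accept per column, making the categorical/continuous choice explicit in the types. This is a design sketch only, with hypothetical names, not an actual MLlib API:

```scala
// Hypothetical sketch of per-feature type declarations for a Naive Bayes
// API: the type tag determines which likelihood family the model fits,
// making misuse (treating a categorical feature as continuous) visible.
sealed trait FeatureType
case object Categorical extends FeatureType // multinomial/Bernoulli counts
case object Continuous  extends FeatureType // e.g. Gaussian NB

def likelihoodFamily(t: FeatureType): String = t match {
  case Categorical => "multinomial"
  case Continuous  => "gaussian"
}

println(likelihoodFamily(Continuous)) // gaussian
```

A model constructor could then take a `Seq[FeatureType]`, one entry per feature, instead of assuming all features are discrete.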
[jira] [Updated] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5226: - Affects Version/s: (was: 1.2.0) > Add DBSCAN Clustering Algorithm to MLlib > > > Key: SPARK-5226 > URL: https://issues.apache.org/jira/browse/SPARK-5226 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Muhammad-Ali A'rabi >Priority: Minor > Labels: DBSCAN > > MLlib is all k-means now, and I think we should add some new clustering > algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5226: - Target Version/s: (was: 1.2.0) > Add DBSCAN Clustering Algorithm to MLlib > > > Key: SPARK-5226 > URL: https://issues.apache.org/jira/browse/SPARK-5226 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Muhammad-Ali A'rabi >Priority: Minor > Labels: DBSCAN > > MLlib is all k-means now, and I think we should add some new clustering > algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-3655: -- Priority: Major (was: Minor) > Support sorting of values in addition to keys (i.e. secondary sort) > --- > > Key: SPARK-3655 > URL: https://issues.apache.org/jira/browse/SPARK-3655 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: koert kuipers >Assignee: Koert Kuipers > > Now that spark has a sort based shuffle, can we expect a secondary sort soon? > There are some use cases where getting a sorted iterator of values per key is > helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5256: - Comment: was deleted (was: Generalization: grouped optimization) > Improving MLlib optimization APIs > - > > Key: SPARK-5256 > URL: https://issues.apache.org/jira/browse/SPARK-5256 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > *Goal*: Improve APIs for optimization > *Motivation*: There have been several disjoint mentions of improving the > optimization APIs to make them more pluggable, extensible, etc. This JIRA is > a place to discuss what API changes are necessary for the long term, and to > provide links to other relevant JIRAs. > Eventually, I hope this leads to a design doc outlining: > * current issues > * requirements such as supporting many types of objective functions, > optimization algorithms, and parameters to those algorithms > * ideal API > * breakdown of smaller JIRAs needed to achieve that API > I will soon create an initial design doc, and I will try to watch this JIRA > and include ideas from JIRA comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5256: - Comment: was deleted (was: Generalization: grouped optimization) > Improving MLlib optimization APIs > - > > Key: SPARK-5256 > URL: https://issues.apache.org/jira/browse/SPARK-5256 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > *Goal*: Improve APIs for optimization > *Motivation*: There have been several disjoint mentions of improving the > optimization APIs to make them more pluggable, extensible, etc. This JIRA is > a place to discuss what API changes are necessary for the long term, and to > provide links to other relevant JIRAs. > Eventually, I hope this leads to a design doc outlining: > * current issues > * requirements such as supporting many types of objective functions, > optimization algorithms, and parameters to those algorithms > * ideal API > * breakdown of smaller JIRAs needed to achieve that API > I will soon create an initial design doc, and I will try to watch this JIRA > and include ideas from JIRA comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5256: - Comment: was deleted (was: Improving "Updater" concept) > Improving MLlib optimization APIs > - > > Key: SPARK-5256 > URL: https://issues.apache.org/jira/browse/SPARK-5256 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > *Goal*: Improve APIs for optimization > *Motivation*: There have been several disjoint mentions of improving the > optimization APIs to make them more pluggable, extensible, etc. This JIRA is > a place to discuss what API changes are necessary for the long term, and to > provide links to other relevant JIRAs. > Eventually, I hope this leads to a design doc outlining: > * current issues > * requirements such as supporting many types of objective functions, > optimization algorithms, and parameters to those algorithms > * ideal API > * breakdown of smaller JIRAs needed to achieve that API > I will soon create an initial design doc, and I will try to watch this JIRA > and include ideas from JIRA comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277861#comment-14277861 ] Joseph K. Bradley commented on SPARK-5256: -- Generalization: grouped optimization > Improving MLlib optimization APIs > - > > Key: SPARK-5256 > URL: https://issues.apache.org/jira/browse/SPARK-5256 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > *Goal*: Improve APIs for optimization > *Motivation*: There have been several disjoint mentions of improving the > optimization APIs to make them more pluggable, extensible, etc. This JIRA is > a place to discuss what API changes are necessary for the long term, and to > provide links to other relevant JIRAs. > Eventually, I hope this leads to a design doc outlining: > * current issues > * requirements such as supporting many types of objective functions, > optimization algorithms, and parameters to those algorithms > * ideal API > * breakdown of smaller JIRAs needed to achieve that API > I will soon create an initial design doc, and I will try to watch this JIRA > and include ideas from JIRA comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277858#comment-14277858 ] Joseph K. Bradley commented on SPARK-5256:
Generalization: grouped optimization
[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277857#comment-14277857 ] Joseph K. Bradley commented on SPARK-5256:
Improving "Updater" concept
[jira] [Created] (SPARK-5256) Improving MLlib optimization APIs
Joseph K. Bradley created SPARK-5256:
Summary: Improving MLlib optimization APIs
Key: SPARK-5256
URL: https://issues.apache.org/jira/browse/SPARK-5256
Project: Spark
Issue Type: Umbrella
Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

*Goal*: Improve APIs for optimization
*Motivation*: There have been several disjoint mentions of improving the optimization APIs to make them more pluggable, extensible, etc. This JIRA is a place to discuss what API changes are necessary for the long term, and to provide links to other relevant JIRAs.
Eventually, I hope this leads to a design doc outlining:
* current issues
* requirements such as supporting many types of objective functions, optimization algorithms, and parameters to those algorithms
* ideal API
* breakdown of smaller JIRAs needed to achieve that API
I will soon create an initial design doc, and I will try to watch this JIRA and include ideas from JIRA comments.
[jira] [Commented] (SPARK-4906) Spark master OOMs with exception stack trace stored in JobProgressListener
[ https://issues.apache.org/jira/browse/SPARK-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277849#comment-14277849 ] Mingyu Kim commented on SPARK-4906:
"typically once a few tasks have failed the stage will fail": Is that actually true? Doesn't a stage with N tasks run until all N tasks run (and potentially fail)? We have datasets of thousands of partitions each, so that can easily add up to 50k failed tasks over time. We can certainly allocate more memory, but the concern is that for long-running SparkContexts this memory requirement grows linearly over time, assuming failures happen at a fixed frequency. So we will run into OOMs no matter how large the heap is, as long as enough failures accumulate over time.

> Spark master OOMs with exception stack trace stored in JobProgressListener
> Key: SPARK-4906
> URL: https://issues.apache.org/jira/browse/SPARK-4906
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 1.1.1
> Reporter: Mingyu Kim
>
> Spark master was OOMing with a lot of stack traces retained in JobProgressListener. The object dependency goes like the following: JobProgressListener.stageIdToData => StageUIData.taskData => TaskUIData.errorMessage
> Each error message is ~10kb since it has the entire stack trace. As we have a lot of tasks, when all of the tasks across multiple stages go bad, these error messages accounted for 0.5GB of heap at some point.
> Please correct me if I'm wrong, but it looks like all the task info for running applications is kept in memory, which means it's almost always bound to OOM for long-running applications. Would it make sense to fix this, for example, by spilling some UI state to disk?
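The linear-growth argument above is easy to check numerically. A minimal sketch (the ~10 KB message size and 50k failure count are taken from the report; the helper name is illustrative, not Spark code):

```python
# Heap retained by TaskUIData.errorMessage grows linearly with the number of
# failed tasks, since each retained error message holds a full stack trace
# (~10 KB per message in the report).
def retained_error_bytes(failed_tasks, bytes_per_message=10 * 1024):
    """Approximate heap consumed by retained stack traces."""
    return failed_tasks * bytes_per_message

# 50,000 failed tasks at ~10 KB each is roughly 0.5 GB, matching the report.
# At a fixed failure rate this grows without bound over the lifetime of a
# long-running SparkContext, regardless of heap size.
gb = retained_error_bytes(50_000) / (1024 ** 3)
```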
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1420#comment-1420 ] RJ Nowling commented on SPARK-4894:
Hi [~josephkb], lots to think about! In general, I'm a big fan of multiple small changes over time rather than one big change. They're easier to verify and review. Since MLlib is going through an interface refactoring to become ML anyway, we can focus on the Bernoulli NB change now and worry about a redesign of the API later.
What do you have in mind for other feature and label types? I briefly reviewed Factorie -- their concept of Factors may be overcomplicated for Naive Bayes, but I want to learn more about your ideas. Do you have a few concrete examples of how Factors could be used with NB? And for continuous labels, are you thinking of something like the Gaussian NB in sklearn?
From bioinformatics, I know that folks tend to encode categorical variables incorrectly. E.g., for a DNA sequence consisting of A, T, C, G, and possibly gaps, each position in a sequence should be encoded as four (five) features, one for each nucleotide. When folks try to represent each position as one feature with the bases as numbers (A=1, T=2, etc.), this results in incorrect distance metrics. E.g., ATT will differ from TTT by 1 but ATT will differ from CTT by 2. By using one feature for each of the four (five) possibilities, you get correct distances and can even weight mutations and deletions using BLOSUM matrices and such. For this type of case, I think the solution is education and documentation, not complicated type systems.

> Add Bernoulli-variant of Naive Bayes
> Key: SPARK-4894
> URL: https://issues.apache.org/jira/browse/SPARK-4894
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Affects Versions: 1.2.0
> Reporter: RJ Nowling
> Assignee: RJ Nowling
>
> MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli version of Naive Bayes is more useful for situations where the features are binary values.
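The encoding point in the comment above can be made concrete with a small sketch (pure Python; the dictionary and function names are illustrative). Integer coding makes some single-base substitutions look twice as far apart as others, while one-hot coding scores them all equally:

```python
# Integer coding collapses the four bases onto one numeric axis.
INT_CODE = {"A": 1, "T": 2, "C": 3, "G": 4}

# One-hot coding: one feature per base, as the comment suggests.
ONE_HOT = {"A": (1, 0, 0, 0), "T": (0, 1, 0, 0),
           "C": (0, 0, 1, 0), "G": (0, 0, 0, 1)}

def l1_int(a, b):
    """L1 distance between two sequences under integer coding."""
    return sum(abs(INT_CODE[x] - INT_CODE[y]) for x, y in zip(a, b))

def l1_one_hot(a, b):
    """L1 distance between two sequences under one-hot coding."""
    return sum(abs(p - q)
               for x, y in zip(a, b)
               for p, q in zip(ONE_HOT[x], ONE_HOT[y]))

# ATT vs TTT and ATT vs CTT each differ at exactly one position, yet
# integer coding gives distances 1 and 2; one-hot gives 2 for both.
```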
[jira] [Commented] (SPARK-5254) Update the user guide to make clear that spark.mllib is not being deprecated
[ https://issues.apache.org/jira/browse/SPARK-5254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277768#comment-14277768 ] Apache Spark commented on SPARK-5254:
User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4052

> Update the user guide to make clear that spark.mllib is not being deprecated
> Key: SPARK-5254
> URL: https://issues.apache.org/jira/browse/SPARK-5254
> Project: Spark
> Issue Type: Documentation
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
>
> The current statement in the user guide may deliver confusing messages to users. spark.ml contains high-level APIs for building ML pipelines, but that doesn't mean spark.mllib is being deprecated.
> First of all, the pipeline API is in its alpha stage, and we need to see more use cases from the community to stabilize it, which may take several releases. Secondly, the components in spark.ml are simple wrappers over spark.mllib implementations. Neither the APIs nor the implementations from spark.mllib are being deprecated. We expect users to use spark.ml pipeline APIs to build their ML pipelines, but we will keep supporting and adding features to spark.mllib. For example, there are many features in review at https://spark-prs.appspot.com/#mllib. So users should be comfortable using spark.mllib features and expect more to come. The user guide needs to be updated to make this message clear.
[jira] [Updated] (SPARK-5255) Use python doc "note" for experimental tags in tree.py
[ https://issues.apache.org/jira/browse/SPARK-5255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5255:
Description:
spark/python/pyspark/mllib/tree.py currently has several "EXPERIMENTAL" tags in the docs. Those should use:
{code}
.. note:: Experimental
{code}
(as in, e.g., feature.py)

was:
spark/python/pyspark/mllib/tree.py currently has several "EXPERIMENTAL" tags in the docs. Those should use:
```
.. note:: Experimental
```
(as in, e.g., feature.py)
[jira] [Created] (SPARK-5255) Use python doc "note" for experimental tags in tree.py
Joseph K. Bradley created SPARK-5255:
Summary: Use python doc "note" for experimental tags in tree.py
Key: SPARK-5255
URL: https://issues.apache.org/jira/browse/SPARK-5255
Project: Spark
Issue Type: Improvement
Components: MLlib, PySpark
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Trivial

spark/python/pyspark/mllib/tree.py currently has several "EXPERIMENTAL" tags in the docs. Those should use:
```
.. note:: Experimental
```
(as in, e.g., feature.py)
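For reference, this is how the directive sits inside a Python docstring (the function below is a made-up example, not MLlib code). Sphinx renders `.. note::` as a highlighted admonition box, whereas a bare "EXPERIMENTAL" string is shown as plain text:

```python
def train_model(data):
    """Train a model on the given data (illustrative only).

    .. note:: Experimental

    The directive above is the feature.py-style replacement for a plain
    "EXPERIMENTAL" marker in the docstring text.
    """
    return data
```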
[jira] [Commented] (SPARK-4585) Spark dynamic executor allocation shouldn't use maxExecutors as initial number
[ https://issues.apache.org/jira/browse/SPARK-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277716#comment-14277716 ] Apache Spark commented on SPARK-4585:
User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/4051

> Spark dynamic executor allocation shouldn't use maxExecutors as initial number
> Key: SPARK-4585
> URL: https://issues.apache.org/jira/browse/SPARK-4585
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, YARN
> Affects Versions: 1.1.0
> Reporter: Chengxiang Li
>
> With SPARK-3174, one can configure a minimum and maximum number of executors for a Spark application on Yarn. However, the application always starts with the maximum. It seems more reasonable, at least for Hive on Spark, to start from the minimum and scale up as needed up to the maximum.
[jira] [Created] (SPARK-5254) Update the user guide to make clear that spark.mllib is not being deprecated
Xiangrui Meng created SPARK-5254:
Summary: Update the user guide to make clear that spark.mllib is not being deprecated
Key: SPARK-5254
URL: https://issues.apache.org/jira/browse/SPARK-5254
Project: Spark
Issue Type: Documentation
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

The current statement in the user guide may deliver confusing messages to users. spark.ml contains high-level APIs for building ML pipelines, but that doesn't mean spark.mllib is being deprecated.
First of all, the pipeline API is in its alpha stage, and we need to see more use cases from the community to stabilize it, which may take several releases. Secondly, the components in spark.ml are simple wrappers over spark.mllib implementations. Neither the APIs nor the implementations from spark.mllib are being deprecated. We expect users to use spark.ml pipeline APIs to build their ML pipelines, but we will keep supporting and adding features to spark.mllib. For example, there are many features in review at https://spark-prs.appspot.com/#mllib. So users should be comfortable using spark.mllib features and expect more to come. The user guide needs to be updated to make this message clear.
[jira] [Updated] (SPARK-3726) RandomForest: Support for bootstrap options
[ https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3726:
Assignee: Manoj Kumar

> RandomForest: Support for bootstrap options
> Key: SPARK-3726
> URL: https://issues.apache.org/jira/browse/SPARK-3726
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Joseph K. Bradley
> Assignee: Manoj Kumar
> Priority: Minor
>
> RandomForest uses BaggedPoint to simulate bootstrapped samples of the data. The expected size of each sample is the same as the original data (sampling rate = 1.0), and sampling is done with replacement. Adding support for other sampling rates and for sampling without replacement would be useful.
[jira] [Commented] (SPARK-5199) Input metrics should show up for InputFormats that return CombineFileSplits
[ https://issues.apache.org/jira/browse/SPARK-5199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277704#comment-14277704 ] Apache Spark commented on SPARK-5199:
User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/4050

> Input metrics should show up for InputFormats that return CombineFileSplits
> Key: SPARK-5199
> URL: https://issues.apache.org/jira/browse/SPARK-5199
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Sandy Ryza
> Assignee: Sandy Ryza
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277702#comment-14277702 ] Joseph K. Bradley commented on SPARK-4894:
[~rnowling] Thanks for looking into this issue! I was thinking about 2 possibilities for generalizing NaiveBayes:
* Specify the model type with simple strings, and keep the current API.
** This is simpler and maintains API stability.
* Generalize the model to allow other feature and label types.
** This is what we really should do long-term, since NaiveBayes should not be limited to discrete labels.
** This would require more work, including defining a Factor concept (using the terminology from Probabilistic Graphical Models) and updating the NaiveBayes API to use factors. Different factors would handle discrete and/or continuous variables and would encode different types of distributions. This setup is common in graphical model libraries like [Factorie|http://factorie.cs.umass.edu/index.html].
** Alternative: We could separate NaiveBayes into 2 classes based on discrete and continuous labels, but that might be even more work in the long term (to maintain 2 copies of the API).
What do you think?
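The first option above (a string-valued model type on the existing entry point) could look roughly like this. This is a hedged sketch only: the class and parameter names are invented for illustration and are not MLlib's actual API.

```python
class NaiveBayes:
    """Sketch of selecting the Naive Bayes variant with a plain string."""

    SUPPORTED_TYPES = ("multinomial", "bernoulli")

    def __init__(self, model_type="multinomial"):
        # A plain string keeps the public API stable while allowing new
        # variants to be added later without new classes.
        if model_type not in self.SUPPORTED_TYPES:
            raise ValueError("unknown model type: %s" % model_type)
        self.model_type = model_type
```

The Factor-based generalization discussed above would instead replace the string with pluggable per-feature distributions, at the cost of a much larger API surface.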
[jira] [Created] (SPARK-5253) LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package
DB Tsai created SPARK-5253:
Summary: LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package
Key: SPARK-5253
URL: https://issues.apache.org/jira/browse/SPARK-5253
Project: Spark
Issue Type: New Feature
Components: ML
Reporter: DB Tsai
[jira] [Commented] (SPARK-5193) Make Spark SQL API usable in Java and remove the Java-specific API
[ https://issues.apache.org/jira/browse/SPARK-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277656#comment-14277656 ] Apache Spark commented on SPARK-5193:
User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4049

> Make Spark SQL API usable in Java and remove the Java-specific API
> Key: SPARK-5193
> URL: https://issues.apache.org/jira/browse/SPARK-5193
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Reynold Xin
>
> Java version of the SchemaRDD API causes high maintenance burden for Spark SQL itself and downstream libraries (e.g. MLlib pipeline API needs to support both JavaSchemaRDD and SchemaRDD). We can audit the Scala API and make it usable for Java, and then we can remove the Java-specific version.
> Things to remove include (Java version of):
> - data type
> - Row
> - SQLContext
> - HiveContext
> Things to consider:
> - Scala and Java have different collection libraries.
> - Scala and Java (8) have different closure interfaces.
> - Scala and Java can have duplicate definitions of common classes, such as BigDecimal.
[jira] [Commented] (SPARK-4746) integration tests should be separated from faster unit tests
[ https://issues.apache.org/jira/browse/SPARK-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277651#comment-14277651 ] Imran Rashid commented on SPARK-4746:
btw, if anybody else wants to futz around with the times of the tests, here is my hacky script to parse the sbt output to get the test times; it requires a bit of manual filtering of the sbt output as well: https://gist.github.com/squito/54afab78f41dd11312ad

> integration tests should be separated from faster unit tests
> Key: SPARK-4746
> URL: https://issues.apache.org/jira/browse/SPARK-4746
> Project: Spark
> Issue Type: Bug
> Reporter: Imran Rashid
> Priority: Trivial
>
> Currently there isn't a good way for a developer to skip the longer integration tests. This can slow down local development. See http://apache-spark-developers-list.1001551.n3.nabble.com/Spurious-test-failures-testing-best-practices-td9560.html
> One option is to use scalatest's notion of test tags to tag all integration tests, so they could easily be skipped
[jira] [Commented] (SPARK-4746) integration tests should be separated from faster unit tests
[ https://issues.apache.org/jira/browse/SPARK-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277646#comment-14277646 ] Imran Rashid commented on SPARK-4746:
submitted a PR: https://github.com/apache/spark/pull/4048
My decision of what to choose as IntegrationTests was somewhat arbitrary. But it does have the benefit that (a) all unit tests now run in about 5 mins on my laptop and (b) you don't need to do a full {{mvn package}} before running the unit tests.
I took a brief look at what was taking up the time in the remaining tests. Most of the tests are pretty fast; a few take up most of the time. In fact, it looks like we can get the time down by ~50% if we move the slowest remaining tests to integration tests as well:
ConnectionManagerSuite: - sendMessageReliably timeout
MapOutputTrackerSuite: - remote fetch exceeds akka frame size
RDDSuite: - takeSample
XORShiftRandomSuite: - XORShift generates valid random numbers
ExternalSorterSuite: - cleanup of intermediate files in sorter
SparkListenerSuite: - local metrics
ContextCleanerSuite: - automatically cleanup RDD + shuffle + broadcast
ExternalSorterSuite: - cleanup of intermediate files in sorter, bypass merge-sort
ExternalSorterSuite: - sorting without aggregation, with spill
ExternalSorterSuite: - empty partitions with spilling
TaskSetManagerSuite: - abort the job if total size of results is too large
Of course you have to go a little bit deeper to get more gains, but you could go a lot further, e.g. if you move the 60 slowest then you're down to 25% of the time. This may be opening the door to an endless debate about where we draw the line on what is a unit test vs. an integration test. But one nice thing about the test-tag based approach is that we could go finer-grained if we want, and it's pretty easy for a developer to customize which set of tests they want to run when developing locally.
[jira] [Comment Edited] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277631#comment-14277631 ] RJ Nowling edited comment on SPARK-4894 at 1/14/15 8:50 PM:
Thanks [~lmcguire]! I'll wait until next week in case you have time to put a patch together. In the meantime, here were my thoughts for changes:
1. Add an optional `model` variable to the `NaiveBayes` object and class and `NaiveBayesModel`. It would be a string with a default value of `Multinomial`. For Bernoulli, we can use `Bernoulli`.
2. In `NaiveBayesModel.predict`, we should compute and store `brzPi + brzTheta * testData.toBreeze`. If `testData(i)` is 0, then `brzTheta * testData.toBreeze` will be 0. If Bernoulli is enabled, we add `log(1 - exp(brzTheta)) * (1 - testData.toBreeze)` to account for the probabilities of the 0-valued features. (Breeze may not allow adding/subtracting scalars and vectors/matrices.)
In the current model, no term is added for rows of `testData` that have 0 entries. In the Bernoulli model, we would be adding a separate term for 0-valued features.
Here is the sklearn source for comparison: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/naive_bayes.py (Look at `_joint_log_likelihood` in the `MultinomialNB` and `BernoulliNB` classes.) Note that sklearn adds the neg prob to all features and subtracts it from features with 1-values.
[~mengxr], [~lmcguire], [~josephkb] Any thoughts or comments on any of the above?

was (Author: rnowling):
Thanks [~lmcguire]! I'll wait until next week in case you have time to put a patch together. In the mean time, here were my thoughts for changes:
1. Add an optional `model` variable to the `NaiveBayes` object and class and `NaiveBayesModel`. It would be a string with a default value of `Multinomial`. For Bernoulli, we can use `Bernoulli`.
2. In `NaiveBayesModel.predict`, we should compute and store `brzPi + brzTheta * testData.toBreeze`. If `testData(i)` is 0, then `brzTheta * testData.toBreeze` will be 0. If Bernoulli is enabled, we add `log(1 - exp(brzTheta)) * (1 - testData.toBreeze)` to account for the probabilities for the 0-valued features. (Breeze may not allow adding/subtracting scalars and vectors/matrices.)
In the current model, no term is added for rows of `testData` that have 0 entries. In the Bernoulli model, we would be adding a separate term for 0-valued features.
Here is the sklearn source for comparison: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/naive_bayes.py
Note that sklearn adds the neg prob to all features and subtracts it from features with 1-values.
[~mengxr], [~josephkb] Any thoughts or comments?
[jira] [Commented] (SPARK-3726) RandomForest: Support for bootstrap options
[ https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277643#comment-14277643 ] Manoj Kumar commented on SPARK-3726:
[~josephkb] You seem to report issues that I always think I can have a decent shot at :) I would like to submit a PR for this by the end of the week.
[jira] [Commented] (SPARK-4746) integration tests should be separated from faster unit tests
[ https://issues.apache.org/jira/browse/SPARK-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277633#comment-14277633 ] Apache Spark commented on SPARK-4746:
User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/4048
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277631#comment-14277631 ] RJ Nowling commented on SPARK-4894:
Thanks [~lmcguire]! I'll wait until next week in case you have time to put a patch together. In the meantime, here were my thoughts for changes:
1. Add an optional `model` variable to the `NaiveBayes` object and class and `NaiveBayesModel`. It would be a string with a default value of `Multinomial`. For Bernoulli, we can use `Bernoulli`.
2. In `NaiveBayesModel.predict`, we should compute and store `brzPi + brzTheta * testData.toBreeze`. If `testData(i)` is 0, then `brzTheta * testData.toBreeze` will be 0. If Bernoulli is enabled, we add `log(1 - exp(brzTheta)) * (1 - testData.toBreeze)` to account for the probabilities of the 0-valued features. (Breeze may not allow adding/subtracting scalars and vectors/matrices.)
In the current model, no term is added for rows of `testData` that have 0 entries. In the Bernoulli model, we would be adding a separate term for 0-valued features.
Here is the sklearn source for comparison: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/naive_bayes.py
Note that sklearn adds the neg prob to all features and subtracts it from features with 1-values.
[~mengxr], [~josephkb] Any thoughts or comments?
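The two joint log-likelihoods described in the comment above can be sketched in plain Python (dense lists rather than Breeze vectors; `pi` holds log class priors, `theta[c][i]` holds log P(feature i | class c), and the binary-feature assumption for the Bernoulli case follows the comment):

```python
import math

def multinomial_jll(pi, theta, x):
    """pi[c] + theta[c] . x: zero-valued features contribute nothing."""
    return [p + sum(t * xi for t, xi in zip(row, x))
            for p, row in zip(pi, theta)]

def bernoulli_jll(pi, theta, x):
    """Adds log(1 - exp(theta)) for each zero-valued (binary) feature."""
    return [p + sum(t * xi + math.log(1.0 - math.exp(t)) * (1 - xi)
                    for t, xi in zip(row, x))
            for p, row in zip(pi, theta)]
```

A feature that is absent (x_i = 0) is ignored by the multinomial score but explicitly penalized by the Bernoulli score, which is exactly the difference the comment describes (cf. `_joint_log_likelihood` in sklearn's `MultinomialNB` and `BernoulliNB`).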
[jira] [Commented] (SPARK-5235) Determine serializability of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277611#comment-14277611 ] Alex Baretta commented on SPARK-5235: - [~rxin] I see your point. Well, listen, I appreciate you merging my PR. I can definitely rework my code to get the SQLContext out of the picture. Still, before you commit a change removing the Serializable trait from SQLContext, you probably want to let everyone know by dropping a note to the dev list first, to give people time to refactor their code.
> Determine serializability of SQLContext
> Key: SPARK-5235
> URL: https://issues.apache.org/jira/browse/SPARK-5235
> Project: Spark
> Issue Type: Sub-task
> Reporter: Alex Baretta
>
> The SQLConf field in SQLContext is neither Serializable nor transient. Here's
> the stack trace I get when running SQL queries against a Parquet file.
> {code}
> Exception in thread "Thread-43" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.sql.SQLConf
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183)
>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183)
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843)
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779)
>   at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>   at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
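The root cause in the trace above — a field that is neither Serializable nor transient dragging its enclosing object into a serialization failure — has a direct analog in Python pickling. A hedged sketch with illustrative stand-in classes (not Spark code):

```python
import pickle
import threading

class Conf:
    """Stand-in for SQLConf: holds something that cannot be serialized."""
    def __init__(self):
        self.lock = threading.Lock()  # locks are not picklable

class Context:
    """Stand-in for SQLContext: excludes the bad field, like marking it @transient."""
    def __init__(self):
        self.conf = Conf()

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("conf")  # drop the non-serializable field before pickling
        return state

try:
    pickle.dumps(Conf())
    conf_picklable = True
except TypeError:
    conf_picklable = False  # analogous to the NotSerializableException above

blob = pickle.dumps(Context())  # succeeds because the field was excluded
print(conf_picklable)  # False
```

The trade-off mirrors the JIRA discussion: excluding the field makes serialization succeed, but the deserialized object silently comes back without it.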
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277590#comment-14277590 ] Leah McGuire commented on SPARK-4894: - Thanks [~rnowling]! I can take a look at it this week and at least confirm that the docs are incorrect and outline the changes that need to occur. If I have time I will try to send you a patch this week; if not, I'm happy to help with review and testing for your patch.
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277578#comment-14277578 ] Xiangrui Meng commented on SPARK-4894: -- [~rnowling] I've assigned this to you. Let's also have [~josephkb] in the loop, who has mentioned this multiple times offline. It would be good if you could briefly describe your plan, e.g., API changes and the implementation. Thanks!
[jira] [Updated] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4894: - Assignee: RJ Nowling
[jira] [Updated] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4894: - Affects Version/s: 1.2.0 (was: 1.1.1)
[jira] [Updated] (SPARK-5234) examples for ml don't have sparkContext.stop
[ https://issues.apache.org/jira/browse/SPARK-5234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5234: - Assignee: yuhao yang > examples for ml don't have sparkContext.stop > > > Key: SPARK-5234 > URL: https://issues.apache.org/jira/browse/SPARK-5234 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.2.0 > Environment: all >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Trivial > Fix For: 1.3.0, 1.2.1 > > Original Estimate: 1h > Remaining Estimate: 1h > > Not sure why sc.stop() is not in the > org.apache.spark.examples.ml {CrossValidatorExample, SimpleParamsExample, > SimpleTextClassificationPipeline}. > I can prepare a PR if it's not intentional to omit the call to stop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5234) examples for ml don't have sparkContext.stop
[ https://issues.apache.org/jira/browse/SPARK-5234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5234: - Target Version/s: 1.3.0, 1.2.1 (was: 1.3.0)
[jira] [Updated] (SPARK-5234) examples for ml don't have sparkContext.stop
[ https://issues.apache.org/jira/browse/SPARK-5234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5234: - Fix Version/s: 1.2.1
[jira] [Resolved] (SPARK-5234) examples for ml don't have sparkContext.stop
[ https://issues.apache.org/jira/browse/SPARK-5234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5234. -- Resolution: Fixed Issue resolved by pull request 4044 [https://github.com/apache/spark/pull/4044]
[jira] [Commented] (SPARK-5235) Determine serializability of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277561#comment-14277561 ] Reynold Xin commented on SPARK-5235: I will merge your PR and we can continue having this discussion. For 1.3 we do plan to stabilize the Spark SQL external APIs. A lot of APIs will actually change (we will likely end up breaking the programming API's binary and source compatibility) to make a better API. This is actually a good chance to do it. Once we are done with that, Spark SQL can graduate from alpha and the APIs will no longer change.
[jira] [Commented] (SPARK-5206) Accumulators are not re-registered during recovering from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277554#comment-14277554 ] vincent ye commented on SPARK-5206: --- Hi Tathagata, the Accumulator object is created after the StreamingContext (ssc) is created, using: val counter = ssc.sparkContext.accumulator(). The way I create the recoverable ssc is like this:
{code}
val ssc = StreamingContext.getOrCreate("/spark/applicationName", functionToCreateContext())

def functionToCreateContext = {
  val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
  val ssc = new StreamingContext(conf, Seconds(1))
  val counter = ssc.sparkContext.accumulator(0L, "message received")
  ...
  dstream.foreach(counter += 1)
  ...
  ssc
}

ssc.start
ssc.awaitTermination()
{code}
If the app is recovering from checkpoint, it won't execute functionToCreateContext. Then the counter Accumulator won't be instantiated and registered to the Accumulators singleton object. But the counter inside dstream.foreach(counter += 1) will be created by the deserializer on a remote worker, which won't register the counter to the driver. I don't understand what you mean by explicitly referencing the Accumulator object ..
Thanks
> Accumulators are not re-registered during recovering from checkpoint
> Key: SPARK-5206
> URL: https://issues.apache.org/jira/browse/SPARK-5206
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Affects Versions: 1.1.0
> Reporter: vincent ye
>
> I got the following exception while my streaming application restarts from a crash from checkpoint:
> {code}
> 15/01/12 10:31:06 sparkDriver-akka.actor.default-dispatcher-4 ERROR scheduler.DAGScheduler: Failed to update accumulators for ShuffleMapTask(41, 4)
> java.util.NoSuchElementException: key not found: 1
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:58)
>   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:939)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:938)
>   at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>   at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
>   at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1388)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> I guess that an Accumulator is registered to the singleton Accumulators in line 58 of org.apache.spark.Accumulable: Accumulators.register(this, true). This code needs to be executed in the driver once, but when the application is recovered from checkpoint it won't be executed in the driver. So when the driver processes it at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938), it can't find the Accumulator, because it is not re-registered during the recovery.
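The registration gap described in this report — the accumulator is registered at creation time, so a recovered context whose creation function never re-runs leaves the driver-side registry empty — can be modeled in a few lines of Python. This is a toy registry, not Spark internals; all names are illustrative:

```python
import copy

registry = {}  # models the driver-side Accumulators singleton: id -> accumulator

class Acc:
    """Toy accumulator that self-registers on creation, like Accumulable does."""
    _next_id = 0

    def __init__(self, value=0):
        self.id = Acc._next_id
        Acc._next_id += 1
        self.value = value
        registry[self.id] = self  # register(this) happens only at creation

# Fresh start: the accumulator is created and registered on the driver.
driver_acc = Acc(0)
task_side = copy.copy(driver_acc)  # the copy shipped inside a task closure

# "Recovery from checkpoint": the creation function never re-runs,
# so the driver's registry comes back empty.
registry.clear()

# A task result arrives carrying an id the driver no longer knows about.
try:
    registry[task_side.id].value += 1
    update_applied = True
except KeyError:
    update_applied = False  # the "key not found" failure from the report

print(update_applied)  # False
```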
[jira] [Commented] (SPARK-5235) Determine serializability of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277555#comment-14277555 ] Alex Baretta commented on SPARK-5235: - [~rxin] Could be. All I'm saying is that your change was not intended to make SQLContext not Serializable. Even if we all agree that it would be cleaner, "taking the opportunity" offered by this regression to remove the Serializable trait from SQLContext is not a good idea, as there is no emergency here. Printing a warning, noting it in the docs for version 1.3, and then waiting until 1.4 would be a better process.
[jira] [Commented] (SPARK-5235) Determine serializability of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277553#comment-14277553 ] Sean Owen commented on SPARK-5235: -- [~alexbaretta] I suppose my point is that no code can possibly rely on serializing {{SQLContext}}, since even before the recent change it would deserialize with no {{SparkContext}} and be unable to do anything. It can only be accidentally serialized. The issue is that the change would turn a silent error into an explicit one, yes. I'm also not 100% sure there isn't some binary compatibility issue. If not, then perhaps we could stick in a warning by overriding the Java serialization method to print it.
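The last suggestion above — override the serialization hook so accidental serialization at least emits a warning — can be sketched using Python's pickle hook as a stand-in for Java's writeObject. This is an illustrative analog only, not a proposed patch:

```python
import pickle
import warnings

class Context:
    """Stand-in for a context object that is accidentally serializable."""

    def __reduce__(self):
        # Analog of overriding the Java serialization method to warn:
        # serialization still succeeds, but the caller is told it's a bad idea.
        warnings.warn("serializing Context is deprecated and may stop working",
                      DeprecationWarning, stacklevel=2)
        return (Context, ())

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    blob = pickle.dumps(Context())  # still works, but now warns once

print(len(caught))  # 1
```

This matches the migration path discussed in the thread: warn in one release, remove the capability in the next.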
[jira] [Issue Comment Deleted] (SPARK-5206) Accumulators are not re-registered during recovering from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vincent ye updated SPARK-5206: -- Comment: was deleted (was: Hi Tathagata, Accumulator object is created after the StreamingContext (ssc) created using: val counter = ssc.sparkContext.accumulator() The way I create the recoverable ssc is like this: val ssc = StreamingContext.getOrCreate("/spark/applicationName", functionToCreateContext()) def functionToCreateContext = { val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val counter = ssc.sparkContext.accumulator(0L, "message received") . dstream.foreach(counter += 1) .. ssc } ssc.start ssc.awaitTermination() If the app is recovering from checkpoint, It won't execute functionToCreateContext. Then the counter of Accumulator won't be instantiated and registered to Accumulators singleton object. But the counter inside dstream.foreach(counter += 1) will be created by deserializer on a remote worker which won't register the counter to the driver. I don't understand what you mean explicity referencing the Accumulator object .. 
Thanks ) > Accumulators are not re-registered during recovering from checkpoint > > > Key: SPARK-5206 > URL: https://issues.apache.org/jira/browse/SPARK-5206 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.1.0 >Reporter: vincent ye > > I got exception as following while my streaming application restarts from > crash from checkpoit: > 15/01/12 10:31:06 sparkDriver-akka.actor.default-dispatcher-4 ERROR > scheduler.DAGScheduler: Failed to update accumulators for ShuffleMapTask(41, > 4) > java.util.NoSuchElementException: key not found: 1 > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:58) > at scala.collection.mutable.HashMap.apply(HashMap.scala:64) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:939) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:938) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) > at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1388) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) > at akka.actor.ActorCell.invoke(ActorCell.scala:456) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > I guess that an Accumulator is registered with the singleton Accumulators at line > 58 of org.apache.spark.Accumulable: > Accumulators.register(this, true) > This code needs to be executed in the driver once, but when the application is > recovered from checkpoint, it won't be executed in the driver. So when the > driver processes it at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938), > it can't find the Accumulator because it was not re-registered during the > recovery. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
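The failure mode described above — an object registered in a driver-side singleton only at construction time, then recreated by deserialization without the registration ever re-running — can be reproduced outside Spark. Below is a minimal Python sketch of the mechanism (a toy model: the `Registry` and `Accumulator` classes here are invented for illustration and are not Spark's actual Accumulators code):

```python
import pickle

class Registry:
    """Toy stand-in for Spark's driver-side Accumulators singleton."""
    values = {}

class Accumulator:
    _next_id = 0

    def __init__(self, value=0):
        Accumulator._next_id += 1
        self.id = Accumulator._next_id
        self.value = value
        # Registration happens ONLY at construction time, mirroring
        # Accumulators.register(this, true) in Accumulable's constructor.
        Registry.values[self.id] = self

acc = Accumulator()
blob = pickle.dumps(acc)   # analogous to checkpointing / shipping the closure

Registry.values.clear()    # simulate a restarted driver recovering from checkpoint
restored = pickle.loads(blob)  # unpickling does NOT re-run __init__

print(restored.value)                  # the object itself is intact...
print(restored.id in Registry.values)  # ...but the registry lookup fails (False)
```

The lookup failure is the toy analogue of the `key not found` error in the stack trace: the recovered accumulator exists, but the driver's registry was never repopulated.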
[jira] [Updated] (SPARK-4014) TaskContext.attemptId returns taskId
[ https://issues.apache.org/jira/browse/SPARK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4014: -- Target Version/s: 1.0.3, 1.1.2, 1.2.1 Assignee: Josh Rosen Labels: backport-needed (was: ) > TaskContext.attemptId returns taskId > > > Key: SPARK-4014 > URL: https://issues.apache.org/jira/browse/SPARK-4014 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Yin Huai >Assignee: Josh Rosen >Priority: Minor > Labels: backport-needed > Fix For: 1.3.0 > > > In TaskRunner, we assign the taskId of a task to the attemptId of the > corresponding TaskContext. Should we rename attemptId to taskId to avoid > confusion?
[jira] [Resolved] (SPARK-4014) TaskContext.attemptId returns taskId
[ https://issues.apache.org/jira/browse/SPARK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4014. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3849 [https://github.com/apache/spark/pull/3849] > TaskContext.attemptId returns taskId > > > Key: SPARK-4014 > URL: https://issues.apache.org/jira/browse/SPARK-4014 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Yin Huai >Priority: Minor > Labels: backport-needed > Fix For: 1.3.0 > > > In TaskRunner, we assign the taskId of a task to the attemptId of the > corresponding TaskContext. Should we rename attemptId to taskId to avoid > confusion?
[jira] [Updated] (SPARK-5235) Determine serializability of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5235: --- Summary: Determine serializability of SQLContext (was: java.io.NotSerializableException: org.apache.spark.sql.SQLConf) > Determine serializability of SQLContext > --- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Bug >Reporter: Alex Baretta > > The SQLConf field in SQLContext is neither Serializable nor transient. Here's > the stack trace I get when running SQL queries against a Parquet file. > {code} > Exception in thread "Thread-43" org.apache.spark.SparkException: Job aborted > due to stage failure: Task not serializable: > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) > at > org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) > at akka.dispatch.Mailbox.run(Mailbox.scala:220) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code}
[jira] [Updated] (SPARK-5235) Determine serializability of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5235: --- Issue Type: Sub-task (was: Bug) Parent: SPARK-5166 > Determine serializability of SQLContext > --- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Sub-task >Reporter: Alex Baretta > > The SQLConf field in SQLContext is neither Serializable nor transient. Here's > the stack trace I get when running SQL queries against a Parquet file. > {code} > Exception in thread "Thread-43" org.apache.spark.SparkException: Job aborted > due to stage failure: Task not serializable: > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) > at > org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) > at 
akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) > at akka.dispatch.Mailbox.run(Mailbox.scala:220) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code}
[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277538#comment-14277538 ] Joseph K. Bradley commented on SPARK-5019: -- [~lewuathe] Will you be able to update yours soon? It would be great to get this patch in right away since it will help finalize the public API. Thanks! If you don't have time, please let us know so that [~tgaloppo] can submit his. > Update GMM API to use MultivariateGaussian > -- > > Key: SPARK-5019 > URL: https://issues.apache.org/jira/browse/SPARK-5019 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Priority: Blocker > > The GaussianMixtureModel API should expose MultivariateGaussian instances > instead of the means and covariances. This should be fixed as soon as > possible to stabilize the API.
[jira] [Commented] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277539#comment-14277539 ] Reynold Xin commented on SPARK-5235: How would we deprecate that though? Are you suggesting we print a runtime warning when SQLContext is being serialized? > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > -- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Bug >Reporter: Alex Baretta > > The SQLConf field in SQLContext is neither Serializable nor transient. Here's > the stack trace I get when running SQL queries against a Parquet file. > {code} > Exception in thread "Thread-43" org.apache.spark.SparkException: Job aborted > due to stage failure: Task not serializable: > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) > at > org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) > at akka.dispatch.Mailbox.run(Mailbox.scala:220) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code}
[jira] [Commented] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277533#comment-14277533 ] Alex Baretta commented on SPARK-5235: - [~sowen] I would much rather the decision of making SQLContext not Serializable go through some review. Usually API changes are preceded by a period of "deprecation" of the older API. I don't think there is any urgent need here to break code relying on the serializability of SQLContext without warning. But, yes, longer term I'd want to structure my code so that the SQLContext is not commingled with objects that need to be shipped around. > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > -- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Bug >Reporter: Alex Baretta > > The SQLConf field in SQLContext is neither Serializable nor transient. Here's > the stack trace I get when running SQL queries against a Parquet file. 
> {code} > Exception in thread "Thread-43" org.apache.spark.SparkException: Job aborted > due to stage failure: Task not serializable: > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) > at > org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) > at akka.dispatch.Mailbox.run(Mailbox.scala:220) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code}
[jira] [Commented] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277485#comment-14277485 ] Joseph K. Bradley commented on SPARK-3717: -- [~bbnsumanth] Please do not misunderstand: We appreciate your efforts towards contributing. However, when someone starts contributing to Spark, it is important for them to start with smaller tasks to get used to the PR review process, code style requirements, etc. We asked you to follow this convention from the beginning and in multiple messages, but I do not see any responses from you to our suggestions. The main issue is that we and other contributors have limited time. If a patch is very large and requires lots of reviewing, it is a very large time commitment. If you become a regular contributor starting with small patches, then other reviewers will get used to reviewing your code and will be more willing to commit the many hours required for reviewing a large feature. Here is what I would recommend: * Keep your current code. You could even prepare it as a package for Spark if you would like others to start using it right away. * Start contributing using small patches. Get used to the review process and coding conventions. * Re-submit your random forest implementation once you have done other smaller contributions. I hope you do not get discouraged from contributing. We simply need to follow regular contribution procedures to make this very large project manageable. Please refer to [https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark] if it is useful. > DecisionTree, RandomForest: Partition by feature > > > Key: SPARK-3717 > URL: https://issues.apache.org/jira/browse/SPARK-3717 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > > h1. Summary > Currently, data are partitioned by row/instance for DecisionTree and > RandomForest. 
This JIRA argues for partitioning by feature for training deep > trees. This is especially relevant for random forests, which are often > trained to be deeper than single decision trees. > h1. Details > Dataset dimensions and the depth of the tree to be trained are the main > problem parameters determining whether it is better to partition features or > instances. For random forests (training many deep trees), partitioning > features could be much better. > Notation: > * P = # workers > * N = # instances > * M = # features > * D = depth of tree > h2. Partitioning Features > Algorithm sketch: > * Each worker stores: > ** a subset of columns (i.e., a subset of features). If a worker stores > feature j, then the worker stores the feature value for all instances (i.e., > the whole column). > ** all labels > * Train one level at a time. > * Invariants: > ** Each worker stores a mapping: instance → node in current level > * On each iteration: > ** Each worker: For each node in level, compute (best feature to split, info > gain). > ** Reduce (P x M) values to M values to find best split for each node. > ** Workers who have features used in best splits communicate left/right for > relevant instances. Gather total of N bits to master, then broadcast. > * Total communication: > ** Depth D iterations > ** On each iteration, reduce to M values (~8 bytes each), broadcast N values > (1 bit each). > ** Estimate: D * (M * 8 + N) > h2. Partitioning Instances > Algorithm sketch: > * Train one group of nodes at a time. > * Invariants: > * Each worker stores a mapping: instance → node > * On each iteration: > ** Each worker: For each instance, add to aggregate statistics. > ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) > *** (“# classes” is for classification. 3 for regression) > ** Reduce aggregate. > ** Master chooses best split for each node in group and broadcasts. 
> * Local training: Once all instances for a node fit on one machine, it can be > best to shuffle the data and train subtrees locally. This can mean shuffling > the entire dataset for each tree trained. > * Summing over all iterations, reduce to total of: > ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) > ** Estimate: 2^D * M * B * C * 8 > h2. Comparing Partitioning Methods > Partitioning features cost < partitioning instances cost when: > * D * (M * 8 + N) < 2^D * M * B * C * 8 > * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the > right hand side) > * N < [ 2^D * M * B * C * 8 ] / D > Example: many instances: > * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = > 5) > * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 > * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8
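The two communication-cost estimates above are simple enough to check numerically. A short script plugging in the example parameters (a sketch of the back-of-the-envelope model from the JIRA description, not a benchmark; the function names are invented):

```python
def cost_partition_features(iterations, num_features, num_instances):
    # Per iteration: reduce M ~8-byte split summaries, broadcast N 1-bit flags.
    return iterations * (num_features * 8 + num_instances)

def cost_partition_instances(depth, num_features, num_bins, num_classes):
    # ~2^D nodes in the tree, each reducing M x B x C ~8-byte statistics.
    return 2 ** depth * num_features * num_bins * num_classes * 8

# Example from the text: 2M instances, 3500 features, 100 bins, 5 classes,
# 6 levels (depth = 5). Note the text uses 6 iterations (levels) for the
# feature-partitioned cost but 2^5 = 32 nodes for the instance-partitioned one.
N, M, B, C = 2_000_000, 3_500, 100, 5

by_feature = cost_partition_features(6, M, N)
by_instance = cost_partition_instances(5, M, B, C)

print(f"{by_feature:.2e}")   # ~1.2e7, matching the estimate above
print(f"{by_instance:.2e}")  # ~4.5e8
```

With these parameters the instance-partitioned reduce moves roughly 35x more data, which is the argument for partitioning by feature when training deep trees.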
[jira] [Updated] (SPARK-5228) Hide tables for "Active Jobs/Completed Jobs/Failed Jobs" when they are empty
[ https://issues.apache.org/jira/browse/SPARK-5228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5228: -- Assignee: Kousuke Saruta > Hide tables for "Active Jobs/Completed Jobs/Failed Jobs" when they are empty > > > Key: SPARK-5228 > URL: https://issues.apache.org/jira/browse/SPARK-5228 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta > Fix For: 1.3.0 > > > In the current WebUI, tables for Active Stages, Completed Stages, Skipped Stages > and Failed Stages are hidden when they are empty, while tables for Active > Jobs, Completed Jobs and Failed Jobs are not hidden even when they are empty.
[jira] [Resolved] (SPARK-5228) Hide tables for "Active Jobs/Completed Jobs/Failed Jobs" when they are empty
[ https://issues.apache.org/jira/browse/SPARK-5228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5228. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4028 [https://github.com/apache/spark/pull/4028] > Hide tables for "Active Jobs/Completed Jobs/Failed Jobs" when they are empty > > > Key: SPARK-5228 > URL: https://issues.apache.org/jira/browse/SPARK-5228 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta > Fix For: 1.3.0 > > > In the current WebUI, tables for Active Stages, Completed Stages, Skipped Stages > and Failed Stages are hidden when they are empty, while tables for Active > Jobs, Completed Jobs and Failed Jobs are not hidden even when they are empty.
[jira] [Commented] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277475#comment-14277475 ] Sean Owen commented on SPARK-5235: -- [~alexbaretta] Well at least that explains why tests didn't fail, that would have been really puzzling. This also at least gives you a quick route to getting your program working and it's more reliable. In a way it's almost a good thing, to me, that it happens to no longer serialize. Separately, I'd be curious what the API-compatibility implications are of making the class not extend Serializable and leave it at that. > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > -- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Bug >Reporter: Alex Baretta > > The SQLConf field in SQLContext is neither Serializable nor transient. Here's > the stack trace I get when running SQL queries against a Parquet file. > {code} > Exception in thread "Thread-43" org.apache.spark.SparkException: Job aborted > due to stage failure: Task not serializable: > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) > at > 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) > at > org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) > at akka.dispatch.Mailbox.run(Mailbox.scala:220) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code}
[jira] [Updated] (SPARK-2909) Indexing for SparseVector in pyspark
[ https://issues.apache.org/jira/browse/SPARK-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2909: - Assignee: Manoj Kumar > Indexing for SparseVector in pyspark > > > Key: SPARK-2909 > URL: https://issues.apache.org/jira/browse/SPARK-2909 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Reporter: Joseph K. Bradley >Assignee: Manoj Kumar >Priority: Minor > Fix For: 1.3.0 > > > SparseVector in pyspark does not currently support indexing, except by > examining the internal representation. Though indexing is a pricey operation, > it would be useful for, e.g., iterating through a dataset (RDD[LabeledPoint]) > and operating on a single feature.
[jira] [Commented] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277460#comment-14277460 ] Alex Baretta commented on SPARK-5235: - [~rxin] [~sowen] My bad! Indeed the SQLContext gets captured in a closure in my own code, passed to a flatMap. I did not notice this previously because the SQLContext was Serializable. > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > -- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Bug >Reporter: Alex Baretta > > The SQLConf field in SQLContext is neither Serializable nor transient. Here's > the stack trace I get when running SQL queries against a Parquet file. > {code} > Exception in thread "Thread-43" org.apache.spark.SparkException: Job aborted > due to stage failure: Task not serializable: > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) > at > org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) > 
at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) > at akka.dispatch.Mailbox.run(Mailbox.scala:220) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code}
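The mistake Alex describes above — a context object captured by a closure passed to flatMap — is easy to reproduce in miniature: serializing the task pulls in everything the task references, and fails if any of it is not serializable. The usual fix is to extract the plain value the closure needs before defining it. A toy Python sketch (pickle stands in for Spark's closure serializer; FakeSQLContext, FakeSQLConf, and Task are invented names, not Spark APIs):

```python
import pickle

class FakeSQLConf:
    """Stand-in for org.apache.spark.sql.SQLConf: refuses to serialize."""
    def __reduce__(self):
        raise TypeError("FakeSQLConf is not serializable")

class FakeSQLContext:
    """Stand-in for SQLContext: holds a non-serializable conf field."""
    def __init__(self):
        self.conf = FakeSQLConf()
        self.shuffle_partitions = 200

class Task:
    """Stand-in for a closure that is serialized and shipped to executors."""
    def __init__(self, captured):
        self.captured = captured
    def __call__(self, x):
        return x % self.captured if isinstance(self.captured, int) else x

ctx = FakeSQLContext()

# Bad: the "closure" captures the whole context, so serialization fails,
# analogous to the NotSerializableException in this issue.
try:
    pickle.dumps(Task(ctx))
except TypeError as e:
    print("serialization failed:", e)

# Good: capture only the primitive value the closure actually needs.
task = pickle.loads(pickle.dumps(Task(ctx.shuffle_partitions)))
print(task(1001))  # 1001 % 200 = 1
```

The same principle applies in Spark code: read whatever configuration you need from the SQLContext on the driver, and let the closure capture only those extracted values.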
[jira] [Resolved] (SPARK-2909) Indexing for SparseVector in pyspark
[ https://issues.apache.org/jira/browse/SPARK-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2909. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4025 [https://github.com/apache/spark/pull/4025] > Indexing for SparseVector in pyspark > > > Key: SPARK-2909 > URL: https://issues.apache.org/jira/browse/SPARK-2909 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Reporter: Joseph K. Bradley >Priority: Minor > Fix For: 1.3.0 > > > SparseVector in pyspark does not currently support indexing, except by > examining the internal representation. Though indexing is a pricey operation, > it would be useful for, e.g., iterating through a dataset (RDD[LabeledPoint]) > and operating on a single feature.
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277449#comment-14277449 ] Apache Spark commented on SPARK-1405: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/4047 > parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib > - > > Key: SPARK-1405 > URL: https://issues.apache.org/jira/browse/SPARK-1405 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xusen Yin >Assignee: Guoqiang Li >Priority: Critical > Labels: features > Attachments: performance_comparison.png > > Original Estimate: 336h > Remaining Estimate: 336h > > Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts > topics from a text corpus. Unlike current machine learning algorithms > in MLlib, which use optimization algorithms such as gradient descent, > LDA uses expectation algorithms such as Gibbs sampling. > In this PR, I prepare an LDA implementation based on Gibbs sampling, with a > wholeTextFiles API (solved yet), a word segmentation (import from Lucene), > and a Gibbs sampling core.
[jira] [Commented] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277414#comment-14277414 ] Alex Baretta commented on SPARK-5235: - [~rxin] I'm sorry to say it's not that easy, especially since I can't really run any of my code off of master: I need to apply a few outstanding PRs pertaining to Parquet in addition to this one to get any of my code to run at all. I'll try to reconstruct the configuration that was failing, but it will take a while. > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > -- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Bug >Reporter: Alex Baretta > > The SQLConf field in SQLContext is neither Serializable nor transient. Here's > the stack trace I get when running SQL queries against a Parquet file. > {code} > Exception in thread "Thread-43" org.apache.spark.SparkException: Job aborted > due to stage failure: Task not serializable: > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) > at > 
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) > at akka.dispatch.Mailbox.run(Mailbox.scala:220) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code}
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277334#comment-14277334 ] Shivaram Venkataraman commented on SPARK-3821: -- Regarding the pre-built distributions, AFAIK we don't support full Hadoop2 as in YARN. We run CDH4, which has some parts of Hadoop2, but with MapReduce. There is an open PR to add support for Hadoop2 at https://github.com/mesos/spark-ec2/pull/77 and you can see that it gets the right [prebuilt Spark|https://github.com/mesos/spark-ec2/pull/77/files#diff-1d040c3294246f2b59643d63868fc2adR97] in that case. > Develop an automated way of creating Spark images (AMI, Docker, and others) > --- > > Key: SPARK-3821 > URL: https://issues.apache.org/jira/browse/SPARK-3821 > Project: Spark > Issue Type: Improvement > Components: Build, EC2 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas > Attachments: packer-proposal.html > > > Right now the creation of Spark AMIs or Docker containers is done manually. > With tools like [Packer|http://www.packer.io/], we should be able to automate > this work, and do so in such a way that multiple types of machine images can > be created from a single template.
[jira] [Commented] (SPARK-5211) Restore HiveMetastoreTypes.toDataType
[ https://issues.apache.org/jira/browse/SPARK-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277290#comment-14277290 ] Reynold Xin commented on SPARK-5211: BTW who are the developers using it? > Restore HiveMetastoreTypes.toDataType > - > > Key: SPARK-5211 > URL: https://issues.apache.org/jira/browse/SPARK-5211 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > > It was a public API. Since developers are using it, we need to get it back.
[jira] [Commented] (SPARK-5211) Restore HiveMetastoreTypes.toDataType
[ https://issues.apache.org/jira/browse/SPARK-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277292#comment-14277292 ] Reynold Xin commented on SPARK-5211: I'm under the impression that everything in the Hive package that is not HiveContext should not have been a public API. > Restore HiveMetastoreTypes.toDataType > - > > Key: SPARK-5211 > URL: https://issues.apache.org/jira/browse/SPARK-5211 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > > It was a public API. Since developers are using it, we need to get it back.
[jira] [Updated] (SPARK-5245) Move Decimal from types.decimal to types package
[ https://issues.apache.org/jira/browse/SPARK-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5245: --- Summary: Move Decimal from types.decimal to types package (was: Move Decimal and Date into o.a.s.sql.types) > Move Decimal from types.decimal to types package > > > Key: SPARK-5245 > URL: https://issues.apache.org/jira/browse/SPARK-5245 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > Fix For: 1.3.0 > > > We probably don't want to (and don't need to) create a new package for each type. > We can just put them in org.apache.spark.sql.types.
[jira] [Resolved] (SPARK-5245) Move Decimal from types.decimal to types package
[ https://issues.apache.org/jira/browse/SPARK-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5245. Resolution: Fixed Fix Version/s: 1.3.0 Fixed in https://github.com/apache/spark/pull/4041 > Move Decimal from types.decimal to types package > > > Key: SPARK-5245 > URL: https://issues.apache.org/jira/browse/SPARK-5245 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Adrian Wang > Fix For: 1.3.0 > > > We probably don't want to (and don't need to) create a new package for each type. > We can just put them in org.apache.spark.sql.types.
[jira] [Updated] (SPARK-5245) Move Decimal from types.decimal to types package
[ https://issues.apache.org/jira/browse/SPARK-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5245: --- Assignee: Adrian Wang > Move Decimal from types.decimal to types package > > > Key: SPARK-5245 > URL: https://issues.apache.org/jira/browse/SPARK-5245 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Adrian Wang > Fix For: 1.3.0 > > > We probably don't want to (and don't need to) create a new package for each type. > We can just put them in org.apache.spark.sql.types.
[jira] [Resolved] (SPARK-5248) moving Decimal from types.decimal to types package
[ https://issues.apache.org/jira/browse/SPARK-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5248. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Adrian Wang Fixed in https://github.com/apache/spark/pull/4041 > moving Decimal from types.decimal to types package > -- > > Key: SPARK-5248 > URL: https://issues.apache.org/jira/browse/SPARK-5248 > Project: Spark > Issue Type: Improvement >Reporter: Adrian Wang >Assignee: Adrian Wang >Priority: Minor > Fix For: 1.3.0 > >
[jira] [Commented] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277270#comment-14277270 ] Reynold Xin commented on SPARK-5235: I can merge your PR soon, but [~alexbaretta] can you also run with -Dsun.io.serialization.extendeddebuginfo=true as [~srowen] suggested and find the cause of that? That should take only a minute. > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > -- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Bug >Reporter: Alex Baretta
[jira] [Updated] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5235: --- Description: The SQLConf field in SQLContext is neither Serializable nor transient. Here's the stack trace I get when running SQL queries against a Parquet file. {code} Exception in thread "Thread-43" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.sql.SQLConf at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at 
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} was: The SQLConf field in SQLContext is neither Serializable nor transient. Here's the stack trace I get when running SQL queries against a Parquet file. Exception in thread "Thread-43" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.sql.SQLConf at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at 
akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > java.io.NotSerializableException: org.apache.spark.sql.SQLConf >
[jira] [Commented] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277258#comment-14277258 ] Alex Baretta commented on SPARK-5235: - Yes, there is a need for a hotfix. [~rxin] committed a change less than 24 hours ago to SQLContext, splitting out SQLConf as a field in SQLContext instead of a mixin. The new field should have been declared either serializable or transient. https://github.com/apache/spark/commit/14e3f114efb906937b2d7b7ac04484b2814a3b48 > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > -- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Bug >Reporter: Alex Baretta
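The fix discussed in this thread, marking the new SQLConf field either serializable or transient, is a one-line @transient annotation in Scala. As a language-neutral sketch of the same idea, here is the analogous pattern in Python, where excluding a non-picklable field in __getstate__ and rebuilding it in __setstate__ plays the role of @transient; every name below is illustrative, not Spark's actual code:

```python
import pickle
import threading

class Conf:
    """Stand-in for SQLConf: holds state that cannot travel over the wire."""
    def __init__(self):
        self._lock = threading.Lock()  # a lock is not picklable, forcing the issue
        self.settings = {"spark.sql.shuffle.partitions": "200"}

class Context:
    """Stand-in for SQLContext: its conf field is skipped during serialization."""
    def __init__(self):
        self.conf = Conf()
        self.app_name = "demo"

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("conf", None)  # analogous to marking the field @transient
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.conf = Conf()       # rebuilt on deserialization, like a lazy val

ctx = Context()
# Serializing the context now succeeds because the conf field is skipped;
# without __getstate__, pickling would raise the analogue of
# java.io.NotSerializableException.
restored = pickle.loads(pickle.dumps(ctx))
```

As Sean Owen notes in the thread, this is a band-aid: the cleaner fix is to avoid capturing the context in serialized closures at all.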
[jira] [Commented] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277251#comment-14277251 ] Sean Owen commented on SPARK-5235: -- @Alex Baretta what version worked? If you're saying a recent change completely broke this, a quick hotfix is appropriate. However, the Jenkins tests are passing (modulo the usual flakes): https://amplab.cs.berkeley.edu/jenkins/view/Spark/ So I wonder if there's a hurry to put this band-aid on. That extended debug info might point straight to an easy real fix; more info on what triggers this would be good to compare with the unit tests that are working. If it turns out to be a mess, sure, the band-aid might be better than nothing in the short term. > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > -- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Bug >Reporter: Alex Baretta
[jira] [Updated] (SPARK-5252) Streaming StatefulNetworkWordCount example hangs
[ https://issues.apache.org/jira/browse/SPARK-5252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lutz Buech updated SPARK-5252: -- Attachment: debug.txt log at DEBUG level > Streaming StatefulNetworkWordCount example hangs > > > Key: SPARK-5252 > URL: https://issues.apache.org/jira/browse/SPARK-5252 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.2.0 > Environment: Ubuntu Linux >Reporter: Lutz Buech > Attachments: debug.txt > > > Running the stateful network word count example in Python (on one local node): > https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/stateful_network_wordcount.py > At the beginning, when no data is streamed, empty status outputs are > generated, only decorated by the current Time, e.g.: > --- > Time: 2015-01-14 17:58:20 > --- > --- > Time: 2015-01-14 17:58:21 > --- > As soon as I stream some data via netcat, no new status updates will show. > Instead, one line saying > [Stage <n>:> (2 + 0) / 3] > where <n> is some integer number, e.g. 132. There is no further output > on stdout.
[jira] [Commented] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277240#comment-14277240 ] Alex Baretta commented on SPARK-5235: - [~sowen] I agree with you that contexts have no business getting serialized and shipped around the cluster. That being said, this issue is a regression, as this simply used to work. Since this is a regression, it is very much appropriate to fix the issue by restoring the previous behavior, and then take time to think of a better design. > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > -- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Bug >Reporter: Alex Baretta
[jira] [Created] (SPARK-5252) Streaming StatefulNetworkWordCount example hangs
Lutz Buech created SPARK-5252: - Summary: Streaming StatefulNetworkWordCount example hangs Key: SPARK-5252 URL: https://issues.apache.org/jira/browse/SPARK-5252 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0 Environment: Ubuntu Linux Reporter: Lutz Buech Running the stateful network word count example in Python (on one local node): https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/stateful_network_wordcount.py At the beginning, when no data is streamed, empty status outputs are generated, only decorated by the current Time, e.g.: --- Time: 2015-01-14 17:58:20 --- --- Time: 2015-01-14 17:58:21 --- As soon as I stream some data via netcat, no new status updates will show. Instead, one line saying [Stage <n>:> (2 + 0) / 3] where <n> is some integer number, e.g. 132. There is no further output on stdout.
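For context on what output the reporter expected: the state maintenance in the linked example is driven by a pure update function passed to updateStateByKey, which folds each batch's counts for a word into its running total. A minimal sketch of such a function (paraphrasing the pattern in stateful_network_wordcount.py, not a verbatim copy):

```python
def update_count(new_values, running_count):
    # new_values: this batch's counts for one word; running_count is the
    # accumulated state, or None the first time the key is seen.
    return sum(new_values) + (running_count or 0)

# Simulating three micro-batches for a single word:
state = None
for batch in ([1, 1], [1], []):
    state = update_count(batch, state)
# state is now 3
```

Since this function is deterministic, a hang with a stuck "[Stage ...]" progress line points at the job scheduling (here, too few cores for the receiver plus the processing tasks is a common cause) rather than the word-count logic itself.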
[jira] [Commented] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277217#comment-14277217 ] Sean Owen commented on SPARK-5235: -- [~alexbaretta] It certainly may not be your code of course. I mean "people" including the Spark code. But surely the problem is solved exactly by not trying to serialize {{SQLContext}}, no? Despite its declaration, as you've demonstrated, it does not serialize, and was not designed to be used after serialization given the {{@transient}} field. You've suggested a reasonable band-aid on a band-aid, but I would either like to fix the cause or understand why it's actually supposed to act this way. Other Contexts in Spark are not supposed to be serialized. Where I've seen this pattern before in the unit tests, it was certainly a hack for convenience that didn't matter much because it was just a test. Can you run with {{-Dsun.io.serialization.extendeddebuginfo=true}}? This will show exactly what had the reference to {{SQLContext}}. > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > -- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Bug >Reporter: Alex Baretta