[jira] [Updated] (SPARK-9273) Add Convolutional Neural network to Spark MLlib

2015-11-11 Thread Jason Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dai updated SPARK-9273:
-
Assignee: yuhao yang

> Add Convolutional Neural network to Spark MLlib
> ---
>
> Key: SPARK-9273
> URL: https://issues.apache.org/jira/browse/SPARK-9273
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: yuhao yang
>Assignee: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib






[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler

2015-08-20 Thread Jason Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14705810#comment-14705810
 ] 

Jason Dai commented on SPARK-5556:
--

[~pedrorodriguez] We'll try to make a Spark package based on our repo; please 
take a look at the code and provide your feedback. Please let us know if there 
is anything we can collaborate on for LDA/topic modeling on Spark.

 Latent Dirichlet Allocation (LDA) using Gibbs sampler 
 --

 Key: SPARK-5556
 URL: https://issues.apache.org/jira/browse/SPARK-5556
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Guoqiang Li
Assignee: Pedro Rodriguez
 Attachments: LDA_test.xlsx, spark-summit.pptx









[jira] [Commented] (SPARK-10041) Proposal of Parameter Server Interface for Spark

2015-08-17 Thread Jason Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699903#comment-14699903
 ] 

Jason Dai commented on SPARK-10041:
---

I think this one is different from SPARK-6932, as it tries to decouple the 
interface design from the actual implementation; a well-defined 
interface/abstraction that allows different underlying implementations can be 
very useful.
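
To make the idea concrete, here is a minimal sketch of what such a decoupled 
interface might look like in Scala; the trait, its methods and SGDWorker are 
hypothetical names, not taken from the attached proposal:

{code:borderStyle=solid}
// Hypothetical sketch of a decoupled Parameter Server interface; all names
// are illustrative only.
trait ParameterServer {
  // Fetch the current value of (a partition of) a model parameter.
  def get(key: String): Array[Double]
  // Apply an additive update to a model parameter.
  def update(key: String, delta: Array[Double]): Unit
  // Epoch support: advance once all workers have pushed their updates.
  def advanceEpoch(): Unit
}

// Algorithms program against the trait only, so different backends (an
// external PS cluster, an in-Spark implementation, ...) can be swapped in
// without touching the algorithm code.
class SGDWorker(ps: ParameterServer) {
  def step(key: String, gradient: Array[Double]): Unit = {
    ps.update(key, gradient)
    ps.advanceEpoch()
  }
}
{code}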

 Proposal of Parameter Server Interface for Spark
 

 Key: SPARK-10041
 URL: https://issues.apache.org/jira/browse/SPARK-10041
 Project: Spark
  Issue Type: New Feature
  Components: ML, MLlib
Reporter: Yi Liu
 Attachments: Proposal of Parameter Server Interface for Spark - v1.pdf


 Many large-scale machine learning algorithms (logistic regression, LDA, 
 neural network, etc.) have been built on top of Apache Spark. As discussed in 
 SPARK-4590, a Parameter Server (PS) architecture can greatly improve the 
 scalability and efficiency of these large-scale machine learning workloads. 
 There have been some previous discussions on possible Parameter Server 
 implementations inside Spark (e.g., SPARK-6932). However, at this stage we 
 believe it is more important for the community to first define the proper 
 interface of the Parameter Server, which can be decoupled from the actual PS 
 implementations; consequently, it will be possible to support different 
 implementations of Parameter Servers in Spark later. The attached document 
 contains our initial proposal of a Parameter Server interface for ML 
 algorithms on Spark, including the data model, supported operations, epoch 
 support and possible Spark integrations.






[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler

2015-08-12 Thread Jason Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14693579#comment-14693579
 ] 

Jason Dai commented on SPARK-5556:
--

Sure; we can share our code on GitHub, and then try to make a Spark package :-)

 Latent Dirichlet Allocation (LDA) using Gibbs sampler 
 --

 Key: SPARK-5556
 URL: https://issues.apache.org/jira/browse/SPARK-5556
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Guoqiang Li
Assignee: Pedro Rodriguez
 Attachments: LDA_test.xlsx, spark-summit.pptx









[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler

2015-08-11 Thread Jason Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14692469#comment-14692469
 ] 

Jason Dai commented on SPARK-5556:
--

[~pedrorodriguez] I wonder if you have made any progress on the LDA package. 

We have actually built a package of topic modeling algorithms for our use 
cases, which contains Gibbs Sampling LDA (adapted to the MLlib LDA interface 
based on the PRs/code from [~pedrorodriguez] and [~gq], and including the 
AliasLDA, SparseLDA, LightLDA and FastLDA algorithms), as well as Hierarchical 
LDA (implemented by [~yuhaoyan] from our team). We can also share that package 
so that people can try the different algorithms.
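
For illustration only, a sketch of how such a package might surface alongside 
MLlib's LDA entry point; GibbsLDA and its "algorithm" parameter are 
hypothetical names for this sketch, not our package's actual API:

{code:borderStyle=solid}
import org.apache.spark.mllib.clustering.{LDA, LDAModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical entry point mirroring MLlib's LDA API; the object name and
// the "algorithm" parameter are illustrative only.
object GibbsLDA {
  def train(corpus: RDD[(Long, Vector)],   // (docId, termCountVector)
            k: Int,
            algorithm: String = "alias"    // or "sparse", "light", "fast"
           ): LDAModel = {
    // A real implementation would dispatch to the chosen Gibbs sampler and
    // return an MLlib-compatible LDAModel; here we simply delegate to
    // MLlib's LDA as a stand-in so the sketch compiles.
    new LDA().setK(k).run(corpus)
  }
}
{code}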

 Latent Dirichlet Allocation (LDA) using Gibbs sampler 
 --

 Key: SPARK-5556
 URL: https://issues.apache.org/jira/browse/SPARK-5556
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: Guoqiang Li
Assignee: Pedro Rodriguez
 Attachments: LDA_test.xlsx, spark-summit.pptx









[jira] [Updated] (SPARK-1363) Add streaming support for Spark SQL module

2015-03-13 Thread Jason Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dai updated SPARK-1363:
-
Assignee: Saisai Shao

 Add streaming support for Spark SQL module
 --

 Key: SPARK-1363
 URL: https://issues.apache.org/jira/browse/SPARK-1363
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Saisai Shao
Assignee: Saisai Shao
 Attachments: StreamSQLDesignDoc.pdf


 Currently there exist some projects, such as Pig on Storm and SQL on Storm 
 (Squall, SQLstream), that can query over streaming data; for Spark Streaming, 
 however, this is still a blank area. It would be a good feature to add 
 streaming-supported SQL to Spark SQL.
 From a semantic perspective, DStream is quite similar to RDD: they both have 
 join, filter, groupBy and other operators; also, DStream is backed by RDDs, 
 so existing Spark plans are largely transplantable and reusable.
 Catalyst also has a clear division between steps, so we can fully reuse its 
 parsing and logical plan analysis, with only a different physical plan.
 So here we propose to add streaming support in Catalyst.
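
As a rough illustration of this reuse-the-frontend, swap-the-physical-plan 
idea, here is a hedged sketch built on Catalyst's pluggable strategy hook; 
LogicalDStream and physicalScanFor are hypothetical stand-ins, not names from 
the attached design doc:

{code:borderStyle=solid}
import org.apache.spark.sql.Strategy
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.execution.SparkPlan

// Hypothetical sketch: Catalyst's parser and analyzer are reused untouched;
// only stream-backed relations get a different physical planning rule.
object StreamPlanner extends Strategy {
  def apply(plan: LogicalPlan): Seq[SparkPlan] = plan match {
    // case stream: LogicalDStream => physicalScanFor(stream) :: Nil
    case _ => Nil // defer everything else to the built-in batch strategies
  }
}

// Registration would go through Catalyst's extension point, e.g.:
//   sqlContext.experimental.extraStrategies ++= Seq(StreamPlanner)
{code}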






[jira] [Commented] (SPARK-5654) Integrate SparkR into Apache Spark

2015-02-10 Thread Jason Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14315605#comment-14315605
 ] 

Jason Dai commented on SPARK-5654:
--

I agree with this proposal. Given all ongoing efforts around data analytics in 
Spark (e.g., DataFrame, ml, etc.), an R frontend for Spark seems to be very 
well aligned with the project's future plans.

 Integrate SparkR into Apache Spark
 --

 Key: SPARK-5654
 URL: https://issues.apache.org/jira/browse/SPARK-5654
 Project: Spark
  Issue Type: New Feature
  Components: Project Infra
Reporter: Shivaram Venkataraman

 The SparkR project [1] provides a lightweight frontend to launch Spark jobs 
 from R. The project was started at the AMPLab around a year ago and has been 
 incubated as its own project to make sure it can be easily merged into 
 upstream Spark, i.e. without introducing any external dependencies. SparkR's 
 goals are similar to PySpark's, and it shares a similar design pattern, as 
 described in our meetup talk [2] and Spark Summit presentation [3].
 Integrating SparkR into the Apache project will enable R users to use Spark 
 out of the box and, given R's large user base, will help the Spark project 
 reach more users. Additionally, work-in-progress features like R integration 
 with ML Pipelines and DataFrames can be better achieved by development in a 
 unified code base.
 SparkR is available under the Apache 2.0 License and does not have any 
 external dependencies other than requiring users to have R and Java installed 
 on their machines. SparkR's developers come from many organizations, 
 including UC Berkeley, Alteryx and Intel, and we will support future 
 development and maintenance after the integration.
 [1] https://github.com/amplab-extras/SparkR-pkg
 [2] http://files.meetup.com/3138542/SparkR-meetup.pdf
 [3] http://spark-summit.org/2014/talk/sparkr-interactive-r-programs-at-scale-2






[jira] [Updated] (SPARK-5563) LDA with online variational inference

2015-02-05 Thread Jason Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dai updated SPARK-5563:
-
Assignee: yuhao yang

 LDA with online variational inference
 -

 Key: SPARK-5563
 URL: https://issues.apache.org/jira/browse/SPARK-5563
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: yuhao yang

 Latent Dirichlet Allocation (LDA) parameters can be inferred using online 
 variational inference, as in Hoffman, Blei and Bach, “Online Learning for 
 Latent Dirichlet Allocation,” NIPS 2010. This algorithm should be very 
 efficient and should be able to handle much larger datasets than batch 
 algorithms for LDA.
 This algorithm will also be important for supporting streaming versions of 
 LDA.
 The implementation will ideally use the same API as the existing LDA but use 
 a different underlying optimizer.
 This will require hooking into the existing mllib.optimization frameworks.
 It will also require some discussion about whether batch versions of online 
 variational inference should be supported, as well as which variational 
 approximation should be used now and in the future.
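
A minimal sketch of the same-API, different-optimizer design described above, 
assuming an OnlineLDAOptimizer hook alongside the existing LDA entry point 
(treat the exact class and method names as illustrative):

{code:borderStyle=solid}
import org.apache.spark.mllib.clustering.{LDA, LDAModel, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Same LDA entry point as the batch (EM) path; only the optimizer differs.
object OnlineLDAExample {
  def train(corpus: RDD[(Long, Vector)]): LDAModel =
    new LDA()
      .setK(20)
      // Each iteration processes a small random mini-batch of documents.
      .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.05))
      .run(corpus)
}
{code}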






[jira] [Commented] (SPARK-4672) Cut off the super long serialization chain in GraphX to avoid the StackOverflow error

2014-12-02 Thread Jason Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232524#comment-14232524
 ] 

Jason Dai commented on SPARK-4672:
--

We ran into the same issue, and this is a nice summary of the bug analysis. On 
the other hand, while this may fix the specific GraphX issue, I don't think it 
is generally applicable to the super long lineages that can be generated in 
GraphX or other iterative algorithms. 

In particular, the user can define arbitrary functions, which can be called in 
RDD.compute() and refer to an arbitrary member variable that is an RDD, or 
which can be used to construct another RDD, such as:
{noformat}
  class MyRDD(val rdd1: RDD, val rdd2: RDD, func1) extends RDD {
    val func2 = (f, iter1, iter2) => iter1 ++ f(iter2)
    ...
    override def compute(part, sc) = {
      func2(func1, rdd1.iterator(part, sc), rdd2.iterator(part, sc))
    }
    ...
    def newRDD(rdd3: RDD, func3) = {
      val func4 = func2(func3, _, _)
      new AnotherRDD() {
        override def compute(part, sc) = {
          func4(rdd1.iterator(part, sc) ++ rdd2.iterator(part, sc),
                rdd3.iterator(part, sc))
        }
      }
    }
  }
{noformat}
In this case, we will need to serialize rdd1 and rdd2 before MyRDD is 
checkpointed; after MyRDD is checkpointed, we no longer need to serialize rdd1 
or rdd2, but we cannot clear func2 either. 

I think we can fix this more general issue as follows:
# As only RDD.compute (or RDD.iterator) should be called at the worker side, we 
only need to serialize whatever is referenced in that function (no matter 
whether it is a member variable or not).
# After the RDD is checkpointed, RDD.compute should be changed to read the 
checkpoint file, which will not reference other variables; again, we only need 
to serialize whatever is referenced in that function now.
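
A minimal sketch of point 2, assuming a hypothetical CheckpointedReader helper; 
this is not Spark's actual checkpointing code:

{code:borderStyle=solid}
import org.apache.spark.{Partition, TaskContext}

// Hypothetical stand-in for reading a checkpointed partition back from disk.
object CheckpointedReader {
  def read[T](path: String, split: Partition): Iterator[T] =
    sys.error(s"would read partition ${split.index} of $path")
}

// Sketch of a compute() that stops referencing upstream RDDs and closures
// once the RDD has been checkpointed; only what compute() references would
// ever need to be serialized.
abstract class SwitchingRDD[T] extends Serializable {
  @volatile protected var checkpointFile: Option[String] = None

  // The "real" computation, which may reference other RDDs and closures.
  protected def doCompute(split: Partition, context: TaskContext): Iterator[T]

  final def compute(split: Partition, context: TaskContext): Iterator[T] =
    checkpointFile match {
      case Some(path) => CheckpointedReader.read[T](path, split)
      case None       => doCompute(split, context)
    }
}
{code}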


 Cut off the super long serialization chain in GraphX to avoid the 
 StackOverflow error
 -

 Key: SPARK-4672
 URL: https://issues.apache.org/jira/browse/SPARK-4672
 Project: Spark
  Issue Type: Bug
  Components: GraphX, Spark Core
Affects Versions: 1.1.0
Reporter: Lijie Xu
Priority: Critical
 Fix For: 1.2.0


 While running iterative algorithms in GraphX, a StackOverflow error will 
 stably occur in the serialization phase at about the 300th iteration. In 
 general, these kinds of algorithms have two things in common:
 # They have a long computing chain.
 {code:borderStyle=solid}
 (e.g., “degreeGraph => subGraph => degreeGraph => subGraph => … =>”)
 {code}
 # They will iterate many times to converge. An example:
 {code:borderStyle=solid}
 // K-Core Algorithm
 val kNum = 5
 var isConverged = false
 var degreeGraph = graph.outerJoinVertices(graph.degrees) {
   (vid, vd, degree) => degree.getOrElse(0)
 }.cache()

 do {
   val subGraph = degreeGraph.subgraph(
     vpred = (vid, degree) => degree >= kNum
   ).cache()
   val newDegreeGraph = subGraph.degrees
   degreeGraph = subGraph.outerJoinVertices(newDegreeGraph) {
     (vid, vd, degree) => degree.getOrElse(0)
   }.cache()
   isConverged = check(degreeGraph)
 } while (!isConverged)
 {code}
 After about 300 iterations, StackOverflow will definitely occur with the 
 following stack trace:
 {code:borderStyle=solid}
 Exception in thread "main" org.apache.spark.SparkException: Job aborted due 
 to stage failure: Task serialization failed: java.lang.StackOverflowError
 java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1275)
 java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1230)
 java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1426)
 java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
 {code}
 It is a very tricky bug, which only occurs with enough iterations. Since it 
 took us a long time to find out its causes, we detail them in the following 
 3 paragraphs.
  
 h3. Phase 1: Try using checkpoint() to shorten the lineage
 It's easy to come to the thought that the long lineage may be the cause. For 
 some RDDs, their lineages may grow with the iterations. Also, because of some 
 magical references, their lineage lengths never decrease and finally become 
 very long. As a result, the call stack of the task's 
 serialization()/deserialization() methods will be very long too, which 
 finally exhausts the whole JVM stack.
 Indeed, the lineage of some RDDs (e.g., EdgeRDD.partitionsRDD) grows by 3 
 OneToOne dependencies in each iteration in the above example. Lineage length 
 refers to the maximum length of OneToOne dependencies (e.g., from the 
 finalRDD to the ShuffledRDD) in each stage.
 To shorten the lineage, a checkpoint() is performed every N (e.g., 10) 
 iterations. Then, the lineage will drop down when it reaches a certain length 
 (e.g., 33).
 However, the StackOverflow error still occurs after 300+ iterations!
 h3. Phase 2: Abnormal f closure function leads to an unbreakable 
 serialization chain
 After a long-time debug, we found that an abnormal _*f*_ function closure and 
 a potential bug in GraphX (to be detailed in Phase 3) are the "Suspect 
 Zero". They together build another serialization chain that can bypass the 
 lineage cut by checkpoint() (as shown in Figure 1). In other words, the 
 serialization chain can be as long as the original lineage before 
 checkpoint().
 Figure 1 shows how the unbreakable serialization chain is generated. Yes, the 
 OneToOneDep can be cut off by checkpoint(). However, the serialization chain 
 can still access the previous RDDs through the (1)-(2) reference chain. As a 
 result, the checkpoint() action is meaningless and the lineage stays as long 
 as before. 
 !https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g1.png|width=100%!
 The (1)-(2) chain can be observed in the debug view (in Figure 2).
 {code:borderStyle=solid}
 _rdd (i.e., A in Figure 1, checkpointed) -> f -> $outer (VertexRDD) -> 
 partitionsRDD:MapPartitionsRDD -> RDDs 
 {code}

[jira] [Commented] (SPARK-4672) Cut off the super long serialization chain in GraphX to avoid the StackOverflow error

2014-12-02 Thread Jason Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232529#comment-14232529
 ] 

Jason Dai commented on SPARK-4672:
--

[~rxin] What exactly do you mean by removing all the function closures f from 
an RDD if it is checkpointed?

In my previous example, we should not clear func2 even if MyRDD is 
checkpointed; otherwise newRDD() will no longer be correct. Instead, we should 
make sure we only include RDD.compute (or RDD.iterator) in the closure (no 
matter whether the RDD is checkpointed or not), and change RDD.compute to read 
the checkpoint files once it is checkpointed.

 Cut off the super long serialization chain in GraphX to avoid the 
 StackOverflow error
 -

 Key: SPARK-4672
 URL: https://issues.apache.org/jira/browse/SPARK-4672
 Project: Spark
  Issue Type: Bug
  Components: GraphX, Spark Core
Affects Versions: 1.1.0
Reporter: Lijie Xu
Priority: Critical
 Fix For: 1.2.0



[jira] [Commented] (SPARK-4672) Cut off the super long serialization chain in GraphX to avoid the StackOverflow error

2014-12-02 Thread Jason Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232549#comment-14232549
 ] 

Jason Dai commented on SPARK-4672:
--

I can see two possible ways to fix this:
# Define customized closure serialization mechanisms in task serialization, 
which can use reflection to carefully choose what to serialize (i.e., only 
those objects referenced by RDD.iterator); this potentially needs to deal with 
many details and can be error prone.
# In task serialization, each base RDD can generate a dual, shippable RDD, 
which has all transient member variables and only implements the compute() 
function (which in turn calls the compute() function of the base RDD through 
ClosureCleaner.clean()); we can then probably rely on the Java serializer to 
handle this correctly.
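
A minimal sketch of option 2 under the stated assumptions; the class below is 
hypothetical and elides how the base RDD is rebound on the worker (e.g., from 
a checkpoint file):

{code:borderStyle=solid}
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical "dual, shippable RDD": every heavyweight reference is
// transient, so serializing the dual ships neither the base RDD nor its
// lineage; only the compute() entry point crosses the wire.
class DualRDD[T](@transient private var base: RDD[T]) extends Serializable {
  def compute(split: Partition, context: TaskContext): Iterator[T] = {
    // On the worker, `base` must be rebound after deserialization
    // (elided here); until then it is null because it is transient.
    require(base != null, "base RDD must be rebound after deserialization")
    base.iterator(split, context)
  }
}
{code}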

 Cut off the super long serialization chain in GraphX to avoid the 
 StackOverflow error
 -

 Key: SPARK-4672
 URL: https://issues.apache.org/jira/browse/SPARK-4672
 Project: Spark
  Issue Type: Bug
  Components: GraphX, Spark Core
Affects Versions: 1.1.0
Reporter: Lijie Xu
Priority: Critical
 Fix For: 1.2.0



[jira] [Comment Edited] (SPARK-4672) Cut off the super long serialization chain in GraphX to avoid the StackOverflow error

2014-12-02 Thread Jason Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14232549#comment-14232549
 ] 

Jason Dai edited comment on SPARK-4672 at 12/3/14 6:01 AM:
---

I can see two possible ways to fix this:
# Define customized closure serialization mechanisms in task serialization, 
which can use reflection to carefully choose what to serialize (i.e., only 
those objects referenced by RDD.iterator); this potentially needs to deal with 
many details and can be error prone.
# In task serialization, each base RDD can generate a dual, shippable RDD, 
which only has transient member variables, and only implements the compute() 
function (which in turn calls the compute() function of the base RDD through 
ClosureCleaner.clean()); we can then probably rely on the Java serializer to 
handle this correctly.


was (Author: jason.dai):
I can see two possible ways to fix this:
# Define customized closure serialization mechanisms in task serializations, 
which can use reflections to carefully choose which to serialize (i.e., only 
those referenced by RDD.iterator); this potentially needs to deal with many 
details and can be error prone.
# In task serialization, each base RDD can generate a dual, shippable RDD, 
which has all transient member variables, and only implements the compute() 
function (which in turn calls the compute() function of the base RDD through 
ClosureCleaner.clean()); we can then probably rely on the Java serializer to 
handle this correctly.

 Cut off the super long serialization chain in GraphX to avoid the 
 StackOverflow error
 -

 Key: SPARK-4672
 URL: https://issues.apache.org/jira/browse/SPARK-4672
 Project: Spark
  Issue Type: Bug
  Components: GraphX, Spark Core
Affects Versions: 1.1.0
Reporter: Lijie Xu
Priority: Critical
 Fix For: 1.2.0



[jira] [Updated] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2014-10-29 Thread Jason Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dai updated SPARK-2926:
-
Assignee: Saisai Shao

 Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
 --

 Key: SPARK-2926
 URL: https://issues.apache.org/jira/browse/SPARK-2926
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle
Affects Versions: 1.1.0
Reporter: Saisai Shao
Assignee: Saisai Shao
 Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test 
 Report(contd).pdf, Spark Shuffle Test Report.pdf


 Currently Spark has already integrated sort-based shuffle write, which 
 greatly improves the IO performance and reduces the memory consumption when 
 the number of reducers is very large. But the reducer side still adopts the 
 hash-based shuffle reader implementation, which neglects the ordering of map 
 output data in some situations.
 Here we propose an MR-style, sort-merge shuffle reader for sort-based shuffle 
 to further improve its performance.
 Work-in-progress code and a performance test report will be posted later, 
 when some unit test bugs are fixed.
 Any comments would be greatly appreciated. 
 Thanks a lot.
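
For illustration, a hedged sketch of the core merge step such a reader 
implies: k map-output streams, each already sorted by key, merged through a 
priority queue (all names are illustrative, not the actual patch):

{code:borderStyle=solid}
import scala.collection.mutable

// Merge k iterators that are each sorted by key into one sorted iterator,
// MR-style; a sketch of the sort-merge read path, not the actual patch.
def mergeSorted[K, V](streams: Seq[Iterator[(K, V)]])
                     (implicit ord: Ordering[K]): Iterator[(K, V)] = {
  // PriorityQueue dequeues the maximum, so reverse the key order
  // to obtain a min-heap on the key.
  val heap = mutable.PriorityQueue.empty[(K, V, Iterator[(K, V)])](
    Ordering.by[(K, V, Iterator[(K, V)]), K](_._1).reverse)
  for (s <- streams if s.hasNext) {
    val (k, v) = s.next()
    heap.enqueue((k, v, s))
  }
  new Iterator[(K, V)] {
    def hasNext: Boolean = heap.nonEmpty
    def next(): (K, V) = {
      val (k, v, s) = heap.dequeue()   // smallest remaining key
      if (s.hasNext) {                 // refill from the same stream
        val (k2, v2) = s.next()
        heap.enqueue((k2, v2, s))
      }
      (k, v)
    }
  }
}
{code}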






[jira] [Updated] (SPARK-4094) checkpoint should still be available after rdd actions

2014-10-29 Thread Jason Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dai updated SPARK-4094:
-
Assignee: Zhang, Liye

 checkpoint should still be available after rdd actions
 --

 Key: SPARK-4094
 URL: https://issues.apache.org/jira/browse/SPARK-4094
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Zhang, Liye
Assignee: Zhang, Liye

 rdd.checkpoint() must be called before any actions on this rdd; if any other 
 action comes first, the checkpoint will never succeed. Take the following 
 code as an example:
 *rdd = sc.makeRDD(...)*
 *rdd.collect()*
 *rdd.checkpoint()*
 *rdd.count()*
 This rdd will never be checkpointed. This does not happen for RDD cache: 
 cache() always takes effect before rdd actions, no matter whether there were 
 any actions before cache().
 So rdd.checkpoint() should have the same behavior as rdd.cache().
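
A self-contained way to observe the reported behavior, assuming a live 
SparkContext sc with a checkpoint directory already set:

{code:borderStyle=solid}
// Reproduces the reported behavior; assumes `sc` is a live SparkContext
// and sc.setCheckpointDir(...) has been called.
val rdd = sc.makeRDD(1 to 100)
rdd.collect()                  // an action *before* checkpoint()
rdd.checkpoint()               // mark for checkpointing afterwards
rdd.count()                    // action that should trigger the checkpoint
println(rdd.isCheckpointed)    // reportedly prints false, per this issue
{code}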






[jira] [Updated] (SPARK-4078) New FsPermission instance w/o FsPermission.createImmutable in eventlog

2014-10-29 Thread Jason Dai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Dai updated SPARK-4078:
-
Assignee: Jason Dai

 New FsPermission instance w/o FsPermission.createImmutable in eventlog
 --

 Key: SPARK-4078
 URL: https://issues.apache.org/jira/browse/SPARK-4078
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Jie Huang
Assignee: Jason Dai

 By default, Spark builds its package against the Hadoop 1.0.4 version, which 
 has an FsPermission bug (see HADOOP-7629 by Todd Lipcon). This bug was fixed 
 in the 1.1 release. By using the FsPermission.createImmutable() API, end 
 users may see an RPC exception like the one below (if the event log over 
 HDFS is turned on). 
 {quote}
 Exception in thread "main" java.io.IOException: Call to sr484/10.1.2.84:54310 
 failed on local exception: java.io.EOFException
 at org.apache.hadoop.ipc.Client.wrapException(Client.java:1150)
 at org.apache.hadoop.ipc.Client.call(Client.java:1118)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
 at $Proxy6.setPermission(Unknown Source)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
 at 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
 at $Proxy6.setPermission(Unknown Source)
 at org.apache.hadoop.hdfs.DFSClient.setPermission(DFSClient.java:1285)
 at 
 org.apache.hadoop.hdfs.DistributedFileSystem.setPermission(DistributedFileSystem.java:572)
 at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:138)
 at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
 at 
 org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
 at org.apache.spark.SparkContext.<init>(SparkContext.scala:324)
 {quote}
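
A hedged sketch of the workaround the title suggests, constructing a plain 
FsPermission instead of the immutable variant; the permission value is 
illustrative:

{code:borderStyle=solid}
import org.apache.hadoop.fs.permission.FsPermission

// HADOOP-7629: the instance returned by FsPermission.createImmutable() does
// not round-trip correctly over RPC in Hadoop 1.0.x. A plain FsPermission
// avoids the bug; the 0770 mode here is illustrative.
val permission = new FsPermission(Integer.parseInt("770", 8).toShort)
// e.g.: fileSystem.setPermission(eventLogDir, permission)
{code}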


