[jira] [Updated] (SPARK-9273) Add Convolutional Neural network to Spark MLlib
[ https://issues.apache.org/jira/browse/SPARK-9273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Dai updated SPARK-9273:
-----------------------------
    Assignee: yuhao yang

> Add Convolutional Neural network to Spark MLlib
> -----------------------------------------------
>
>                 Key: SPARK-9273
>                 URL: https://issues.apache.org/jira/browse/SPARK-9273
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: yuhao yang
>            Assignee: yuhao yang
>
> Add Convolutional Neural network to Spark MLlib
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14705810#comment-14705810 ]

Jason Dai commented on SPARK-5556:
----------------------------------
[~pedrorodriguez] We'll try to make a Spark package based on our repo; please help take a look at the code and provide your feedback. And please let us know if there is anything we can collaborate on for LDA/topic modeling on Spark.

Latent Dirichlet Allocation (LDA) using Gibbs sampler
-----------------------------------------------------

                Key: SPARK-5556
                URL: https://issues.apache.org/jira/browse/SPARK-5556
            Project: Spark
         Issue Type: New Feature
         Components: MLlib
           Reporter: Guoqiang Li
           Assignee: Pedro Rodriguez
        Attachments: LDA_test.xlsx, spark-summit.pptx
[jira] [Commented] (SPARK-10041) Proposal of Parameter Server Interface for Spark
[ https://issues.apache.org/jira/browse/SPARK-10041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14699903#comment-14699903 ]

Jason Dai commented on SPARK-10041:
-----------------------------------
I think this one is different from SPARK-6932, as it tries to decouple the interface design from the actual implementation; a well-defined interface/abstraction that allows different underlying implementations can be very useful.

Proposal of Parameter Server Interface for Spark
------------------------------------------------

                Key: SPARK-10041
                URL: https://issues.apache.org/jira/browse/SPARK-10041
            Project: Spark
         Issue Type: New Feature
         Components: ML, MLlib
           Reporter: Yi Liu
        Attachments: Proposal of Parameter Server Interface for Spark - v1.pdf

Many large-scale machine learning algorithms (logistic regression, LDA, neural networks, etc.) have been built on top of Apache Spark. As discussed in SPARK-4590, a Parameter Server (PS) architecture can greatly improve the scalability and efficiency of such large-scale machine learning workloads. There have been previous discussions on possible Parameter Server implementations inside Spark (e.g., SPARK-6932). However, at this stage we believe it is more important for the community to first define a proper Parameter Server interface, which can be decoupled from the actual PS implementations; consequently, it becomes possible to support different Parameter Server implementations in Spark later. The attached document contains our initial proposal of a Parameter Server interface for ML algorithms on Spark, including the data model, supported operations, epoch support, and possible Spark integrations.
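For illustration only, a minimal sketch of what such a decoupled interface might look like; every name here ({{PSVector}}, {{PSClient}}, {{pull}}/{{push}}, {{awaitEpoch}}) is invented for this sketch and is not taken from the attached proposal:
{code:borderStyle=solid}
// Hypothetical sketch of a PS abstraction decoupled from any implementation.
// All names are illustrative; the real proposal is in the attached PDF.
trait PSVector {
  def name: String
  def dimension: Int
}

trait PSClient {
  // Data model: named, distributed vectors living on the parameter servers.
  def createVector(name: String, dimension: Int): PSVector

  // Supported operations: read current values, apply additive updates.
  def pull(vector: PSVector, indices: Array[Int]): Array[Double]
  def push(vector: PSVector, indices: Array[Int], deltas: Array[Double]): Unit

  // Epoch support: a barrier between iterations, behind which a backend may
  // implement fully synchronous (BSP) or bounded-staleness semantics.
  def awaitEpoch(epoch: Int): Unit
}
{code}
Worker-side tasks (e.g., inside mapPartitions) would pull the weights they need, compute gradients locally, and push deltas back; different PS backends could then implement the same trait without changing the algorithms.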
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14693579#comment-14693579 ]

Jason Dai commented on SPARK-5556:
----------------------------------
Sure; we can share our code on GitHub, and then try to make a Spark package :-)
[jira] [Commented] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler
[ https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692469#comment-14692469 ]

Jason Dai commented on SPARK-5556:
----------------------------------
[~pedrorodriguez] I wonder if you have made any progress on the LDA package. We have actually built a package of topic modeling algorithms for our use cases, which contains Gibbs Sampling LDA (adapted to the MLlib LDA interface based on the PRs/code from [~pedrorodriguez] and [~gq], and including the AliasLDA, SparseLDA, LightLDA and FastLDA algorithms), as well as Hierarchical LDA (implemented by [~yuhaoyan] from our team). We can also share that package for people to try the different algorithms.
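For readers trying such a package: a minimal sketch of the MLlib LDA entry point that the samplers above are adapted to, assuming an existing SparkContext {{sc}} (note MLlib itself ships EM-based optimization, not these Gibbs samplers):
{code:borderStyle=solid}
import org.apache.spark.mllib.clustering.LDA
import org.apache.spark.mllib.linalg.Vectors

// A tiny corpus of (documentId, termCountVector) pairs -- the input shape
// the MLlib LDA interface expects. `sc` is an existing SparkContext.
val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 2.0, 0.0, 5.0)),
  (1L, Vectors.dense(0.0, 1.0, 3.0, 0.0)),
  (2L, Vectors.dense(4.0, 0.0, 1.0, 2.0))
))

// A package like the one described above plugs its samplers (AliasLDA,
// SparseLDA, LightLDA, FastLDA) behind this same entry point.
val model = new LDA().setK(2).setMaxIterations(20).run(corpus)
println(model.topicsMatrix) // topics as a (vocabSize x k) matrix
{code}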
[jira] [Updated] (SPARK-1363) Add streaming support for Spark SQL module
[ https://issues.apache.org/jira/browse/SPARK-1363?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Dai updated SPARK-1363:
-----------------------------
    Assignee: Saisai Shao

Add streaming support for Spark SQL module
------------------------------------------

                Key: SPARK-1363
                URL: https://issues.apache.org/jira/browse/SPARK-1363
            Project: Spark
         Issue Type: New Feature
         Components: SQL
           Reporter: Saisai Shao
           Assignee: Saisai Shao
        Attachments: StreamSQLDesignDoc.pdf

Currently there exist projects such as Pig on Storm and SQL on Storm (Squall, SQLstream) that can query over streaming data, but for Spark Streaming this is a blank area. It would be a good feature to add streaming-capable SQL to Spark SQL. From a semantic perspective, DStream is quite like RDD: both have join, filter, groupBy and other such operators; moreover, DStream is backed by RDDs, so existing Spark plans are transplantable and reusable. Catalyst also has a clear division between its steps, so we can fully reuse its parsing and logical plan analysis steps, with only a different physical plan. So here we propose to add streaming support in Catalyst.
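Because each DStream micro-batch is backed by an RDD, SQL can already be applied batch-by-batch even without the proposed Catalyst integration; a hedged sketch of that workaround (using the later SQLContext/DataFrame API, with the socket source and port as assumptions):
{code:borderStyle=solid}
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

// `sc` is an existing SparkContext; the socket source/port are illustrative.
val ssc = new StreamingContext(sc, Seconds(1))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val lines = ssc.socketTextStream("localhost", 9999)

// Each micro-batch is an RDD, so a per-batch table can be registered and
// queried -- exactly the RDD backing that this proposal builds on.
lines.foreachRDD { rdd =>
  rdd.map(Tuple1(_)).toDF("line").registerTempTable("lines")
  sqlContext.sql("SELECT line, COUNT(*) AS cnt FROM lines GROUP BY line").show()
}

ssc.start()
ssc.awaitTermination()
{code}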
[jira] [Commented] (SPARK-5654) Integrate SparkR into Apache Spark
[ https://issues.apache.org/jira/browse/SPARK-5654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14315605#comment-14315605 ]

Jason Dai commented on SPARK-5654:
----------------------------------
I agree with this proposal. Given all the ongoing efforts around data analytics in Spark (e.g., DataFrame, ml, etc.), an R frontend for Spark seems to be very well aligned with the project's future plans.

Integrate SparkR into Apache Spark
----------------------------------

                Key: SPARK-5654
                URL: https://issues.apache.org/jira/browse/SPARK-5654
            Project: Spark
         Issue Type: New Feature
         Components: Project Infra
           Reporter: Shivaram Venkataraman

The SparkR project [1] provides a light-weight frontend to launch Spark jobs from R. The project was started at the AMPLab around a year ago and has been incubated as its own project to make sure it can be easily merged into upstream Spark, i.e. not introduce any external dependencies etc. SparkR's goals are similar to PySpark's, and it shares a similar design pattern as described in our meetup talk [2] and Spark Summit presentation [3]. Integrating SparkR into the Apache project will enable R users to use Spark out of the box and, given R's large user base, will help the Spark project reach more users. Additionally, work-in-progress features like R integration with ML Pipelines and DataFrames can be better achieved by development in a unified code base. SparkR is available under the Apache 2.0 License and does not have any external dependencies other than requiring users to have R and Java installed on their machines. SparkR's developers come from many organizations including UC Berkeley, Alteryx, and Intel, and we will support future development and maintenance after the integration.

[1] https://github.com/amplab-extras/SparkR-pkg
[2] http://files.meetup.com/3138542/SparkR-meetup.pdf
[3] http://spark-summit.org/2014/talk/sparkr-interactive-r-programs-at-scale-2
[jira] [Updated] (SPARK-5563) LDA with online variational inference
[ https://issues.apache.org/jira/browse/SPARK-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Dai updated SPARK-5563:
-----------------------------
    Assignee: yuhao yang

LDA with online variational inference
-------------------------------------

                Key: SPARK-5563
                URL: https://issues.apache.org/jira/browse/SPARK-5563
            Project: Spark
         Issue Type: Improvement
         Components: MLlib
   Affects Versions: 1.3.0
           Reporter: Joseph K. Bradley
           Assignee: yuhao yang

Latent Dirichlet Allocation (LDA) parameters can be inferred using online variational inference, as in Hoffman, Blei and Bach. "Online Learning for Latent Dirichlet Allocation." NIPS, 2010. This algorithm should be very efficient and should be able to handle much larger datasets than batch algorithms for LDA. It will also be important for supporting streaming versions of LDA. The implementation will ideally use the same API as the existing LDA but a different underlying optimizer; this will require hooking into the existing mllib.optimization frameworks. It will also require some discussion about whether batch versions of online variational inference should be supported, as well as which variational approximation should be used now or in the future.
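A sketch of the intended usage under the description's assumption (same LDA entry point, different underlying optimizer). The {{OnlineLDAOptimizer}} name and parameters below match what eventually shipped in MLlib, but treat them as illustrative here:
{code:borderStyle=solid}
import org.apache.spark.mllib.clustering.{LDA, OnlineLDAOptimizer}
import org.apache.spark.mllib.linalg.Vectors

// `sc` is an existing SparkContext; the corpus is a toy example.
val corpus = sc.parallelize(Seq(
  (0L, Vectors.dense(1.0, 0.0, 3.0)),
  (1L, Vectors.dense(0.0, 2.0, 1.0))
))

// Same API as the existing LDA, but with the online variational optimizer
// swapped in; each iteration processes a sampled mini-batch of documents.
val model = new LDA()
  .setK(2)
  .setMaxIterations(30)
  .setOptimizer(new OnlineLDAOptimizer().setMiniBatchFraction(0.1))
  .run(corpus)
{code}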
[jira] [Commented] (SPARK-4672) Cut off the super long serialization chain in GraphX to avoid the StackOverflow error
[ https://issues.apache.org/jira/browse/SPARK-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232524#comment-14232524 ]

Jason Dai commented on SPARK-4672:
----------------------------------
We ran into the same issue, and this is a nice summary of the bug analysis. On the other hand, while this may fix the specific GraphX issue, I don't think it is generally applicable for dealing with the super long lineage that can be generated in GraphX or other iterative algorithms. In particular, the user can define arbitrary functions, which can be called in RDD.compute() and refer to an arbitrary member variable that is an RDD, or can be used to construct another RDD, such as:
{noformat}
class MyRDD(val rdd1, val rdd2, func1) extends RDD {
  val func2 = (f, iter1, iter2) => iter1 ++ f(iter2)
  ...
  override def compute(part, sc) = {
    func2(func1, rdd1.iterator(part, sc), rdd2.iterator(part, sc))
  }
  ...
  def newRDD(val rdd3, func3) = {
    val func4 = func2(func3)
    new AnotherRDD() {
      override def compute(part, sc) = {
        func4(rdd1.iterator(part, sc) ++ rdd2.iterator(part, sc), rdd3.iterator(part, sc))
      }
    }
  }
}
{noformat}
In this case, we will need to serialize rdd1 and rdd2 before MyRDD is checkpointed; after MyRDD is checkpointed, we don't need to serialize rdd1 or rdd2, but we cannot clear func2 either. I think we can fix this more general issue as follows:
# As only RDD.compute (or RDD.iterator) should be called at the worker side, we only need to serialize anything that is referenced in that function (no matter whether it is a member variable or not).
# After the RDD is checkpointed, RDD.compute should be changed to read the checkpoint file, which will not reference other variables -- again, we only need to serialize whatever is referenced in that function now.

Cut off the super long serialization chain in GraphX to avoid the StackOverflow error
-------------------------------------------------------------------------------------

                Key: SPARK-4672
                URL: https://issues.apache.org/jira/browse/SPARK-4672
            Project: Spark
         Issue Type: Bug
         Components: GraphX, Spark Core
   Affects Versions: 1.1.0
           Reporter: Lijie Xu
           Priority: Critical
            Fix For: 1.2.0

While running iterative algorithms in GraphX, a StackOverflow error will stably occur in the serialization phase at about the 300th iteration. In general, these kinds of algorithms have two things in common:
# They have a long computing chain, e.g.:
{code:borderStyle=solid}
degreeGraph => subGraph => degreeGraph => subGraph => ... =>
{code}
# They iterate many times to converge. An example:
{code:borderStyle=solid}
// K-Core Algorithm
val kNum = 5

var degreeGraph = graph.outerJoinVertices(graph.degrees) {
  (vid, vd, degree) => degree.getOrElse(0)
}.cache()

do {
  val subGraph = degreeGraph.subgraph(
    vpred = (vid, degree) => degree >= kNum
  ).cache()

  val newDegreeGraph = subGraph.degrees

  degreeGraph = subGraph.outerJoinVertices(newDegreeGraph) {
    (vid, vd, degree) => degree.getOrElse(0)
  }.cache()

  isConverged = check(degreeGraph)
} while (isConverged == false)
{code}
After about 300 iterations, a StackOverflow will definitely occur with the following stack trace:
{code:borderStyle=solid}
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task serialization failed: java.lang.StackOverflowError
java.io.ObjectOutputStream.writeNonProxyDesc(ObjectOutputStream.java:1275)
java.io.ObjectOutputStream.writeClassDesc(ObjectOutputStream.java:1230)
java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1426)
java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
{code}
It is a very tricky bug, which only occurs with enough iterations. Since it took us a long time to find out its causes, we will detail them in the following 3 paragraphs.

h3. Phase 1: Try using checkpoint() to shorten the lineage

It's easy to come to the thought that the long lineage may be the cause. For some RDDs, their lineages may grow with the iterations. Also, due to some magical references, their lineage lengths never decrease and finally become very long. As a result, the call stack of the task's serialization()/deserialization() method will be very long too, which finally exhausts the whole JVM stack. Indeed, the lineage of some RDDs (e.g., EdgeRDD.partitionsRDD) grows by 3 OneToOne dependencies in each iteration in the above example. Here, lineage length refers to the maximum length of OneToOne dependencies (e.g., from the final RDD to the ShuffledRDD) in each stage. To shorten the lineage, a checkpoint() is performed every N (e.g., 10) iterations. Then the lineage will drop down when it reaches a certain length (e.g., 33). However, the StackOverflow error still occurs after 300+ iterations!

h3. Phase 2: Abnormal f closure function leads to an unbreakable serialization chain

After a long debugging session, we found that an abnormal _*f*_ function closure and a potential bug in GraphX (to be detailed in Phase 3) are Suspect Zero. Together they build another serialization chain that can bypass the lineage cut by checkpoint() (as shown in Figure 1). In other words, the serialization chain can be as long as the original lineage before checkpoint(). Figure 1 shows how the unbreakable serialization chain is generated. Yes, the OneToOneDep can be cut off by checkpoint(). However, the serialization chain can still access the previous RDDs through the (1)-(2) reference chain. As a result, the checkpoint() action is meaningless and the lineage is as long as before.

!https://raw.githubusercontent.com/JerryLead/Misc/master/SparkPRFigures/g1.png|width=100%!

The (1)-(2) chain can be observed in the debug view (in Figure 2):
{code:borderStyle=solid}
_rdd (i.e., A in Figure 1, checkpointed) -> f -> $outer (VertexRDD) -> partitionsRDD:MapPartitionsRDD -> RDDs
{code}
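A minimal sketch of the Phase 1 mitigation described above (periodic checkpoint() to truncate the lineage), applied to the K-Core loop; the interval of 10 and the checkpoint directory are illustrative choices, and {{sc}}, {{degreeGraph}} and {{check}} are from the example above:
{code:borderStyle=solid}
// Truncate the lineage every N iterations by checkpointing the graph's RDDs.
sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

var i = 0
do {
  // ... one K-Core iteration producing a new degreeGraph ...
  i += 1
  if (i % 10 == 0) {
    degreeGraph.vertices.checkpoint() // mark for checkpointing
    degreeGraph.edges.checkpoint()
    degreeGraph.vertices.count()      // force a job so the checkpoint
    degreeGraph.edges.count()         // is actually written
  }
  isConverged = check(degreeGraph)
} while (!isConverged)
{code}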
[jira] [Commented] (SPARK-4672) Cut off the super long serialization chain in GraphX to avoid the StackOverflow error
[ https://issues.apache.org/jira/browse/SPARK-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232529#comment-14232529 ]

Jason Dai commented on SPARK-4672:
----------------------------------
[~rxin] What exactly do you mean by "remove all the function closure f from an RDD if it is checkpointed"? In my previous example, we should not clear func2 even if MyRDD is checkpointed; otherwise newRDD() will no longer be correct. Instead, we should make sure we only include RDD.compute (or RDD.iterator) in the closure (no matter whether the RDD is checkpointed or not), and change RDD.compute to read the checkpoint files once it is checkpointed.
[jira] [Commented] (SPARK-4672) Cut off the super long serialization chain in GraphX to avoid the StackOverflow error
[ https://issues.apache.org/jira/browse/SPARK-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232549#comment-14232549 ]

Jason Dai commented on SPARK-4672:
----------------------------------
I can see two possible ways to fix this:
# Define customized closure serialization mechanisms in task serialization, which can use reflection to carefully choose what to serialize (i.e., only those things referenced by RDD.iterator); this potentially needs to deal with many details and can be error prone.
# In task serialization, each base RDD can generate a dual, shippable RDD, which has all transient member variables, and only implements the compute() function (which in turn calls the compute() function of the base RDD through ClosureCleaner.clean()); we can then probably rely on the Java serializer to handle this correctly.
[jira] [Comment Edited] (SPARK-4672) Cut off the super long serialization chain in GraphX to avoid the StackOverflow error
[ https://issues.apache.org/jira/browse/SPARK-4672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14232549#comment-14232549 ]

Jason Dai edited comment on SPARK-4672 at 12/3/14 6:01 AM:
-----------------------------------------------------------
I can see two possible ways to fix this:
# Define customized closure serialization mechanisms in task serialization, which can use reflection to carefully choose what to serialize (i.e., only those things referenced by RDD.iterator); this potentially needs to deal with many details and can be error prone.
# In task serialization, each base RDD can generate a dual, shippable RDD, which only has transient member variables, and only implements the compute() function (which in turn calls the compute() function of the base RDD through ClosureCleaner.clean()); we can then probably rely on the Java serializer to handle this correctly.

was (Author: jason.dai):
I can see two possible ways to fix this:
# Define customized closure serialization mechanisms in task serialization, which can use reflection to carefully choose what to serialize (i.e., only those things referenced by RDD.iterator); this potentially needs to deal with many details and can be error prone.
# In task serialization, each base RDD can generate a dual, shippable RDD, which has all transient member variables, and only implements the compute() function (which in turn calls the compute() function of the base RDD through ClosureCleaner.clean()); we can then probably rely on the Java serializer to handle this correctly.
[jira] [Updated] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
[ https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Dai updated SPARK-2926:
-----------------------------
    Assignee: Saisai Shao

Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
------------------------------------------------------------------

                Key: SPARK-2926
                URL: https://issues.apache.org/jira/browse/SPARK-2926
            Project: Spark
         Issue Type: Improvement
         Components: Shuffle
   Affects Versions: 1.1.0
           Reporter: Saisai Shao
           Assignee: Saisai Shao
        Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test Report(contd).pdf, Spark Shuffle Test Report.pdf

Spark has already integrated sort-based shuffle write, which greatly improves IO performance and reduces memory consumption when the number of reducers is very large. But the reducer side still adopts the hash-based shuffle reader implementation, which in some situations neglects the ordering of the map output data. Here we propose an MR-style, sort-merge shuffle reader for sort-based shuffle to further improve its performance. Work-in-progress code and a performance test report will be posted later, once some unit test bugs are fixed. Any comments would be greatly appreciated. Thanks a lot.
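Illustrative only (not the proposed SortShuffleReader code): the heart of an MR-style reduce side is a k-way merge of already-sorted map-output streams, e.g. with a priority queue over their current heads:
{code:borderStyle=solid}
import scala.collection.mutable

// Merge k sorted streams of (key, value) pairs into one sorted iterator.
// Each input iterator stands in for one sorted map-output stream.
def mergeSorted[K, V](streams: Seq[Iterator[(K, V)]])
                     (implicit ord: Ordering[K]): Iterator[(K, V)] = {
  // Min-heap ordered by each stream's head key (PriorityQueue is a max-heap,
  // hence the .reverse).
  val heap = mutable.PriorityQueue.empty[BufferedIterator[(K, V)]](
    Ordering.by[BufferedIterator[(K, V)], K](_.head._1).reverse)
  streams.map(_.buffered).filter(_.hasNext).foreach(heap.enqueue(_))

  new Iterator[(K, V)] {
    def hasNext: Boolean = heap.nonEmpty
    def next(): (K, V) = {
      val stream = heap.dequeue()
      val kv = stream.next()
      if (stream.hasNext) heap.enqueue(stream) // re-insert with its new head
      kv
    }
  }
}

// Example: merging three sorted streams.
val merged = mergeSorted(Seq(
  Iterator(1 -> "a", 4 -> "d"),
  Iterator(2 -> "b", 5 -> "e"),
  Iterator(3 -> "c", 6 -> "f")
))
println(merged.toList) // List((1,a), (2,b), (3,c), (4,d), (5,e), (6,f))
{code}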
[jira] [Updated] (SPARK-4094) checkpoint should still be available after rdd actions
[ https://issues.apache.org/jira/browse/SPARK-4094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Dai updated SPARK-4094:
-----------------------------
    Assignee: Zhang, Liye

checkpoint should still be available after rdd actions
------------------------------------------------------

                Key: SPARK-4094
                URL: https://issues.apache.org/jira/browse/SPARK-4094
            Project: Spark
         Issue Type: Improvement
         Components: Spark Core
   Affects Versions: 1.1.0
           Reporter: Zhang, Liye
           Assignee: Zhang, Liye

rdd.checkpoint() must be called before any actions on that rdd; if any other action came first, the checkpoint will never succeed. Take the following code as an example:
{code:borderStyle=solid}
val rdd = sc.makeRDD(...)
rdd.collect()
rdd.checkpoint()
rdd.count()
{code}
This rdd will never be checkpointed. But this does not happen with RDD cache: caching always takes effect before rdd actions, no matter whether there were any actions before cache(). So rdd.checkpoint() should have the same behavior as rdd.cache().
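A sketch of the behavior described above, assuming an existing SparkContext {{sc}} (the checkpoint directory is an illustrative path, and the second outcome is as claimed by this report):
{code:borderStyle=solid}
sc.setCheckpointDir("/tmp/spark-checkpoints") // illustrative path

val rdd = sc.parallelize(1 to 100)
rdd.checkpoint()              // marked before any action
rdd.count()                   // the first action materializes the checkpoint
println(rdd.isCheckpointed)   // true

val rdd2 = sc.parallelize(1 to 100)
rdd2.collect()                // an action runs first...
rdd2.checkpoint()             // ...then checkpoint is requested
rdd2.count()
println(rdd2.isCheckpointed)  // false per this report -- the behavior to fix
{code}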
[jira] [Updated] (SPARK-4078) New FsPermission instance w/o FsPermission.createImmutable in eventlog
[ https://issues.apache.org/jira/browse/SPARK-4078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Dai updated SPARK-4078:
-----------------------------
    Assignee: Jason Dai

New FsPermission instance w/o FsPermission.createImmutable in eventlog
----------------------------------------------------------------------

                Key: SPARK-4078
                URL: https://issues.apache.org/jira/browse/SPARK-4078
            Project: Spark
         Issue Type: Bug
         Components: Spark Core
   Affects Versions: 1.1.0
           Reporter: Jie Huang
           Assignee: Jason Dai

By default, Spark builds its package against Hadoop version 1.0.4. That version has an FsPermission bug (see HADOOP-7629, by Todd Lipcon), which has been fixed since version 1.1. By using the FsPermission.createImmutable() API, end users may see an RPC exception like the one below (if event logging over HDFS is turned on).
{quote}
Exception in thread "main" java.io.IOException: Call to sr484/10.1.2.84:54310 failed on local exception: java.io.EOFException
at org.apache.hadoop.ipc.Client.wrapException(Client.java:1150)
at org.apache.hadoop.ipc.Client.call(Client.java:1118)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:229)
at $Proxy6.setPermission(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:85)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:62)
at $Proxy6.setPermission(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.setPermission(DFSClient.java:1285)
at org.apache.hadoop.hdfs.DistributedFileSystem.setPermission(DistributedFileSystem.java:572)
at org.apache.spark.util.FileLogger.createLogDir(FileLogger.scala:138)
at org.apache.spark.util.FileLogger.start(FileLogger.scala:115)
at org.apache.spark.scheduler.EventLoggingListener.start(EventLoggingListener.scala:74)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:324)
{quote}
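A hedged sketch of the kind of change the title implies (constructing the FsPermission directly instead of via createImmutable); the octal value and the usage site are assumptions, not the actual patch:
{code:borderStyle=solid}
import org.apache.hadoop.fs.permission.FsPermission

// FsPermission.createImmutable() is broken over RPC in Hadoop 1.0.4
// (HADOOP-7629); constructing a plain FsPermission avoids the bug.
// The permission value below is illustrative only.
val perm = new FsPermission(Integer.parseInt("770", 8).toShort)
// fileSystem.setPermission(logDirPath, perm)   // assumed usage site
{code}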