Re: Is there any way to stop a jenkins build
Yeah, I thought that my quick fix might address the HiveThriftBinaryServerSuite hanging issue, but it looks like it didn't work, so I'll now have to do the more principled fix of using a UDF which sleeps for some amount of time.

In order to stop builds, you need to have a Jenkins account with the proper permissions. I believe that it's generally only Spark committers and AMPLab members who have accounts + Jenkins SSH access.

I've gone ahead and killed the build for you. It looks like someone had configured the pull request builder timeout to be 300 minutes (5 hours), but I think we should consider decreasing that to match the timeout used by the Spark full test jobs.

On Tue, Dec 29, 2015 at 10:04 AM, Herman van Hövell tot Westerflier <hvanhov...@questtec.nl> wrote:

> Thanks. I'll merge the most recent master...
>
> Still curious if we can stop a build.
>
> Kind regards,
>
> Herman van Hövell tot Westerflier
>
> 2015-12-29 18:59 GMT+01:00 Ted Yu:
>
>> HiveThriftBinaryServerSuite got stuck.
>>
>> I thought Josh had fixed this issue:
>>
>> [SPARK-11823][SQL] Fix flaky JDBC cancellation test in
>> HiveThriftBinaryServerSuite
>>
>> On Tue, Dec 29, 2015 at 9:56 AM, Herman van Hövell tot Westerflier
>> <hvanhov...@questtec.nl> wrote:
>>
>>> My AMPLAB jenkins build has been stuck for a few hours now:
>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48414/consoleFull
>>>
>>> Is there a way for me to stop the build?
>>>
>>> Kind regards,
>>>
>>> Herman van Hövell
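A minimal sketch of what such a sleeping UDF could look like (an illustration only, not the actual patch; the UDF name and durations are made up):

--
// Register a UDF that blocks for a while, so a JDBC query is guaranteed
// to still be running when the test issues statement.cancel().
sqlContext.udf.register("sleep", (ms: Long) => { Thread.sleep(ms); ms })

// The cancellation test can then run a query that calls the UDF and
// cancel it mid-flight, instead of relying on a query that merely
// happens to be slow:
//   statement.executeQuery("SELECT sleep(10000) FROM test_table")
//   ... statement.cancel() from another thread ...
--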
Re: Is there any way to stop a jenkins build
Thanks. I'll merge the most recent master...

Still curious if we can stop a build.

Kind regards,

Herman van Hövell tot Westerflier

2015-12-29 18:59 GMT+01:00 Ted Yu:

> HiveThriftBinaryServerSuite got stuck.
>
> I thought Josh had fixed this issue:
>
> [SPARK-11823][SQL] Fix flaky JDBC cancellation test in
> HiveThriftBinaryServerSuite
>
> On Tue, Dec 29, 2015 at 9:56 AM, Herman van Hövell tot Westerflier
> <hvanhov...@questtec.nl> wrote:
>
>> My AMPLAB jenkins build has been stuck for a few hours now:
>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48414/consoleFull
>>
>> Is there a way for me to stop the build?
>>
>> Kind regards,
>>
>> Herman van Hövell
Re: Spark streaming 1.6.0-RC4 NullPointerException using mapWithState
Could you create a JIRA? We can continue the discussion there. Thanks!

Best Regards,
Shixiong Zhu

2015-12-29 3:42 GMT-08:00 Jan Uyttenhove:

> Hi guys,
>
> I upgraded to the RC4 of Spark (streaming) 1.6.0 to (re)test the new
> mapWithState API, after previously reporting issue SPARK-11932
> (https://issues.apache.org/jira/browse/SPARK-11932).
>
> My Spark streaming job involves reading data from a Kafka topic (using
> KafkaUtils.createDirectStream), stateful processing (using checkpointing
> & mapWithState) & publishing the results back to Kafka.
>
> I'm now facing the NullPointerException below when restoring from a
> checkpoint in the following scenario:
> 1/ run the job (with local[2]), process data from Kafka while creating &
> keeping state
> 2/ stop the job
> 3/ generate some extra messages on the input Kafka topic
> 4/ start the job again (and restore offsets & state from the checkpoints)
>
> The problem is caused by (or at least related to) step 3, i.e. publishing
> data to the input topic while the job is stopped.
> The above scenario has been tested successfully when:
> - step 3 is excluded, so restoring state from a checkpoint is successful
> when no messages are added while the job is stopped
> - after step 2, the checkpoints are deleted
>
> Any clues? Am I doing something wrong here, or is there still a problem
> with the mapWithState impl?
>
> Thanx,
>
> Jan
>
> 15/12/29 11:56:12 ERROR executor.Executor: Exception in task 0.0 in stage 3.0 (TID 24)
> java.lang.NullPointerException
> at org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
> at org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
> at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
> at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
> at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
> at org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> at org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> at org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> 15/12/29 11:56:12 INFO storage.BlockManagerInfo: Added rdd_25_1 in memory on localhost:10003 (size: 1024.0 B, free: 511.1 MB)
> 15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 8 blocks
> 15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
> 15/12/29 11:56:12 INFO storage.MemoryStore: Block rdd_29_1 stored as values in memory (estimated size 1824.0 B, free 488.0 KB)
> 15/12/29 11:56:12 INFO storage.BlockManagerInfo: Added rdd_29_1 in memory on localhost:10003 (size: 1824.0 B, free: 511.1 MB)
> 15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 8 blocks
> 15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
> 15/12/29 11:56:12 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 (TID 24, localhost): java.lang.NullPointerException
Is there any way to stop a jenkins build
My AMPLAB jenkins build has been stuck for a few hours now:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48414/consoleFull

Is there a way for me to stop the build?

Kind regards,

Herman van Hövell
Re: Is there any way to stop a jenkins build
Hi Josh,

Your HiveThriftBinaryServerSuite fix wasn't in the build I was running (I forgot to merge the latest master). So it might actually work.

As for stopping the build, it is understandable that you cannot do that without the proper permissions. It would still be cool to be able to issue a 'stop build' command from GitHub, though.

Kind regards,

Herman

2015-12-29 19:19 GMT+01:00 Josh Rosen:

> Yeah, I thought that my quick fix might address the HiveThriftBinaryServerSuite
> hanging issue, but it looks like it didn't work, so I'll now have to do the more
> principled fix of using a UDF which sleeps for some amount of time.
>
> In order to stop builds, you need to have a Jenkins account with the proper
> permissions. I believe that it's generally only Spark committers and AMPLab
> members who have accounts + Jenkins SSH access.
>
> I've gone ahead and killed the build for you. It looks like someone had
> configured the pull request builder timeout to be 300 minutes (5 hours), but
> I think we should consider decreasing that to match the timeout used by the
> Spark full test jobs.
>
> On Tue, Dec 29, 2015 at 10:04 AM, Herman van Hövell tot Westerflier
> <hvanhov...@questtec.nl> wrote:
>
>> Thanks. I'll merge the most recent master...
>>
>> Still curious if we can stop a build.
>>
>> Kind regards,
>>
>> Herman van Hövell tot Westerflier
>>
>> 2015-12-29 18:59 GMT+01:00 Ted Yu:
>>
>>> HiveThriftBinaryServerSuite got stuck.
>>>
>>> I thought Josh had fixed this issue:
>>>
>>> [SPARK-11823][SQL] Fix flaky JDBC cancellation test in
>>> HiveThriftBinaryServerSuite
>>>
>>> On Tue, Dec 29, 2015 at 9:56 AM, Herman van Hövell tot Westerflier
>>> <hvanhov...@questtec.nl> wrote:
>>>
>>>> My AMPLAB jenkins build has been stuck for a few hours now:
>>>> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48414/consoleFull
>>>>
>>>> Is there a way for me to stop the build?
>>>>
>>>> Kind regards,
>>>>
>>>> Herman van Hövell
Re: RDD[Vector] Immutability issue
Hi salexln,

An RDD's immutability depends on the underlying structure. Consider the following example:

--
scala> val m = Array.fill(2, 2)(0)
m: Array[Array[Int]] = Array(Array(0, 0), Array(0, 0))

scala> val rdd = sc.parallelize(m)
rdd: org.apache.spark.rdd.RDD[Array[Int]] = ParallelCollectionRDD[1] at parallelize at <console>:23

scala> rdd.collect()
res6: Array[Array[Int]] = Array(Array(0, 0), Array(0, 0))

scala> m(0)(1) = 2

scala> rdd.collect()
res8: Array[Array[Int]] = Array(Array(0, 2), Array(0, 0))
--

You see that the variable rdd actually changes when its underlying array changes. Hopefully this helps you.

Best,
Ai

On Mon, Dec 28, 2015 at 12:36 PM, salexln wrote:

> Hi guys,
> I know that RDDs are immutable and therefore their value cannot be changed,
> but I see the following behaviour:
> I wrote an implementation of the FuzzyCMeans algorithm and now I'm testing
> it, so I run the following example:
>
> import org.apache.spark.mllib.clustering.FuzzyCMeans
> import org.apache.spark.mllib.linalg.Vectors
>
> val data = sc.textFile("/home/development/myPrjects/R/butterfly/butterfly.txt")
> val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
>> parsedData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = MapPartitionsRDD[2] at map at <console>:31
>
> val numClusters = 2
> val numIterations = 20
>
> parsedData.foreach{ point => println(point) }
>> [0.0,-8.0]
> [-3.0,-2.0]
> [-3.0,0.0]
> [-3.0,2.0]
> [-2.0,-1.0]
> [-2.0,0.0]
> [-2.0,1.0]
> [-1.0,0.0]
> [0.0,0.0]
> [1.0,0.0]
> [2.0,-1.0]
> [2.0,0.0]
> [2.0,1.0]
> [3.0,-2.0]
> [3.0,0.0]
> [3.0,2.0]
> [0.0,8.0]
>
> val clusters = FuzzyCMeans.train(parsedData, numClusters, numIterations)
> parsedData.foreach{ point => println(point) }
>>
> [0.0,-0.480185624595]
> [-0.1811743096972924,-0.12078287313152826]
> [-0.06638890786148487,0.0]
> [-0.04005925925925929,0.02670617283950619]
> [-0.12193263222069807,-0.060966316110349035]
> [-0.0512,0.0]
> [NaN,NaN]
> [-0.049382716049382706,0.0]
> [NaN,NaN]
> [0.006830134553650707,0.0]
> [0.05122,-0.02561]
> [0.04755220304297078,0.0]
> [0.06581619798335057,0.03290809899167529]
> [0.12010867103812725,-0.0800724473587515]
> [0.10946638900458144,0.0]
> [0.14814814814814817,0.09876543209876545]
> [0.0,0.49119985188436205]
>
> But how can it be that my method changes the immutable RDD?
>
> BTW, the signature of the train method is the following:
>
> train(data: RDD[Vector], clusters: Int, maxIterations: Int)
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Vector-Immutability-issue-tp15827.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: RDD[Vector] Immutability issue
Same thing. Say your underlying structure is like Array(ArrayBuffer(1, 2), ArrayBuffer(3, 4)). Then you can add/remove data in the ArrayBuffers, and the change will be reflected in the RDD.

On Tue, Dec 29, 2015 at 11:19 AM, salexln wrote:

> I see, so in order for the RDD to be completely immutable, its content
> should be immutable as well.
>
> And if the content is not immutable, we can change its content, but cannot
> add / remove data?
>
> --
> View this message in context:
> http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Vector-Immutability-issue-tp15827p15841.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

Best,
Ai
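A minimal sketch of that behaviour (illustrative only; run in a local-mode shell, where the driver's objects are shared with the tasks — on a real cluster the data would have been serialized out to the executors):

--
import scala.collection.mutable.ArrayBuffer

val data = Array(ArrayBuffer(1, 2), ArrayBuffer(3, 4))
val rdd = sc.parallelize(data)
rdd.collect()   // Array(ArrayBuffer(1, 2), ArrayBuffer(3, 4))

data(0) += 5    // mutate the underlying buffer behind the RDD's back
rdd.collect()   // Array(ArrayBuffer(1, 2, 5), ArrayBuffer(3, 4))
--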
Re: RDD[Vector] Immutability issue
You can, but you shouldn't. Using backdoors to mutate the data in an RDD is a good way to produce confusing and inconsistent results when, e.g., an RDD's lineage needs to be recomputed or a Task is resubmitted on fetch failure.

On Tue, Dec 29, 2015 at 11:24 AM, ai he wrote:

> Same thing.
>
> Say your underlying structure is like Array(ArrayBuffer(1, 2),
> ArrayBuffer(3, 4)).
>
> Then you can add/remove data in the ArrayBuffers, and the change will be
> reflected in the RDD.
>
> On Tue, Dec 29, 2015 at 11:19 AM, salexln wrote:
>
>> I see, so in order for the RDD to be completely immutable, its content
>> should be immutable as well.
>>
>> And if the content is not immutable, we can change its content, but
>> cannot add / remove data?
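A small sketch of the kind of inconsistency this can produce (illustrative only, local-mode shell):

--
import scala.collection.mutable.ArrayBuffer

val data = Array(ArrayBuffer(1, 2))
val sums = sc.parallelize(data).map(_.sum).cache()
sums.collect()   // Array(3) -- computed once and cached

data(0) += 10    // backdoor mutation of the source objects
sums.collect()   // still Array(3), served from the cache...

sums.unpersist()
sums.collect()   // ...but Array(13) once the lineage is recomputed
--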
Re: RDD[Vector] Immutability issue
An RDD is a collection of objects. If those objects are mutable and get changed, the change will be reflected in the RDD; for immutable objects it will not. Changing mutable objects that sit inside an RDD is not good practice. The RDD is immutable in the sense that any transformation on it will result in a new RDD object.

On Tue, Dec 29, 2015 at 2:50 PM, Mark Hamstra wrote:

> You can, but you shouldn't. Using backdoors to mutate the data in an RDD
> is a good way to produce confusing and inconsistent results when, e.g., an
> RDD's lineage needs to be recomputed or a Task is resubmitted on fetch
> failure.
>
> On Tue, Dec 29, 2015 at 11:24 AM, ai he wrote:
>
>> Same thing.
>>
>> Say your underlying structure is like Array(ArrayBuffer(1, 2),
>> ArrayBuffer(3, 4)).
>>
>> Then you can add/remove data in the ArrayBuffers, and the change will be
>> reflected in the RDD.
>>
>> On Tue, Dec 29, 2015 at 11:19 AM, salexln wrote:
>>
>>> I see, so in order for the RDD to be completely immutable, its content
>>> should be immutable as well.
>>>
>>> And if the content is not immutable, we can change its content, but
>>> cannot add / remove data?
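A one-line illustration of that last point (sketch, assuming an RDD[Int] named rdd):

--
val doubled = rdd.map(_ * 2)   // returns a brand-new RDD; `rdd` itself is untouched
--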
Re: Spark streaming 1.6.0-RC4 NullPointerException using mapWithState
Hi Jan, could you post your code? I could not reproduce this issue in my environment.

Best Regards,
Shixiong Zhu

2015-12-29 10:22 GMT-08:00 Shixiong Zhu:

> Could you create a JIRA? We can continue the discussion there. Thanks!
>
> Best Regards,
> Shixiong Zhu
>
> 2015-12-29 3:42 GMT-08:00 Jan Uyttenhove:
>
>> Hi guys,
>>
>> I upgraded to the RC4 of Spark (streaming) 1.6.0 to (re)test the new
>> mapWithState API, after previously reporting issue SPARK-11932
>> (https://issues.apache.org/jira/browse/SPARK-11932).
>>
>> My Spark streaming job involves reading data from a Kafka topic (using
>> KafkaUtils.createDirectStream), stateful processing (using checkpointing
>> & mapWithState) & publishing the results back to Kafka.
>>
>> I'm now facing the NullPointerException below when restoring from a
>> checkpoint in the following scenario:
>> 1/ run the job (with local[2]), process data from Kafka while creating &
>> keeping state
>> 2/ stop the job
>> 3/ generate some extra messages on the input Kafka topic
>> 4/ start the job again (and restore offsets & state from the checkpoints)
>>
>> The problem is caused by (or at least related to) step 3, i.e. publishing
>> data to the input topic while the job is stopped.
>> The above scenario has been tested successfully when:
>> - step 3 is excluded, so restoring state from a checkpoint is successful
>> when no messages are added while the job is stopped
>> - after step 2, the checkpoints are deleted
>>
>> Any clues? Am I doing something wrong here, or is there still a problem
>> with the mapWithState impl?
>>
>> Thanx,
>>
>> Jan
>>
>> 15/12/29 11:56:12 ERROR executor.Executor: Exception in task 0.0 in stage 3.0 (TID 24)
>> java.lang.NullPointerException
>> at org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
>> at org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
>> at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
>> at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
>> at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
>> at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
>> at org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>> at org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>> at org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
>> at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
>> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> at java.lang.Thread.run(Thread.java:745)
>> 15/12/29 11:56:12 INFO storage.BlockManagerInfo: Added rdd_25_1 in memory on localhost:10003 (size: 1024.0 B, free: 511.1 MB)
>> 15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 8 blocks
>> 15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
>> 15/12/29 11:56:12 INFO storage.MemoryStore: Block rdd_29_1 stored as values in memory (estimated size 1824.0 B, free 488.0 KB)
>> 15/12/29 11:56:12 INFO storage.BlockManagerInfo: Added rdd_29_1 in memory on localhost:10003 (size: 1824.0 B, free: 511.1 MB)
>> 15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 8 blocks
Re: running lda in spark throws exception
Hi Li,

I'm wondering if you're running into the same bug reported here:
https://issues.apache.org/jira/browse/SPARK-12488

I haven't figured out yet what is causing it. Do you have a small corpus which reproduces this error, and which you can share on the JIRA? If so, that would help a lot in debugging this failure.

Thanks!
Joseph

On Sun, Dec 27, 2015 at 7:26 PM, Li Li wrote:

> I ran my LDA example on a YARN 2.6.2 cluster with Spark 1.5.2.
> It throws an exception in the line: Matrix topics = ldaModel.topicsMatrix();
> But in the YARN job history UI, it's successful. What's wrong with it?
> I submit the job with
>
> ./bin/spark-submit --class Myclass \
>   --master yarn-client \
>   --num-executors 2 \
>   --driver-memory 4g \
>   --executor-memory 4g \
>   --executor-cores 1 \
>
> My code:
>
> corpus.cache();
>
> // Cluster the documents into three topics using LDA
> DistributedLDAModel ldaModel = (DistributedLDAModel) new
>     LDA().setOptimizer("em").setMaxIterations(iterNumber).setK(topicNumber).run(corpus);
>
> // Output topics. Each is a distribution over words (matching word count vectors)
> System.out.println("Learned topics (as distributions over vocab of "
>     + ldaModel.vocabSize() + " words):");
>
> // Line 81, exception here:
> Matrix topics = ldaModel.topicsMatrix();
> for (int topic = 0; topic < topicNumber; topic++) {
>   System.out.print("Topic " + topic + ":");
>   for (int word = 0; word < ldaModel.vocabSize(); word++) {
>     System.out.print(" " + topics.apply(word, topic));
>   }
>   System.out.println();
> }
>
> ldaModel.save(sc.sc(), modelPath);
>
> Exception in thread "main" java.lang.IndexOutOfBoundsException: (1025,0) not in [-58,58) x [-100,100)
> at breeze.linalg.DenseMatrix$mcD$sp.update$mcD$sp(DenseMatrix.scala:112)
> at org.apache.spark.mllib.clustering.DistributedLDAModel$$anonfun$topicsMatrix$1.apply(LDAModel.scala:534)
> at org.apache.spark.mllib.clustering.DistributedLDAModel$$anonfun$topicsMatrix$1.apply(LDAModel.scala:531)
> at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> at org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix$lzycompute(LDAModel.scala:531)
> at org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix(LDAModel.scala:523)
> at com.mobvoi.knowledgegraph.textmining.lda.ReviewLDA.main(ReviewLDA.java:81)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:606)
> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
> 15/12/23 00:01:16 INFO spark.SparkContext: Invoking stop() from shutdown hook
IndentationCheck of checkstyle
Hi,
I noticed that there are a lot of checkstyle warnings of the following form:

<error ... source="com.puppycrawl.tools.checkstyle.checks.indentation.IndentationCheck"/>

To my knowledge, we use two spaces per indentation level. Not sure why all of a sudden we have so many IndentationCheck warnings:

grep 'hild have incorrect indentati' trunkCheckstyle.xml | wc
    3133   52645  678294

If there is no objection, I would create a JIRA and relax the IndentationCheck warning.

Cheers
problem with reading source code-pull out nondeterministic expresssions
Hi fellas,

I am new to Spark and I have a newbie question. I am currently reading the source code of the Spark SQL Catalyst analyzer, and I don't quite understand the partial function in PullOutNondeterministic. What does it mean by "pull out"? Why do we have to do the "pulling out"? I would really appreciate it if somebody could explain it to me. Thanks.
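A rough sketch of what the rule does, as I understand it (illustrative, not authoritative): nondeterministic expressions such as rand() are only expected to appear in Project or Filter operators, so the rule moves ("pulls out") any such expression found elsewhere — e.g. in an Aggregate's grouping key — into a Project beneath that operator. The expression is then evaluated exactly once per input row, and the parent operator just references the resulting attribute, so every place that refers to it sees the same value. The attribute name below is taken from my reading of the rule's implementation:

--
Before:  Aggregate [rand() AS r], ...
           Relation t

After:   Aggregate [_nondeterministic AS r], ...
           Project [rand() AS _nondeterministic, ...]
             Relation t
--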
Re: IndentationCheck of checkstyle
OK, to close the loop - this thread has nothing to do with Spark?

On Tue, Dec 29, 2015 at 9:55 PM, Ted Yu wrote:

> Oops, wrong list :-)
>
> On Dec 29, 2015, at 9:48 PM, Reynold Xin wrote:
>
> +Herman
>
> Is this coming from the newly merged Hive parser?
>
> On Tue, Dec 29, 2015 at 9:46 PM, Allen Zhang wrote:
>
>> Format issue, I think; go ahead.
>>
>> At 2015-12-30 13:36:05, "Ted Yu" wrote:
>>
>> Hi,
>> I noticed that there are a lot of checkstyle warnings of the following form:
>>
>> <error ... source="com.puppycrawl.tools.checkstyle.checks.indentation.IndentationCheck"/>
>>
>> To my knowledge, we use two spaces per indentation level. Not sure why all
>> of a sudden we have so many IndentationCheck warnings:
>>
>> grep 'hild have incorrect indentati' trunkCheckstyle.xml | wc
>>     3133   52645  678294
>>
>> If there is no objection, I would create a JIRA and relax the
>> IndentationCheck warning.
>>
>> Cheers
Re: IndentationCheck of checkstyle
Oops, wrong list :-)

> On Dec 29, 2015, at 9:48 PM, Reynold Xin wrote:
>
> +Herman
>
> Is this coming from the newly merged Hive parser?
>
>> On Tue, Dec 29, 2015 at 9:46 PM, Allen Zhang wrote:
>>
>> Format issue, I think; go ahead.
>>
>> At 2015-12-30 13:36:05, "Ted Yu" wrote:
>>
>> Hi,
>> I noticed that there are a lot of checkstyle warnings of the following form:
>>
>> <error ... source="com.puppycrawl.tools.checkstyle.checks.indentation.IndentationCheck"/>
>>
>> To my knowledge, we use two spaces per indentation level. Not sure why all
>> of a sudden we have so many IndentationCheck warnings:
>>
>> grep 'hild have incorrect indentati' trunkCheckstyle.xml | wc
>>     3133   52645  678294
>>
>> If there is no objection, I would create a JIRA and relax the
>> IndentationCheck warning.
>>
>> Cheers
Re: running lda in spark throws exception
I will use a portion of the data and try. Will the HDFS block affect Spark? (If so, it's hard to reproduce.)

On Wed, Dec 30, 2015 at 3:22 AM, Joseph Bradley wrote:

> Hi Li,
>
> I'm wondering if you're running into the same bug reported here:
> https://issues.apache.org/jira/browse/SPARK-12488
>
> I haven't figured out yet what is causing it. Do you have a small corpus
> which reproduces this error, and which you can share on the JIRA? If so,
> that would help a lot in debugging this failure.
>
> Thanks!
> Joseph
>
> On Sun, Dec 27, 2015 at 7:26 PM, Li Li wrote:
>
>> I ran my LDA example on a YARN 2.6.2 cluster with Spark 1.5.2.
>> It throws an exception in the line: Matrix topics = ldaModel.topicsMatrix();
>> But in the YARN job history UI, it's successful. What's wrong with it?
>> I submit the job with
>>
>> ./bin/spark-submit --class Myclass \
>>   --master yarn-client \
>>   --num-executors 2 \
>>   --driver-memory 4g \
>>   --executor-memory 4g \
>>   --executor-cores 1 \
>>
>> My code:
>>
>> corpus.cache();
>>
>> // Cluster the documents into three topics using LDA
>> DistributedLDAModel ldaModel = (DistributedLDAModel) new
>>     LDA().setOptimizer("em").setMaxIterations(iterNumber).setK(topicNumber).run(corpus);
>>
>> // Output topics. Each is a distribution over words (matching word count vectors)
>> System.out.println("Learned topics (as distributions over vocab of "
>>     + ldaModel.vocabSize() + " words):");
>>
>> // Line 81, exception here:
>> Matrix topics = ldaModel.topicsMatrix();
>> for (int topic = 0; topic < topicNumber; topic++) {
>>   System.out.print("Topic " + topic + ":");
>>   for (int word = 0; word < ldaModel.vocabSize(); word++) {
>>     System.out.print(" " + topics.apply(word, topic));
>>   }
>>   System.out.println();
>> }
>>
>> ldaModel.save(sc.sc(), modelPath);
>>
>> Exception in thread "main" java.lang.IndexOutOfBoundsException: (1025,0) not in [-58,58) x [-100,100)
>> at breeze.linalg.DenseMatrix$mcD$sp.update$mcD$sp(DenseMatrix.scala:112)
>> at org.apache.spark.mllib.clustering.DistributedLDAModel$$anonfun$topicsMatrix$1.apply(LDAModel.scala:534)
>> at org.apache.spark.mllib.clustering.DistributedLDAModel$$anonfun$topicsMatrix$1.apply(LDAModel.scala:531)
>> at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>> at org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix$lzycompute(LDAModel.scala:531)
>> at org.apache.spark.mllib.clustering.DistributedLDAModel.topicsMatrix(LDAModel.scala:523)
>> at com.mobvoi.knowledgegraph.textmining.lda.ReviewLDA.main(ReviewLDA.java:81)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:606)
>> at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>> 15/12/23 00:01:16 INFO spark.SparkContext: Invoking stop() from shutdown hook
Partitioning of RDD across worker machines
Hi,

Suppose I have a file locally on my master machine, and the same file is also present at the same path on all the worker machines, say /home/user_name/Desktop. I wanted to know: when we partition the data using sc.parallelize, does Spark actually broadcast parts of the RDD to all the worker machines, or does each worker read the corresponding segment locally from its own memory? How do I avoid movement of this data? Will it help if I store the file in HDFS?

Thanks and Regards,
Disha
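For context, a sketch of the two loading paths being contrasted (illustrative only; the paths and names are made up):

--
// sc.parallelize distributes a driver-side collection: the data starts
// on the driver and is shipped out to the executors with the tasks.
val localLines = scala.io.Source.fromFile("/home/user_name/Desktop/data.txt").getLines().toSeq
val fromDriver = sc.parallelize(localLines, 8)

// sc.textFile on HDFS reads splits where they live: each partition maps
// to an input split, and tasks are scheduled for data locality.
val fromHdfs = sc.textFile("hdfs:///user/user_name/data.txt")
--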
Spark streaming 1.6.0-RC4 NullPointerException using mapWithState
Hi guys,

I upgraded to the RC4 of Spark (streaming) 1.6.0 to (re)test the new mapWithState API, after previously reporting issue SPARK-11932 (https://issues.apache.org/jira/browse/SPARK-11932).

My Spark streaming job involves reading data from a Kafka topic (using KafkaUtils.createDirectStream), stateful processing (using checkpointing & mapWithState) & publishing the results back to Kafka.

I'm now facing the NullPointerException below when restoring from a checkpoint in the following scenario:
1/ run the job (with local[2]), process data from Kafka while creating & keeping state
2/ stop the job
3/ generate some extra messages on the input Kafka topic
4/ start the job again (and restore offsets & state from the checkpoints)

The problem is caused by (or at least related to) step 3, i.e. publishing data to the input topic while the job is stopped.
The above scenario has been tested successfully when:
- step 3 is excluded, so restoring state from a checkpoint is successful when no messages are added while the job is stopped
- after step 2, the checkpoints are deleted

Any clues? Am I doing something wrong here, or is there still a problem with the mapWithState impl?

Thanx,

Jan

15/12/29 11:56:12 ERROR executor.Executor: Exception in task 0.0 in stage 3.0 (TID 24)
java.lang.NullPointerException
at org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
at org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$.updateRecordWithData(MapWithStateRDD.scala:55)
at org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:154)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.streaming.rdd.MapWithStateRDD.compute(MapWithStateRDD.scala:148)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/12/29 11:56:12 INFO storage.BlockManagerInfo: Added rdd_25_1 in memory on localhost:10003 (size: 1024.0 B, free: 511.1 MB)
15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 8 blocks
15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/12/29 11:56:12 INFO storage.MemoryStore: Block rdd_29_1 stored as values in memory (estimated size 1824.0 B, free 488.0 KB)
15/12/29 11:56:12 INFO storage.BlockManagerInfo: Added rdd_29_1 in memory on localhost:10003 (size: 1824.0 B, free: 511.1 MB)
15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Getting 0 non-empty blocks out of 8 blocks
15/12/29 11:56:12 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 0 ms
15/12/29 11:56:12 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 3.0 (TID 24, localhost): java.lang.NullPointerException
at org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:103)
at org.apache.spark.streaming.util.OpenHashMapBasedStateMap.get(StateMap.scala:111)
at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:56)
at org.apache.spark.streaming.rdd.MapWithStateRDDRecord$$anonfun$updateRecordWithData$1.apply(MapWithStateRDD.scala:55)
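For reference, a minimal sketch of the shape of such a job under the 1.6.0 APIs (an illustration only — the topic name, paths, and state function are made up, not Jan's actual code):

--
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val checkpointDir = "/tmp/mapwithstate-checkpoint"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setMaster("local[2]").setAppName("mapWithState-repro")
  val ssc = new StreamingContext(conf, Seconds(5))
  ssc.checkpoint(checkpointDir)

  val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("input-topic"))

  // Keep a running count per key in state; emit (key, count) downstream.
  val spec = StateSpec.function((key: String, value: Option[String], state: State[Long]) => {
    val count = state.getOption.getOrElse(0L) + 1
    state.update(count)
    (key, count)
  })
  stream.mapWithState(spec).print()
  ssc
}

// Restore from the checkpoint if one exists; otherwise build a fresh context.
// Steps 2-4 of the scenario amount to stopping this process, producing to
// "input-topic", and starting it again so it recovers via getOrCreate.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
--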