Spark checkpoint problem
I am testing checkpointing to understand how it works. My code is as follows:

    scala> val data = sc.parallelize(List("a", "b", "c"))
    data: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:15

    scala> sc.setCheckpointDir("/tmp/checkpoint")
    15/11/25 18:09:07 WARN spark.SparkContext: Checkpoint directory must be non-local if Spark is running on a cluster: /tmp/checkpoint1

    scala> data.checkpoint

    scala> val temp = data.map(item => (item, 1))
    temp: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:17

    scala> temp.checkpoint

    scala> temp.count

But I found that only the temp RDD is checkpointed in the /tmp/checkpoint directory; the data RDD is not checkpointed! I found the doCheckpoint function in the org.apache.spark.rdd.RDD class:

    private[spark] def doCheckpoint(): Unit = {
      RDDOperationScope.withScope(sc, "checkpoint", allowNesting = false, ignoreParent = true) {
        if (!doCheckpointCalled) {
          doCheckpointCalled = true
          if (checkpointData.isDefined) {
            checkpointData.get.checkpoint()
          } else {
            dependencies.foreach(_.rdd.doCheckpoint())
          }
        }
      }
    }

From the code above, only the last RDD (in my case, temp) will be checkpointed. My question: is this deliberately designed, or is it a bug? Thank you.
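A note on the behavior, based on the doCheckpoint code quoted above (my reading, not an authoritative answer): doCheckpoint is invoked on the final RDD of a job when an action runs, and because temp already has checkpointData defined, the call never recurses into its parent, so data is only written out if it is itself the last RDD of some job. A minimal sketch, under that assumption, that gets both RDDs checkpointed by running an action directly on data as well:

    // Assumption: same spark-shell session as above, after sc.setCheckpointDir(...).
    val data = sc.parallelize(List("a", "b", "c"))
    data.checkpoint()                       // mark the parent RDD for checkpointing

    data.count()                            // run a job whose final RDD is `data`,
                                            // so its checkpoint files get written

    val temp = data.map(item => (item, 1))
    temp.checkpoint()                       // mark the derived RDD as well
    temp.count()                            // this job writes the checkpoint for `temp`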
Re:RE: Spark checkpoint problem
Spark 1.5.2.

On 2015-11-26 13:19:39, "张志强(旺轩)" <zzq98...@alibaba-inc.com> wrote:

What's your Spark version?

From: wyphao.2007 [mailto:wyphao.2...@163.com]
Sent: 2015-11-26 10:04
To: user
Cc: dev@spark.apache.org
Subject: Spark checkpoint problem
Re:Re:Driver memory leak?
No, I am not collecting the result to the driver; I simply send the result to Kafka. BTW, the image addresses are:
https://cloud.githubusercontent.com/assets/5170878/7389463/ac03bf34-eea0-11e4-9e6b-1d2fba170c1c.png
and
https://cloud.githubusercontent.com/assets/5170878/7389480/c629d236-eea0-11e4-983a-dc5aa97c2554.png

At 2015-04-29 18:48:33, zhangxiongfei <zhangxiongfei0...@163.com> wrote:

The amount of memory that the driver consumes depends on your program logic. Did you try to collect the result of the Spark job?

At 2015-04-29 18:42:04, wyphao.2007 <wyphao.2...@163.com> wrote:

Hi, dear developers,

I am using Spark Streaming to read data from Kafka. The program had already run for about 120 hours, but today it failed because the driver hit an OOM, as follows:

    Container [pid=49133,containerID=container_1429773909253_0050_02_01] is running beyond physical memory limits. Current usage: 2.5 GB of 2.5 GB physical memory used; 3.2 GB of 50 GB virtual memory used. Killing container.

I set --driver-memory to 2g. In my mind, the driver is responsible for job scheduling and job monitoring (please correct me if I'm wrong), so why is it using so much memory? I used jmap to monitor another program (already running for about 48 hours):

    sudo /home/q/java7/jdk1.7.0_45/bin/jmap -histo:live 31256

The result (see the first image linked above) shows the java.util.HashMap$Entry and java.lang.Long objects using about 600 MB of memory! I also used jmap to monitor another program (running for about 1 hour), with the result in the second image; there the java.util.HashMap$Entry and java.lang.Long objects do not use so much memory. But I found that, as time goes by, the java.util.HashMap$Entry and java.lang.Long objects occupy more and more memory. Is this a memory leak in the driver, or is there another reason?

Thanks
Best Regards
Re:Re: java.lang.StackOverflowError when recovery from checkpoint in Streaming
Hi Akhil Das,

Thank you for your reply. It is very similar to my problem; I will follow that issue.

Thanks
Best Regards

At 2015-04-28 18:08:32, Akhil Das <ak...@sigmoidanalytics.com> wrote:

There's a similar issue reported over here: https://issues.apache.org/jira/browse/SPARK-6847

Thanks
Best Regards
java.lang.StackOverflowError when recovery from checkpoint in Streaming
Hi everyone,

I am using

    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topicsSet)

to read data from Kafka (about 1k records/second) and store the data in windows. The code snippet is as follows:

    val windowedStreamChannel =
      streamChannel.combineByKey[TreeSet[Obj]](
          TreeSet[Obj](_), _ += _, _ ++= _, new HashPartitioner(numPartition))
        .reduceByKeyAndWindow(
          (x: TreeSet[Obj], y: TreeSet[Obj]) => x ++= y,
          (x: TreeSet[Obj], y: TreeSet[Obj]) => x --= y,
          Minutes(60), Seconds(2), numPartition,
          (item: (String, TreeSet[Obj])) => item._2.size != 0)

After the application had run for an hour, I killed it and restarted it from the checkpoint directory, but I encountered an exception:

    2015-04-27 17:52:40,955 INFO [Driver] - Slicing from 1430126222000 ms to 1430126222000 ms (aligned to 1430126222000 ms and 1430126222000 ms)
    2015-04-27 17:52:40,958 ERROR [Driver] - User class threw exception: null
    java.lang.StackOverflowError
        at java.io.UnixFileSystem.getBooleanAttributes0(Native Method)
        at java.io.UnixFileSystem.getBooleanAttributes(UnixFileSystem.java:242)
        at java.io.File.exists(File.java:813)
        at sun.misc.URLClassPath$FileLoader.getResource(URLClassPath.java:1080)
        at sun.misc.URLClassPath.getResource(URLClassPath.java:199)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:358)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:190)
        at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
        at org.apache.spark.SparkContext.clean(SparkContext.scala:1623)
        at org.apache.spark.rdd.RDD.filter(RDD.scala:303)
        at org.apache.spark.streaming.dstream.FilteredDStream$$anonfun$compute$1.apply(FilteredDStream.scala:35)
        at org.apache.spark.streaming.dstream.FilteredDStream$$anonfun$compute$1.apply(FilteredDStream.scala:35)
        at scala.Option.map(Option.scala:145)
        at org.apache.spark.streaming.dstream.FilteredDStream.compute(FilteredDStream.scala:35)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
        at scala.Option.orElse(Option.scala:257)
        at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
        at org.apache.spark.streaming.dstream.FlatMappedDStream.compute(FlatMappedDStream.scala:35)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
        at scala.Option.orElse(Option.scala:257)
        at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
        at org.apache.spark.streaming.dstream.FilteredDStream.compute(FilteredDStream.scala:35)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:287)
        at scala.Option.orElse(Option.scala:257)
        at org.apache.spark.streaming.dstream.DStream.getOrCompute(DStream.scala:284)
        at org.apache.spark.streaming.dstream.ShuffledDStream.compute(ShuffledDStream.scala:41)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
        at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
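For readers trying to reproduce the recovery path: the snippet above omits how the checkpoint is set up. Below is a minimal sketch of the standard getOrCreate recovery pattern; names such as checkpointDir, createContext, the broker address, and the topic are mine, and this only illustrates the setup, not a fix for the StackOverflowError (which the reply above links to SPARK-6847):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val checkpointDir = "hdfs:///tmp/streaming-checkpoint"   // hypothetical path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("windowed-kafka") // hypothetical app name
      val ssc = new StreamingContext(conf, Seconds(2))
      ssc.checkpoint(checkpointDir)                           // required for window/state operations

      val kafkaParams = Map("metadata.broker.list" -> "broker:9092") // hypothetical broker
      val topicsSet = Set("events")                                  // hypothetical topic
      val streamChannel = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, topicsSet)
      // ... the combineByKey / reduceByKeyAndWindow pipeline from the question goes here ...
      ssc
    }

    // On a clean start this builds a new context; after a kill/restart it
    // rebuilds the context (and the DStream lineage) from the checkpoint.
    val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
    ssc.start()
    ssc.awaitTermination()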
Question about recovery from checkpoint exception[SPARK-6892]
Hi,

When I recover from a checkpoint in yarn-cluster mode using Spark Streaming, I found that Spark reuses the previous application id (in my case application_1428664056212_0016) and then fails to write the Spark event log. My current application id is application_1428664056212_0017, so writing the event log fails with the following stack trace:

    15/04/14 10:14:01 WARN util.ShutdownHookManager: ShutdownHook '$anon$3' failed, java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
    java.io.IOException: Target log file already exists (hdfs://mycluster/spark-logs/eventLog/application_1428664056212_0016)
        at org.apache.spark.scheduler.EventLoggingListener.stop(EventLoggingListener.scala:201)
        at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
        at org.apache.spark.SparkContext$$anonfun$stop$4.apply(SparkContext.scala:1388)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.SparkContext.stop(SparkContext.scala:1388)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:107)
        at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)

Can someone help me? The issue is SPARK-6892.

Thanks
Re:Re: Question about recovery from checkpoint exception[SPARK-6892]
Hi Sean Owen,

Thank you for your attention. I know about spark.hadoop.validateOutputSpecs. When I restart the job, the application id is application_1428664056212_0017 and it recovers from the checkpoint, but it writes the event log into the application_1428664056212_0016 directory. I think it should write to application_1428664056212_0017, not application_1428664056212_0016.

At 2015-04-20 11:46:12, Sean Owen <so...@cloudera.com> wrote:

This is why spark.hadoop.validateOutputSpecs exists, really: https://spark.apache.org/docs/latest/configuration.html
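For readers hitting the same "Target log file already exists" error: one possible mitigation (an assumption on my part, not a confirmed fix for SPARK-6892) is to let the event logging listener overwrite a stale log file via spark.eventLog.overwrite. A minimal configuration sketch:

    import org.apache.spark.SparkConf

    // Sketch only: allow the EventLoggingListener to overwrite an event log
    // left behind by the previous attempt. This works around the IOException
    // but does not address the application-id reuse described in SPARK-6892.
    val conf = new SparkConf()
      .setAppName("streaming-recovery")        // hypothetical app name
      .set("spark.eventLog.enabled", "true")
      .set("spark.eventLog.dir", "hdfs://mycluster/spark-logs/eventLog")
      .set("spark.eventLog.overwrite", "true") // assumption: clobbering the stale log is acceptable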
How to get removed RDD from windows?
I want to get the removed RDDs from a window, i.e. the old RDDs that slide out of the current window:

    //  _____________________________
    // |  previous window   _________|___________________
    // |___________________|       current window        |  --> Time
    //                     |_____________________________|
    //
    // |________ _________|          |________ _________|
    //          |                             |
    //          V                             V
    //       old RDDs                     new RDDs
    //

I found that the slice function in the DStream class can return the RDDs generated between fromTime and toTime. But when I use the function as follows:

    val now = System.currentTimeMillis()
    result.slice(new Time(now - 30 * 1000),
                 new Time(now - 30 * 1000 + result.slideDuration.milliseconds))
      .foreach(item => println("xxx" + item))
    ssc.start()

(30 seconds is the window's duration.) I got a "zeroTime has not been initialized" exception. Can anyone help me? Thanks!
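A note on the exception: slice needs the DStream's zeroTime to be initialized, which only happens once the streaming context has started, so calling it before ssc.start() fails. A minimal sketch of my own workaround idea (not an official "removed RDDs" API) that defers the slice call until batches are being generated, using the batch time passed to foreachRDD; `result` and the 30-second window come from the question, while windowDuration and batchTime are names I introduce:

    import org.apache.spark.streaming.{Seconds, Time}

    val windowDuration = Seconds(30)

    result.foreachRDD { (rdd, batchTime: Time) =>
      // Inside foreachRDD the context has started, so zeroTime is set and
      // slice() can be called on the driver. This asks for the slide
      // interval that ended exactly one window length before this batch,
      // i.e. the RDDs that have just fallen out of the window.
      val oldRdds = result.slice(
        batchTime - windowDuration,
        batchTime - windowDuration + result.slideDuration)
      oldRdds.foreach(old => println("removed RDD: " + old.id))
    }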
Use mvn to build Spark 1.2.0 failed
Hi all,

Today I downloaded the Spark source from the http://spark.apache.org/downloads.html page, and I used

    ./make-distribution.sh --tgz -Phadoop-2.2 -Pyarn -DskipTests -Dhadoop.version=2.2.0 -Phive

to build the release, but I encountered the following failure:

    [INFO] --- build-helper-maven-plugin:1.8:add-source (add-scala-sources) @ spark-parent ---
    [INFO] Source directory: /home/q/spark/spark-1.2.0/src/main/scala added.
    [INFO]
    [INFO] --- maven-remote-resources-plugin:1.5:process (default) @ spark-parent ---
    [INFO]
    [INFO] Reactor Summary:
    [INFO]
    [INFO] Spark Project Parent POM .......................... FAILURE [1.015s]
    [INFO] Spark Project Networking .......................... SKIPPED
    [INFO] Spark Project Shuffle Streaming Service ........... SKIPPED
    [INFO] Spark Project Core ................................ SKIPPED
    [INFO] Spark Project Bagel ............................... SKIPPED
    [INFO] Spark Project GraphX .............................. SKIPPED
    [INFO] Spark Project Streaming ........................... SKIPPED
    [INFO] Spark Project Catalyst ............................ SKIPPED
    [INFO] Spark Project SQL ................................. SKIPPED
    [INFO] Spark Project ML Library .......................... SKIPPED
    [INFO] Spark Project Tools ............................... SKIPPED
    [INFO] Spark Project Hive ................................ SKIPPED
    [INFO] Spark Project REPL ................................ SKIPPED
    [INFO] Spark Project YARN Parent POM ..................... SKIPPED
    [INFO] Spark Project YARN Stable API ..................... SKIPPED
    [INFO] Spark Project Assembly ............................ SKIPPED
    [INFO] Spark Project External Twitter .................... SKIPPED
    [INFO] Spark Project External Flume Sink ................. SKIPPED
    [INFO] Spark Project External Flume ...................... SKIPPED
    [INFO] Spark Project External MQTT ....................... SKIPPED
    [INFO] Spark Project External ZeroMQ ..................... SKIPPED
    [INFO] Spark Project External Kafka ...................... SKIPPED
    [INFO] Spark Project Examples ............................ SKIPPED
    [INFO] Spark Project YARN Shuffle Service ................ SKIPPED
    [INFO]
    [INFO] BUILD FAILURE
    [INFO]
    [INFO] Total time: 1.644s
    [INFO] Finished at: Mon Dec 22 10:56:35 CST 2014
    [INFO] Final Memory: 21M/481M
    [INFO]
    [ERROR] Failed to execute goal org.apache.maven.plugins:maven-remote-resources-plugin:1.5:process (default) on project spark-parent: Error finding remote resources manifests: /home/q/spark/spark-1.2.0/target/maven-shared-archive-resources/META-INF/NOTICE (No such file or directory) -> [Help 1]
    [ERROR]
    [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
    [ERROR] Re-run Maven using the -X switch to enable full debug logging.
    [ERROR]
    [ERROR] For more information about the errors and possible solutions, please read the following articles:
    [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException

But the NOTICE file is present in the downloaded Spark release:

    [wyp@spark /home/q/spark/spark-1.2.0]$ ll
    total 248
    drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 assembly
    drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 bagel
    drwxrwxr-x 2 1000 1000  4096 Dec 10 18:02 bin
    drwxrwxr-x 2 1000 1000  4096 Dec 10 18:02 conf
    -rw-rw-r-- 1 1000 1000   663 Dec 10 18:02 CONTRIBUTING.md
    drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 core
    drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 data
    drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 dev
    drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 docker
    drwxrwxr-x 7 1000 1000  4096 Dec 10 18:02 docs
    drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 ec2
    drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 examples
    drwxrwxr-x 8 1000 1000  4096 Dec 10 18:02 external
    drwxrwxr-x 5 1000 1000  4096 Dec 10 18:02 extras
    drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 graphx
    -rw-rw-r-- 1 1000 1000 45242 Dec 10 18:02 LICENSE
    -rwxrwxr-x 1 1000 1000  7941 Dec 10 18:02 make-distribution.sh
    drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 mllib
    drwxrwxr-x 5 1000 1000  4096 Dec 10 18:02 network
    -rw-rw-r-- 1 1000 1000 22559 Dec 10 18:02 NOTICE
    -rw-rw-r-- 1 1000 1000 49002 Dec 10 18:02 pom.xml
    drwxrwxr-x 4 1000 1000  4096 Dec 10 18:02 project
    drwxrwxr-x 6 1000 1000  4096 Dec 10 18:02 python
    -rw-rw-r-- 1 1000 1000  3645 Dec 10 18:02 README.md
    drwxrwxr-x 5 1000 1000  4096 Dec 10 18:02 repl
    drwxrwxr-x 2 1000 1000  4096 Dec 10 18:02 sbin
    drwxrwxr-x 2 1000 1000  4096 Dec 10 18:02 sbt
    -rw-rw-r-- 1 1000 1000  7804 Dec 10 18:02 scalastyle-config.xml
    drwxrwxr-x 6 1000 1000  4096 Dec 10 18:02 sql
    drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 streaming
    drwxrwxr-x 3 1000 1000  4096 Dec 10 18:02 tools
    -rw-rw-r-- 1 1000 1000   838 Dec 10 18:02 tox.ini
    drwxrwxr-x 5
Re:Re: Announcing Spark 1.2!
On the http://spark.apache.org/downloads.html page, we can't download the newest Spark release yet.

At 2014-12-19 17:55:29, Sean Owen <so...@cloudera.com> wrote:

Tag 1.2.0 is older than 1.2.0-rc2. I wonder if it just didn't get updated. I assume it's going to be 1.2.0-rc2 plus a few commits related to the release process.

On Fri, Dec 19, 2014 at 9:50 AM, Shixiong Zhu <zsxw...@gmail.com> wrote:

Congrats! A little question about this release: which commit is this release based on? v1.2.0 and v1.2.0-rc2 point to different commits in https://github.com/apache/spark/releases

Best Regards,
Shixiong Zhu

2014-12-19 16:52 GMT+08:00 Patrick Wendell <pwend...@gmail.com>:

I'm happy to announce the availability of Spark 1.2.0! Spark 1.2.0 is the third release on the API-compatible 1.X line. It is Spark's largest release ever, with contributions from 172 developers and more than 1,000 commits!

This release brings operational and performance improvements in Spark core, including a new network transport subsystem designed for very large shuffles. Spark SQL introduces an API for external data sources along with Hive 13 support, dynamic partitioning, and the fixed-precision decimal type. MLlib adds a new pipeline-oriented package (spark.ml) for composing multiple algorithms. Spark Streaming adds a Python API and a write-ahead log for fault tolerance. Finally, GraphX has graduated from alpha and introduces a stable API along with performance improvements.

Visit the release notes [1] to read about the new features, or download [2] the release today. For errata in the contributions or release notes, please e-mail me *directly* (not on-list). Thanks to everyone involved in creating, testing, and documenting this release!

[1] http://spark.apache.org/releases/spark-release-1-2-0.html
[2] http://spark.apache.org/downloads.html
network.ConnectionManager error
Hi,

When I run a Spark job on YARN, the job finishes successfully, but I found some ERROR entries in the log file, as follows (note the ERROR lines):

    14/09/17 18:25:03 INFO ui.SparkUI: Stopped Spark web UI at http://sparkserver2.cn:63937
    14/09/17 18:25:03 INFO scheduler.DAGScheduler: Stopping DAGScheduler
    14/09/17 18:25:03 INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors
    14/09/17 18:25:03 INFO cluster.YarnClusterSchedulerBackend: Asking each executor to shut down
    14/09/17 18:25:03 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(sparkserver2.cn,9072)
    14/09/17 18:25:03 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(sparkserver2.cn,9072)
    14/09/17 18:25:03 ERROR network.ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(sparkserver2.cn,9072) not found
    14/09/17 18:25:03 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(sparkserver2.cn,14474)
    14/09/17 18:25:03 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(sparkserver2.cn,14474)
    14/09/17 18:25:03 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(sparkserver2.cn,14474)
    14/09/17 18:25:04 INFO spark.MapOutputTrackerMasterActor: MapOutputTrackerActor stopped!
    14/09/17 18:25:04 INFO network.ConnectionManager: Selector thread was interrupted!
    14/09/17 18:25:04 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(sparkserver2.cn,9072)
    14/09/17 18:25:04 INFO network.ConnectionManager: Removing SendingConnection to ConnectionManagerId(sparkserver2.cn,14474)
    14/09/17 18:25:04 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(sparkserver2.cn,9072)
    14/09/17 18:25:04 ERROR network.ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(sparkserver2.cn,9072) not found
    14/09/17 18:25:04 INFO network.ConnectionManager: Removing ReceivingConnection to ConnectionManagerId(sparkserver2.cn,14474)
    14/09/17 18:25:04 ERROR network.ConnectionManager: Corresponding SendingConnection to ConnectionManagerId(sparkserver2.cn,14474) not found
    14/09/17 18:25:04 WARN network.ConnectionManager: All connections not cleaned up
    14/09/17 18:25:04 INFO network.ConnectionManager: ConnectionManager stopped
    14/09/17 18:25:04 INFO storage.MemoryStore: MemoryStore cleared
    14/09/17 18:25:04 INFO storage.BlockManager: BlockManager stopped
    14/09/17 18:25:04 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
    14/09/17 18:25:04 INFO spark.SparkContext: Successfully stopped SparkContext
    14/09/17 18:25:04 INFO yarn.ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED
    14/09/17 18:25:04 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
    14/09/17 18:25:04 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
    14/09/17 18:25:04 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
    14/09/17 18:25:04 INFO Remoting: Remoting shut down
    14/09/17 18:25:04 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remoting shut down.

What is the cause of these errors? My Spark version is 1.1.0 and my Hadoop version is 2.2.0.

Thank you.
How to use jdbcRDD in JAVA
Hi, I want to know how to use JdbcRDD from Java (not Scala). I am trying to figure out the last parameter in the constructor of JdbcRDD. Thanks.
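For context, the Scala constructor (which is what has to be mirrored from Java) looks roughly like the sketch below; the last parameter is mapRow, a function that turns each java.sql.ResultSet row into a record. This is a minimal Scala sketch: `sc` is assumed to be an existing SparkContext (as in the shell examples above), the JDBC URL, credentials, and the `people(id, name)` table are made up for illustration, and the SQL must contain two '?' placeholders that Spark fills with the partition bounds.

    import java.sql.{DriverManager, ResultSet}
    import org.apache.spark.rdd.JdbcRDD

    // Hypothetical table `people(id BIGINT, name VARCHAR)` in a test database.
    val rdd = new JdbcRDD(
      sc,
      () => DriverManager.getConnection("jdbc:mysql://localhost/test", "user", "password"),
      "SELECT id, name FROM people WHERE id >= ? AND id <= ?", // the two '?' receive the bounds
      1L,    // lowerBound
      1000L, // upperBound
      3,     // numPartitions
      (rs: ResultSet) => (rs.getLong(1), rs.getString(2)))     // mapRow: the "last parameter"

From Java the same parameter is supplied as a function object; if I remember correctly, newer Spark releases also expose a Java-friendly JdbcRDD.create(...) helper that takes an org.apache.spark.api.java.function.Function<ResultSet, T> in that position, but please check the API docs of your Spark version.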