zookeeper mesos logging in spark
Hi,

Every time I run my Spark application on Mesos, I get logs like these in my console:

2016-08-26 15:25:30,949:960521(0x7f6bccff9700):ZOO_INFO@log_env
2016-08-26 15:25:30,949:960521(0x7f6bccff9700):ZOO_INFO@zookeep
2016-08-26 15:25:30,951:960521(0x7f6bb4ff9700):ZOO_INFO@check_e
2016-08-26 15:25:30,952:960521(0x7f6bb4ff9700):ZOO_INFO@check_e
I0826 15:25:30.949254 960752 sched.cpp:222] Version: 0.28.2
I0826 15:25:30.952505 960729 group.cpp:349] Group process (grou
I0826 15:25:30.952570 960729 group.cpp:831] Syncing group opera
I0826 15:25:30.952592 960729 group.cpp:427] Trying to create pa
I0826 15:25:30.954211 960722 detector.cpp:152] Detected a new l
I0826 15:25:30.954320 960744 group.cpp:700] Trying to get '/mes
I0826 15:25:30.955345 960724 detector.cpp:479] A new leading ma
I0826 15:25:30.955451 960724 sched.cpp:326] New master detected
I0826 15:25:30.955567 960724 sched.cpp:336] No credentials prov
I0826 15:25:30.956478 960732 sched.cpp:703] Framework registere

Does anybody know how to disable them through spark-submit?

Cheers and many thanks

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/zookeeper-mesos-logging-in-spark-tp27607.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
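For context: the ZOO_INFO and I0826 lines come from the native Mesos library (which logs via glog) and the embedded ZooKeeper C client, not from Spark's log4j, so a log4j.properties change will not silence them. One avenue that may help, shown here as a sketch, is setting environment variables before spark-submit; GLOG_minloglevel is a documented glog variable, while MESOS_quiet maps the Mesos --quiet flag and may not apply on every Mesos version -- both values here are assumptions to verify, and the master URL, class, and jar names are placeholders.

```shell
# Raise glog's minimum level for the native Mesos scheduler library
# (0 = INFO, 1 = WARNING, 2 = ERROR).
export GLOG_minloglevel=1
# Assumption: Mesos flags can be passed via MESOS_* environment
# variables; --quiet suppresses status output on some versions.
export MESOS_quiet=1

spark-submit \
  --master mesos://zk://zk-host:2181/mesos \
  --class com.example.MyApp \
  my-app.jar
```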
Re: Spark task hangs infinitely when accessing S3 from AWS
Sorry, I have not been able to solve the issue. I used speculation mode as a workaround.
Re: Spark task hangs infinitely when accessing S3 from AWS
Any hints?
Re: Spark task hangs infinitely when accessing S3 from AWS
Some other stats:

- Number of files in the folder: 48
- Number of partitions used when reading the data: 7315
- Maximum size of a single file to read: 14 GB
- Total size of the folder: around 270 GB
Re: Spark task hangs infinitely when accessing S3 from AWS
Any help on this? It is really blocking me, and I have not found any feasible solution yet. Thanks.
Spark task hangs infinitely when accessing S3 from AWS
Hi guys,

When reading data from S3 on AWS with Spark 1.5.1, one of the tasks hangs while reading, in a way that cannot be reproduced reliably: sometimes it hangs, sometimes it doesn't. This is the thread dump from the hung task:

"Executor task launch worker-3" daemon prio=10 tid=0x7f419c023000 nid=0x6548 runnable [0x7f425df2b000]
   java.lang.Thread.State: RUNNABLE
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:152)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
        at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
        at sun.security.ssl.InputRecord.read(InputRecord.java:509)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:934)
        - locked <0x7f42c373b4d8> (a java.lang.Object)
        at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1332)
        - locked <0x7f42c373b610> (a java.lang.Object)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1359)
        at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1343)
        at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:533)
        at org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:401)
        at org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
        at org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
        at org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:610)
        at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:445)
        at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
        at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
        at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
        at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
        at org.apache.avro.mapred.FsInput.<init>(FsInput.java:37)
        at org.apache.avro.mapreduce.AvroRecordReaderBase.createSeekableInput(AvroRecordReaderBase.java:171)
        at org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:87)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:153)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:124)
        at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)

These are the only settings I pass manually (besides the S3 credentials):

--conf spark.driver.maxResultSize=4g \
--conf spark.akka.frameSize=500 \
--conf spark.hadoop.fs.s3a.connection.maximum=500 \

I'm using aws-java-sdk-1.7.4.jar and hadoop-aws-2.7.1.jar to be able to read data from AWS. I have been struggling with this issue for a long time; the only workaround I have been able to find is Spark speculation, but that is no longer a feasible solution for me.

Thanks
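The dump shows the task parked in socketRead0 during the SSL handshake, which suggests no socket timeout was in effect for the S3 connection, so a wedged connection waits forever instead of failing and being retried. One thing worth trying, as a sketch: set explicit s3a timeouts and a retry budget. These fs.s3a.* keys exist in hadoop-aws 2.7.x (values are in milliseconds), but the specific values below are guesses to tune, and the class and jar names are placeholders.

```shell
spark-submit \
  --conf spark.hadoop.fs.s3a.connection.establish.timeout=5000 \
  --conf spark.hadoop.fs.s3a.connection.timeout=60000 \
  --conf spark.hadoop.fs.s3a.attempts.maximum=10 \
  --conf spark.hadoop.fs.s3a.connection.maximum=500 \
  --class com.example.MyJob \
  my-job.jar
```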
Locality level and Kryo
Hi guys,

It happens quite often that when the locality level of a task goes beyond LOCAL (NODE, RACK, etc.), I get some of the following exceptions: too many open files, encountered unregistered class id, cannot cast X to Y. I get no exceptions during shuffling (which suggests that Kryo works well). I'm running Spark 1.0.0 with the following characteristics:

- 18 executors with 30 GB each
- YARN client mode
- ulimit set to 500k
- Input data: an HDFS file with 1000 partitions, 10 GB in size

Any hint would be appreciated.
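For reference, "encountered unregistered class id" errors usually mean a task deserialized data with a Kryo instance whose class registrations differ from those of the instance that serialized it; registering every shipped class explicitly keeps the IDs stable across JVMs. A minimal sketch of explicit registration, where the Record class and registrator name are illustrative, not from this post:

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Illustrative domain class -- substitute the classes your job ships.
case class Record(id: Long, payload: String)

// Registers classes in a fixed order so Kryo assigns the same IDs
// on every executor.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Record])
    kryo.register(classOf[Array[Record]])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")
// On later Spark versions, spark.kryo.registrationRequired=true makes
// any unregistered class fail fast, which helps find the culprit.
```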
Locality Level Kryo
Hi guys,

I get Kryo exceptions of the type "unregistered class id" and "cannot cast to class" when the locality level of the tasks goes beyond LOCAL. However, I get no Kryo exceptions during shuffle operations. If the locality level never goes beyond LOCAL, everything works fine.

Is there a special reason for this? How does Kryo handle tasks that are not run at the same locality?

Cheers
Using Spark Context as an attribute of a class cannot be used
Hello guys,

I'm using Spark 1.0.0 with Kryo serialization. In the Spark shell, when I create a class that holds the SparkContext as an attribute, like this:

class AAA(val s: SparkContext) { }
val aaa = new AAA(sc)

and I execute any action through that attribute, such as:

val myNumber = 5
aaa.s.textFile(FILE).filter(_ == myNumber.toString).count

or

aaa.s.parallelize(1 to 10).filter(_ == myNumber).count

it throws a NotSerializableException:

org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$AAA
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713)
        at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Any thoughts on how to solve or work around this issue? I'm developing an API that will need to use this SparkContext several times in different places, so it needs to stay accessible.

Thanks a lot for the cooperation
Re: Using Spark Context as an attribute of a class cannot be used
Marcelo Vanzin wrote:
> Do you expect to be able to use the spark context on the remote task?

Not at all. What I want is a wrapper around the SparkContext, to be used only on the driver node. This AAA wrapper would hold several attributes, such as the SparkContext and other configuration for my project.

I tested with -Dsun.io.serialization.extendedDebugInfo=true; this is the stack trace:

org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$AAA
        - field (class $iwC$$iwC$$iwC$$iwC, name: aaa, type: class $iwC$$iwC$$iwC$$iwC$AAA)
        - object (class $iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC@24e57dcb)
        - field (class $iwC$$iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC$$iwC)
        - object (class $iwC$$iwC$$iwC, $iwC$$iwC$$iwC@178cc62b)
        - field (class $iwC$$iwC, name: $iw, type: class $iwC$$iwC$$iwC)
        - object (class $iwC$$iwC, $iwC$$iwC@1e9f5eeb)
        - field (class $iwC, name: $iw, type: class $iwC$$iwC)
        - object (class $iwC, $iwC@37d8e87e)
        - field (class $line18.$read, name: $iw, type: class $iwC)
        - object (class $line18.$read, $line18.$read@124551f)
        - field (class $iwC$$iwC$$iwC, name: $VAL15, type: class $line18.$read)
        - object (class $iwC$$iwC$$iwC, $iwC$$iwC$$iwC@2e846e6b)
        - field (class $iwC$$iwC$$iwC$$iwC, name: $outer, type: class $iwC$$iwC$$iwC)
        - object (class $iwC$$iwC$$iwC$$iwC, $iwC$$iwC$$iwC$$iwC@4b31ba1b)
        - field (class $iwC$$iwC$$iwC$$iwC$$anonfun$1, name: $outer, type: class $iwC$$iwC$$iwC$$iwC)
        - object (class $iwC$$iwC$$iwC$$iwC$$anonfun$1, function1)
        - field (class org.apache.spark.rdd.FilteredRDD, name: f, type: interface scala.Function1)
        - root object (class org.apache.spark.rdd.FilteredRDD, FilteredRDD[3] at filter at console:20)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)

I don't really understand much of this stack trace; if you can help me, I would appreciate it. Marking the field transient didn't work either.

Thanks a lot
Re: Using Spark Context as an attribute of a class cannot be used
Yes, I'm running this in the shell. In my compiled jar it works perfectly; the issue is that I need to do this in the shell. Any available workarounds?

I checked SQLContext: it is used in the same way I would like to use my class, and it is made Serializable with transient fields. Does this somehow affect the whole pipeline of data movement? I mean, will I get performance issues when doing this, because the class now gets serialized for a reason I still don't understand?

2014-11-24 22:33 GMT+01:00 Marcelo Vanzin:
> Hello,
>
> On Mon, Nov 24, 2014 at 12:07 PM, aecc wrote:
> > This is the stacktrace:
> > org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: $iwC$$iwC$$iwC$$iwC$AAA
> > - field (class $iwC$$iwC$$iwC$$iwC, name: aaa, type: class $iwC$$iwC$$iwC$$iwC$AAA)
>
> Ah. Looks to me that you're trying to run this in spark-shell, right? I'm not 100% sure how it works internally, but I think the Scala REPL works a little differently from regular Scala code in this regard. When you declare a val in the shell, it behaves differently from a val inside a method in a compiled Scala class: the former behaves like an instance variable, the latter like a local variable. That is probably why you're running into this.
>
> Try compiling your code and running it outside the shell to see how it goes. I'm not sure whether there's a workaround for this when trying things out in the shell -- maybe declare an `object` to hold your constants? Never really tried, so YMMV.
>
> --
> Marcelo

--
Alessandro Chacón
Re: Using Spark Context as an attribute of a class cannot be used
Ok, great, I'm going to do it that way, thanks :). However, I still don't understand why this object needs to be serialized and shipped. aaa.s and sc are the same object (org.apache.spark.SparkContext@1f222881), yet this:

aaa.s.parallelize(1 to 10).filter(_ == myNumber).count

needs to be serialized, while this:

sc.parallelize(1 to 10).filter(_ == myNumber).count

does not.

2014-11-24 23:13 GMT+01:00 Marcelo Vanzin:
> On Mon, Nov 24, 2014 at 1:56 PM, aecc wrote:
> > I checked SQLContext: it is used in the same way I would like to use my class, and it is made Serializable with transient fields. Does this somehow affect the whole pipeline of data movement? I mean, will I get performance issues when doing this, because the class now gets serialized for a reason I still don't understand?
>
> If you want to do the same thing, your AAA needs to be serializable, and you need to mark all non-serializable fields as @transient. The only performance penalty you'll pay is the serialization/deserialization of the AAA instance, which will most probably be really small compared to the actual work the task is doing. Unless your class holds a whole lot of data, in which case you should start thinking about using a broadcast variable instead.
>
> --
> Marcelo

--
Alessandro Chacón
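Marcelo's @transient suggestion can be demonstrated outside Spark with plain Java serialization: a @transient field is skipped when the enclosing object is serialized, which is exactly what lets a wrapper class travel with a closure while its SparkContext stays behind on the driver. A minimal, self-contained sketch (the Context and Wrapper names are illustrative, not from the thread):

```scala
import java.io._

// Stand-in for a non-serializable resource such as SparkContext.
class Context { val id = 42 }

// The wrapper is Serializable, but the context field is @transient,
// so serialization skips it instead of throwing NotSerializableException.
class Wrapper(@transient val ctx: Context) extends Serializable {
  val label = "my-app"
}

object TransientDemo {
  def main(args: Array[String]): Unit = {
    val w = new Wrapper(new Context)

    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(w) // would throw here if ctx were not @transient
    out.close()

    val in = new ObjectInputStream(
      new ByteArrayInputStream(bytes.toByteArray))
    val w2 = in.readObject().asInstanceOf[Wrapper]

    println(w2.label) // serializable state survives the round trip
    println(w2.ctx)   // null: the transient field was not shipped
  }
}
```

The trade-off Marcelo describes follows directly: only the non-transient fields are serialized, so shipping the wrapper costs almost nothing, but any @transient field is null after deserialization and must never be touched inside a task.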
Sliding Subwindows
Hello,

I would like to have a kind of sub-window. The idea is to have three windows laid out over the stream in the following way:

future <---------------------------- past
  [ w1 ][          w2          ][ w3 ]

so that I can do some processing on the new data coming into the main window (w1), on the main window itself (w2), and on the data leaving the window (w3).

Any ideas how I can do this in Spark? Is there a way to create sub-windows, or to specify when a window should start reading?
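Spark Streaming has no sub-window primitive, but since all windows over one DStream slide in lockstep, the three regions above can be approximated with window() calls of different lengths that share a slide interval: w1 is a window one slide long, and w3 (the data leaving w2) is a longer window minus the main one. A sketch under those assumptions; the durations, host, and port are placeholders:

```scala
// Sketch, assuming Spark Streaming's DStream window API.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("subwindows").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

val stream = ssc.socketTextStream("localhost", 9999) // hypothetical source

val slide = Seconds(10)
val w2 = stream.window(Seconds(60), slide)      // the main window
val w1 = stream.window(slide, slide)            // newest slide: data entering w2

// Data leaving w2 on each slide: a window one slide longer than w2,
// minus w2 itself, leaves exactly the oldest slide's worth of data.
val w2plus = stream.window(Seconds(60) + slide, slide)
val w3 = w2plus.transformWith(w2,
  (older, main) => older.subtract(main))

w1.count().print()
w3.count().print()
ssc.start()
```

Note the subtract-based w3 removes duplicates that also appear in w2, so if identical records can occur in both regions, keying records with a timestamp before subtracting is safer.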
Kafka in Yarn
Hi,

I would like to know the correct way to add Kafka to my project on standalone YARN, given that it now lives in a different artifact than the Spark core. I tried adding the dependency to my project, but I get a ClassNotFoundException for my main class. It also makes my jar file very big, which is not convenient. This is how I run it:

SPARK_JAR=/opt/spark/assembly/target/scala-2.10/spark-assembly_2.10-0.9.0-incubating-hadoop2.2.0.jar \
/opt/spark/bin/spark-class org.apache.spark.deploy.yarn.Client \
  --jar project.jar \
  --class StarterClass \
  --args yarn-standalone \
  --num-workers 1 \
  --worker-cores 1 \
  --master-memory 1536m \
  --worker-memory 1536m

Thanks
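One common approach, sketched here as a build.sbt fragment under the assumption that the project uses sbt with an assembly plugin and the 0.9.0-incubating build mentioned above: mark spark-core as "provided", since it already ships in the cluster's Spark assembly jar, and bundle only the Kafka connector and its transitive dependencies into the application jar. That keeps the fat jar small and avoids duplicate Spark classes on the classpath.

```scala
// build.sbt (fragment) -- project name and Scala version are placeholders.
name := "my-kafka-app"
scalaVersion := "2.10.3"

libraryDependencies ++= Seq(
  // Provided: already on the cluster via the Spark assembly jar,
  // so it is excluded from the application's fat jar.
  "org.apache.spark" %% "spark-core" % "0.9.0-incubating" % "provided",
  // Bundled: the Kafka connector is NOT in the assembly, so it must
  // ship inside the application jar (e.g. via sbt-assembly).
  "org.apache.spark" %% "spark-streaming-kafka" % "0.9.0-incubating"
)
```

A ClassNotFoundException for the main class usually points at how the jar was packaged rather than at the dependency itself, so it is worth confirming with `jar tf project.jar` that StarterClass actually ends up in the jar passed to --jar.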