Exception in Shutdown-thread, bad file descriptor
Hi all,

We are getting the following exception, and it somehow blocks the parent thread from proceeding further.

17/11/14 16:50:09 SPARK_APP WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/14 16:50:17 SPARK_APP WARN SparkContext: Use an existing SparkContext, some configuration may not take effect.
[Stage 4:> (13 + 1) / 200][Stage 5:> (1 + 0) / 8][Stage 8:> (0 + 0) / 8]
Exception in thread "Shutdown-checker" java.lang.RuntimeException: eventfd_write() failed: Bad file descriptor
        at io.netty.channel.epoll.Native.eventFdWrite(Native Method)
        at io.netty.channel.epoll.EpollEventLoop.wakeup(EpollEventLoop.java:106)
        at io.netty.util.concurrent.SingleThreadEventExecutor.shutdownGracefully(SingleThreadEventExecutor.java:538)
        at io.netty.util.concurrent.MultithreadEventExecutorGroup.shutdownGracefully(MultithreadEventExecutorGroup.java:146)
        at io.netty.util.concurrent.AbstractEventExecutorGroup.shutdownGracefully(AbstractEventExecutorGroup.java:69)
        at com.datastax.driver.core.NettyOptions.onClusterClose(NettyOptions.java:193)
        at com.datastax.driver.core.Connection$Factory.shutdown(Connection.java:902)
        at com.datastax.driver.core.Cluster$Manager$ClusterCloseFuture$1.run(Cluster.java:2539)

This is very hard to replicate, so I am not able to come up with a reproducible recipe. Has anyone faced a similar issue? Any help is appreciated.

Spark version: 2.0.1

Thanks and regards,
Noorul
Controlling the number of Spark partitions in DataFrames
Hi all,

I have the following Spark configuration:

spark.app.name=Test
spark.cassandra.connection.host=127.0.0.1
spark.cassandra.connection.keep_alive_ms=5000
spark.cassandra.connection.port=1
spark.cassandra.connection.timeout_ms=3
spark.cleaner.ttl=3600
spark.default.parallelism=4
spark.master=local[2]
spark.ui.enabled=false
spark.ui.showConsoleProgress=false

Because I am setting spark.default.parallelism to 4, I was expecting only 4 Spark partitions, but it looks like that is not the case. When I do the following:

df.foreachPartition { partition =>
  val groupedPartition = partition.toList.grouped(3).toList
  println("Grouped partition " + groupedPartition)
}

there are too many print statements with an empty list at the top; only the relevant partitions are at the bottom. Is there a way to control the number of partitions?

Regards,
Noorul

- To unsubscribe e-mail: user-unsubscr...@spark.apache.org
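A likely explanation (an assumption based on the symptoms, not stated in the thread): spark.default.parallelism governs RDD operations, while DataFrame shuffle operations (joins, aggregations) are governed by spark.sql.shuffle.partitions, which defaults to 200 and would produce exactly the mostly-empty partitions described above. A sketch of the relevant setting:

```
# DataFrame shuffles ignore spark.default.parallelism; they use this
# setting instead, which defaults to 200:
spark.sql.shuffle.partitions=4
```

Alternatively, df.repartition(4) or df.coalesce(4) can be applied to an individual DataFrame just before foreachPartition to control its partition count directly.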
What is the equivalent of foreachRDD in DataFrames?
Hi all,

I have a DataFrame with 1000 records. I want to split them into batches of 100 each and post them to a REST API. With streaming, I could use something like this (foreachRDD is a DStream method, so myStream here is a DStream):

myStream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // post each batch to the REST API
  }
}

This ensures that the code is executed on the executors and not on the driver. Is there a similar approach that we can take for DataFrames? The examples I see on Stack Overflow use collect(), which will bring the whole data to the driver.

Thanks and regards,
Noorul
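For what it's worth, Dataset/DataFrame exposes foreachPartition directly, so the same executor-side pattern applies without collect(); the batching itself is plain Scala via Iterator.grouped. A minimal sketch of the batching logic (postBatch is a hypothetical stand-in for the REST call, and the Ints stand in for rows):

```scala
// On a real DataFrame you would write, on the executors:
//   df.foreachPartition { rows => rows.grouped(100).foreach(postBatch) }
// The grouping logic itself is plain Scala and needs no collect():
def postBatch(batch: Seq[Int]): Unit =
  println(s"POSTing ${batch.size} records")   // stand-in for the REST call

val partition = (1 to 1000).iterator          // stand-in for one partition's rows
val batches = partition.grouped(100).toList   // 10 batches of 100 records each
batches.foreach(postBatch)
```

Because grouped works on the partition's iterator, each executor only materializes one batch of 100 rows at a time rather than the whole partition.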
Re: How best we can store streaming data on dashboards for real time user experience?
I think a better place for real time would be an in-memory cache.

Regards,
Noorul

On Thu, Mar 30, 2017 at 10:31 AM, Gaurav1809 wrote:
> I am getting streaming data and want to show it on dashboards in real
> time. May I know how best we can handle these streaming data? Where to
> store them? (DB or HDFS or ???) I want to give users a real-time
> analytics experience.
>
> Please suggest possible ways. Thanks.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-best-we-can-store-streaming-data-on-dashboards-for-real-time-user-experience-tp28548.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
Application kill from UI does not propagate exception
Hi all,

I am trying to trap the UI kill event of a Spark application from the driver. Somehow the exception thrown is not propagated to the driver main program. See the spark-shell example below. Is there a way to get hold of this event and shut down the driver program?

Regards,
Noorul

spark@spark1:~/spark-2.1.0/sbin$ spark-shell --master spark://10.29.83.162:7077
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/03/23 15:16:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/03/23 15:16:53 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Spark context Web UI available at http://10.29.83.162:4040
Spark context available as 'sc' (master = spark://10.29.83.162:7077, app id = app-20170323151648-0002).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 17/03/23 15:17:28 ERROR StandaloneSchedulerBackend: Application has been killed. Reason: Master removed our application: KILLED
17/03/23 15:17:28 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Exiting due to error from cluster scheduler: Master removed our application: KILLED
        at org.apache.spark.scheduler.TaskSchedulerImpl.error(TaskSchedulerImpl.scala:459)
        at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.dead(StandaloneSchedulerBackend.scala:139)
        at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint.markDead(StandaloneAppClient.scala:254)
        at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$receive$1.applyOrElse(StandaloneAppClient.scala:168)
        at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
        at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
        at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
        at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

scala> sc
res0: org.apache.spark.SparkContext = org.apache.spark.SparkContext@25b8f9d2

scala>
Re: How do I deal with an ever-growing application log
Or you could use sinks like Elasticsearch.

Regards,
Noorul

On Mon, Mar 6, 2017 at 10:52 AM, devjyoti patra wrote:
> Timothy, why are you writing application logs to HDFS? In case you want to
> analyze these logs later, you can write to local storage on your slave nodes
> and later rotate those files to a suitable location. If they are only going
> to be useful for debugging the application, you can always remove them
> periodically.
> Thanks,
> Dev
>
> On Mar 6, 2017 9:48 AM, "Timothy Chan" wrote:
>>
>> I'm running a single-worker EMR cluster for a Structured Streaming job.
>> How do I deal with my application log filling up HDFS?
>>
>> /var/log/spark/apps/application_1487823545416_0021_1.inprogress
>>
>> is currently 21.8 GB
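One caveat worth hedging: a file named application_*.inprogress under /var/log/spark/apps looks like a Spark event log (written for the history server) rather than an ordinary application log, and its size is governed by what the job does, not by log4j. For the ordinary driver/executor logs, though, a rolling appender keeps them bounded; a sketch of a log4j.properties fragment (the path, sizes, and backup count are illustrative, not prescribed anywhere in this thread):

```
# Illustrative log4j.properties fragment: cap application logs with a
# rolling appender instead of letting a single file grow unbounded.
log4j.rootLogger=INFO, rolling
log4j.appender.rolling=org.apache.log4j.RollingFileAppender
log4j.appender.rolling.File=/var/log/spark/app.log
log4j.appender.rolling.MaxFileSize=100MB
log4j.appender.rolling.MaxBackupIndex=10
log4j.appender.rolling.layout=org.apache.log4j.PatternLayout
log4j.appender.rolling.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```

This can be shipped to executors via --files log4j.properties plus the usual spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties incantation.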
Re: Testing --supervise flag
Widening to dev@spark

On Mon, Aug 1, 2016 at 4:21 PM, Noorul Islam K M wrote:
>
> Hi all,
>
> I was trying to test the --supervise flag of spark-submit.
>
> The documentation [1] says that the flag helps in restarting your
> application automatically if it exited with a non-zero exit code.
>
> I am looking for some clarification on that documentation. In this
> context, does "application" mean the driver?
>
> Will the driver be re-launched if an exception is thrown by the
> application? I tested this scenario and the driver is not re-launched.
>
> ~/spark-1.6.1/bin/spark-submit --deploy-mode cluster --master
> spark://10.29.83.162:6066 --class
> org.apache.spark.examples.ExceptionHandlingTest
> /home/spark/spark-1.6.1/lib/spark-examples-1.6.1-hadoop2.6.0.jar
>
> I killed the driver Java process using the 'kill -9' command, and the driver
> was re-launched.
>
> Is this the only scenario where the driver will be re-launched? Is there a
> way to simulate a non-zero exit code and test the use of --supervise?
>
> Regards,
> Noorul
>
> [1]
> http://spark.apache.org/docs/latest/spark-standalone.html#launching-spark-applications
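One way to simulate a non-zero exit code is a minimal driver that maps a job outcome to an exit status and terminates with it; a sketch, with the object name and failure condition hypothetical (in standalone cluster mode, a non-zero driver exit status should be what --supervise reacts to, though this thread suggests verifying that empirically):

```scala
// Hypothetical minimal driver for exercising --supervise: submit it in
// cluster mode and it exits with a non-zero status.
object ExitNonZero {
  // Map a job outcome to a process exit status; kept as a separate
  // function so the mapping itself is easy to test.
  def exitCode(jobFailed: Boolean): Int = if (jobFailed) 1 else 0

  def main(args: Array[String]): Unit = {
    // ... run the actual job here ...
    sys.exit(exitCode(jobFailed = true)) // force a non-zero exit for the test
  }
}
```

An uncaught exception in main should also yield a non-zero JVM exit status, so the ExceptionHandlingTest experiment above was a reasonable probe; the observed non-restart is exactly what the question is about.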
Re: Application not showing in Spark History
Have you tried https://github.com/spark-jobserver/spark-jobserver ?

On Tue, Aug 2, 2016 at 2:23 PM, Rychnovsky, Dusan wrote:
> Hi,
>
> I am trying to launch my Spark application from within my Java application
> via the SparkSubmit class, like this:
>
> List<String> args = new ArrayList<>();
> args.add("--verbose");
> args.add("--deploy-mode=cluster");
> args.add("--master=yarn");
> ...
> SparkSubmit.main(args.toArray(new String[args.size()]));
>
> This works fine, with one catch - the application does not appear in Spark
> History after it's finished.
>
> If, however, I run the application using `spark-submit.sh`, like this:
>
> spark-submit \
>   --verbose \
>   --deploy-mode=cluster \
>   --master=yarn \
>   ...
>
> the application appears in Spark History correctly.
>
> What am I missing?
>
> Also, is this a good way to launch a Spark application from within a Java
> application or is there a better way?
>
> Thanks,
> Dusan
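One plausible cause (an assumption, not confirmed in the thread): calling SparkSubmit.main in-process may not pick up the spark-defaults.conf and environment that the spark-submit.sh wrapper loads, so event logging never gets enabled, and the history server only shows applications whose event logs it can find. The usual event-log settings look like this (the log directory is illustrative):

```
spark.eventLog.enabled=true
spark.eventLog.dir=hdfs:///spark-logs
spark.history.fs.logDirectory=hdfs:///spark-logs
```

As for a better programmatic entry point: org.apache.spark.launcher.SparkLauncher (available since Spark 1.4) is the supported API for launching Spark applications from Java code, whereas SparkSubmit is an internal class.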
Re: When worker is killed driver continues to run causing issues in supervise mode
Adding dev list

On Jul 13, 2016 5:38 PM, "Noorul Islam K M" wrote:
>
> Spark version: 1.6.1
> Cluster manager: standalone
>
> I am experimenting with cluster mode deployment along with supervise for
> high availability of streaming applications.
>
> 1. Submit a streaming job in cluster mode with supervise.
> 2. Say that the driver is scheduled on worker1. The app starts
>    successfully.
> 3. Kill the worker1 Java process. This does not kill the driver process,
>    and hence the application (context) is still alive.
> 4. Because of the supervise flag, the driver gets scheduled to a new
>    worker, worker2, and hence a new context is created, making it a
>    duplicate.
>
> I think this seems to be a bug.
>
> Regards,
> Noorul
Cassandra read throughput using DataStax connector in Spark
Hello all,

I am using the DataStax connector to read data from Cassandra and write to another Cassandra cluster. The infrastructure is Amazon. Both clusters have three nodes with a replication factor of 3.

The throughput seems to be very low: it takes 7 minutes to transfer around 2.5 GB/node. I think the bottleneck is on the read side, as I could see that the Spark node (independent of the two clusters) is lightly loaded with respect to memory and CPU.

I tried tweaking some parameters from
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/reference.md#cassandra-connection-parameters

Do you have any idea whether there is any parameter that I can tweak to get better throughput?

Regards,
Noorul
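For read throughput, the knobs that usually matter in the connector's reference.md are the input (read) parameters rather than the connection ones linked above. A sketch of the relevant settings, shown at their documented defaults as a starting point for experimentation (the right values depend on the cluster, so treat these as placeholders to tune, not recommendations):

```
# Read side: smaller split size => more, smaller Spark partitions and
# therefore more read parallelism; larger fetch size => fewer round trips.
spark.cassandra.input.split.size_in_mb=64
spark.cassandra.input.fetch.size_in_rows=1000
# Write side, on the destination cluster:
spark.cassandra.output.concurrent.writes=5
spark.cassandra.output.batch.size.bytes=1024
```

If the Spark node is lightly loaded, it is also worth checking that the number of executor cores actually allows enough concurrent tasks to cover the partitions produced by the split size.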