Re: How to use FlumeInputDStream in spark cluster?
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:180)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

Thanks,
Ping
Re: How to use FlumeInputDStream in spark cluster?
Hi,

A BindException comes when two processes are using the same port. In your Spark configuration, just set spark.ui.port to some other port; it can be any free port, say 12345. The BindException will not break your job in either case; to get rid of it, just change the port number.

Thanks.

On Fri, Nov 28, 2014 at 1:30 PM, pamtang [via Apache Spark User List] wrote:

> I'm seeing the same issue on CDH 5.2 with Spark 1.1. FlumeEventCount works fine on a standalone cluster but throws a BindException in YARN mode. Is there a solution to this problem, or will FlumeInputDStream not work in a cluster environment?
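A minimal sketch of that change, assuming you build the conf in code (the app name and the port 12345 are just placeholders; any free port works, and the same setting can be passed on the command line instead):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Move the web UI off the default port (4040) so it cannot collide
// with another process on the same machine. 12345 is arbitrary.
val conf = new SparkConf()
  .setAppName("FlumeEventCount")   // placeholder app name
  .set("spark.ui.port", "12345")
val ssc = new StreamingContext(conf, Seconds(2))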
Re: How to use FlumeInputDStream in spark cluster?
I have tested the example code FlumeEventCount on a standalone cluster, and this is still a problem in Spark 1.1.0, the latest version as of now. Have you found a way to solve this issue?
Re: How to use FlumeInputDStream in spark cluster?
Hey,

Even I am getting the same error. I am running

sudo ./run-example org.apache.spark.streaming.examples.FlumeEventCount spark://spark_master_hostname:7077 spark_master_hostname 7781

and getting no events in Spark Streaming:

-------------------------------------------
Time: 1395395676000 ms
-------------------------------------------
Received 0 flume events.

14/03/21 09:54:36 INFO JobScheduler: Finished job streaming job 1395395676000 ms.0 from job set of time 1395395676000 ms
14/03/21 09:54:36 INFO JobScheduler: Total delay: 0.196 s for time 1395395676000 ms (execution: 0.111 s)
14/03/21 09:54:38 INFO NetworkInputTracker: Stream 0 received 0 blocks
14/03/21 09:54:38 INFO SparkContext: Starting job: take at DStream.scala:586
14/03/21 09:54:38 INFO JobScheduler: Starting job streaming job 1395395678000 ms.0 from job set of time 1395395678000 ms
14/03/21 09:54:38 INFO DAGScheduler: Registering RDD 73 (combineByKey at ShuffledDStream.scala:42)
14/03/21 09:54:38 INFO DAGScheduler: Got job 16 (take at DStream.scala:586) with 1 output partitions (allowLocal=true)
14/03/21 09:54:38 INFO DAGScheduler: Final stage: Stage 31 (take at DStream.scala:586)
14/03/21 09:54:38 INFO DAGScheduler: Parents of final stage: List(Stage 32)
14/03/21 09:54:38 INFO JobScheduler: Added jobs for time 1395395678000 ms
14/03/21 09:54:38 INFO DAGScheduler: Missing parents: List(Stage 32)
14/03/21 09:54:38 INFO DAGScheduler: Submitting Stage 32 (MapPartitionsRDD[73] at combineByKey at ShuffledDStream.scala:42), which has no missing parents
14/03/21 09:54:38 INFO DAGScheduler: Submitting 1 missing tasks from Stage 32 (MapPartitionsRDD[73] at combineByKey at ShuffledDStream.scala:42)
14/03/21 09:54:38 INFO TaskSchedulerImpl: Adding task set 32.0 with 1 tasks
14/03/21 09:54:38 INFO TaskSetManager: Starting task 32.0:0 as TID 92 on executor 2: c8-data-store-4.srv.media.net (PROCESS_LOCAL)
14/03/21 09:54:38 INFO TaskSetManager: Serialized task 32.0:0 as 2971 bytes in 1 ms
14/03/21 09:54:38 INFO TaskSetManager: Finished TID 92 in 41 ms on c8-data-store-4.srv.media.net (progress: 0/1)
14/03/21 09:54:38 INFO TaskSchedulerImpl: Remove TaskSet 32.0 from pool

Also, on closer look, I got

INFO SparkContext: Job finished: runJob at NetworkInputTracker.scala:182, took 0.523621327 s
14/03/21 09:54:35 ERROR NetworkInputTracker: De-registered receiver for network stream 0 with message org.jboss.netty.channel.ChannelException: Failed to bind to: c8-data-store-1.srv.media.net/172.16.200.124:7781

I couldn't understand the NetworkInputTracker behavior that you described. Can you elaborate on that? I only understood that the master picks one of the worker nodes for the connection and stays on it while the program runs. Why is it not using the host and port I am providing? Also, does the host and port necessarily have to be on a worker node?
Re: How to use FlumeInputDStream in spark cluster?
Hi,

This is my summary of the gap between the expected and the actual behavior of

FlumeEventCount spark://spark_master_hostname:7077 address port

Expected: an 'agent' listening on address:port (i.e., binding to it). In the context of Spark, this agent should be running on one of the slaves, namely the slave whose IP/hostname is address.

Observed: a random slave is chosen from the pool of available slaves. In a cluster environment, this is likely not the slave that has the actual address, which in turn causes the 'Failed to bind to ...' error. This follows naturally, because the slave that is running the code to bind to address:port has a different IP.
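To make the observed behavior concrete, here is a minimal sketch of the call the example makes (hostname and app name are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

val conf = new SparkConf().setAppName("FlumeEventCount") // placeholder
val ssc = new StreamingContext(conf, Seconds(2))

// 'address' is used by the receiver only when it opens its listening
// socket. It plays no part in *where* the scheduler places the
// receiver; that is an arbitrary slave, and unless that slave happens
// to own 'address', the bind fails.
val stream = FlumeUtils.createStream(ssc, "address", 7781)
stream.count().map(c => "Received " + c + " flume events.").print()

ssc.start()
ssc.awaitTermination()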
Re: How to use FlumeInputDStream in spark cluster?
On 03/21/2014 06:17 PM, anoldbrain [via Apache Spark User List] wrote:

> ... the actual address, which in turn causes the 'Failed to bind to ...' error. This comes naturally because the slave that is running the code to bind to address:port has a different IP.

So if we run the code on the slave from which we are sending the data with the Flume agent, it should work. Let me give this a shot and check what happens. Thank you for the immediate reply. I'll keep you posted.
Re: How to use FlumeInputDStream in spark cluster?
On 03/21/2014 06:17 PM, anoldbrain [via Apache Spark User List] wrote:

> ... the actual address, which in turn causes the 'Failed to bind to ...' error. This comes naturally because the slave that is running the code to bind to address:port has a different IP.

I ran

sudo ./run-example org.apache.spark.streaming.examples.FlumeEventCount spark://spark_master_hostname:7077 worker_hostname 7781

on worker_hostname, and it still shows

14/03/21 13:12:12 ERROR scheduler.NetworkInputTracker: De-registered receiver for network stream 0 with message org.jboss.netty.channel.ChannelException: Failed to bind to: worker_hostname/worker_ipaddress:7781
14/03/21 13:12:12 INFO spark.SparkContext: Job finished: runJob at NetworkInputTracker.scala:182, took 0.530447982 s
14/03/21 13:12:14 INFO scheduler.NetworkInputTracker: Stream 0 received 0 blocks

Weird issue. I need to set up Spark Streaming and get it running, so I am thinking of switching to Kafka. I haven't checked it yet, but I don't see a workaround for this. Any help would be good. In the meantime I am making changes in flume.conf and trying different settings. Thank you.
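For comparison, the kind of flume.conf change involved looks roughly like this (the agent, sink, and channel names are placeholders); the avro sink's hostname/port must match the address the receiver actually binds to, which is exactly what the scheduler does not guarantee:

# hypothetical agent 'agent1' with an existing channel 'ch1'
agent1.sinks = spark
agent1.sinks.spark.type = avro
agent1.sinks.spark.hostname = worker_hostname
agent1.sinks.spark.port = 7781
agent1.sinks.spark.channel = ch1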
Re: How to use FlumeInputDStream in spark cluster?
It is my understanding that there is no way to make FlumeInputDStream work in a cluster environment with the current release. Switching to Kafka, if you can, would be my suggestion, although I have not used KafkaInputDStream myself. There is a big difference between the Kafka and Flume InputDStreams: KafkaInputDStreams are consumers (clients), whereas a FlumeInputDStream needs to listen on a specific address:port so that other Flume agents can send messages to it. This may give Kafka an advantage in performance too.
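For what it's worth, here is a minimal sketch of the Kafka alternative (the topic name 'events', the ZooKeeper address, the group id, and the app name are all placeholders); note that this receiver connects out as a client instead of binding a listening socket, so it does not matter which slave it runs on:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("KafkaEventCount") // placeholder
val ssc = new StreamingContext(conf, Seconds(2))

// A consumer: it connects to the brokers (via ZooKeeper here), so
// there is no bind() on a fixed host and no placement constraint on
// the receiver.
val stream = KafkaUtils.createStream(
  ssc,
  "zk-host:2181",      // ZooKeeper quorum (placeholder)
  "flume-test-group",  // consumer group id (placeholder)
  Map("events" -> 1))  // topic -> number of consumer threads

stream.map(_._2).count().map(c => "Received " + c + " kafka events.").print()

ssc.start()
ssc.awaitTermination()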