Re: How to use FlumeInputDStream in spark cluster?

2014-12-01 Thread Ping Tang
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
        at org.apache.spark.scheduler.Task.run(Task.scala:54)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:180)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Thanks

Ping

From: Prannoy pran...@sigmoidanalytics.com
Date: Friday, November 28, 2014 at 12:56 AM
To: u...@spark.incubator.apache.org
Subject: Re: How to use FlumeInputDStream in spark cluster?

Hi,

A BindException occurs when two processes try to use the same port. In your Spark
configuration, just set (spark.ui.port, x) to some other port; x can be any number,
say 12345. The BindException will not break your job in either case. To fix it,
just change the port number.

Thanks.

On Fri, Nov 28, 2014 at 1:30 PM, pamtang [via Apache Spark User List] [hidden 
email] wrote:
I'm seeing the same issue on CDH 5.2 with Spark 1.1. FlumeEventCount works fine 
on a Standalone cluster but throws a BindException in YARN mode. Is there a 
solution to this problem, or will FlumeInputDStream not work in a cluster 
environment?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-FlumeInputDStream-in-spark-cluster-tp1604p1.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: How to use FlumeInputDStream in spark cluster?

2014-11-28 Thread Prannoy
Hi,

A BindException occurs when two processes try to use the same port. In your
Spark configuration, just set (spark.ui.port, x) to some other port; x can be
any number, say 12345. The BindException will not break your job in either
case. To fix it, just change the port number.
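For example, here is a minimal sketch of what that looks like (assuming the
standard SparkConf/StreamingContext API; the app name and port value are just
placeholders):

    // Sketch only: pick an arbitrary free port for the Spark UI so the driver
    // does not clash with another process already bound to the default port.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf()
      .setAppName("FlumeEventCount")     // placeholder app name
      .set("spark.ui.port", "12345")     // any unused port works

    val ssc = new StreamingContext(conf, Seconds(2))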

Thanks.

On Fri, Nov 28, 2014 at 1:30 PM, pamtang [via Apache Spark User List] 
ml-node+s1001560n1999...@n3.nabble.com wrote:

 I'm seeing the same issue on CDH 5.2 with Spark 1.1. FlumeEventCount works
 fine on a Standalone cluster but throws a BindException in YARN mode. Is there
 a solution to this problem, or will FlumeInputDStream not work in a
 cluster environment?






--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-FlumeInputDStream-in-spark-cluster-tp1604p1.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: How to use FlumeInputDStream in spark cluster?

2014-09-24 Thread julyfire
I have tested the example code FlumeEventCount on a standalone cluster, and this
is still a problem in Spark 1.1.0, the latest version as of now. Have you
solved this issue on your end?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-FlumeInputDStream-in-spark-cluster-tp1604p15102.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: How to use FlumeInputDStream in spark cluster?

2014-03-21 Thread Ravi Hemnani
Hey,


I am getting the same error as well. 

I am running

sudo ./run-example org.apache.spark.streaming.examples.FlumeEventCount
spark://spark_master_hostname:7077 spark_master_hostname 7781

and am getting no events in Spark Streaming. 

---
Time: 1395395676000 ms
---
Received 0 flume events.

14/03/21 09:54:36 INFO JobScheduler: Finished job streaming job
1395395676000 ms.0 from job set of time 1395395676000 ms
14/03/21 09:54:36 INFO JobScheduler: Total delay: 0.196 s for time
1395395676000 ms (execution: 0.111 s)
14/03/21 09:54:38 INFO NetworkInputTracker: Stream 0 received 0 blocks
14/03/21 09:54:38 INFO SparkContext: Starting job: take at DStream.scala:586
14/03/21 09:54:38 INFO JobScheduler: Starting job streaming job
1395395678000 ms.0 from job set of time 1395395678000 ms
14/03/21 09:54:38 INFO DAGScheduler: Registering RDD 73 (combineByKey at
ShuffledDStream.scala:42)
14/03/21 09:54:38 INFO DAGScheduler: Got job 16 (take at DStream.scala:586)
with 1 output partitions (allowLocal=true)
14/03/21 09:54:38 INFO DAGScheduler: Final stage: Stage 31 (take at
DStream.scala:586)
14/03/21 09:54:38 INFO DAGScheduler: Parents of final stage: List(Stage 32)
14/03/21 09:54:38 INFO JobScheduler: Added jobs for time 1395395678000 ms
14/03/21 09:54:38 INFO DAGScheduler: Missing parents: List(Stage 32)
14/03/21 09:54:38 INFO DAGScheduler: Submitting Stage 32
(MapPartitionsRDD[73] at combineByKey at ShuffledDStream.scala:42), which
has no missing parents
14/03/21 09:54:38 INFO DAGScheduler: Submitting 1 missing tasks from Stage
32 (MapPartitionsRDD[73] at combineByKey at ShuffledDStream.scala:42)
14/03/21 09:54:38 INFO TaskSchedulerImpl: Adding task set 32.0 with 1 tasks
14/03/21 09:54:38 INFO TaskSetManager: Starting task 32.0:0 as TID 92 on
executor 2: c8-data-store-4.srv.media.net (PROCESS_LOCAL)
14/03/21 09:54:38 INFO TaskSetManager: Serialized task 32.0:0 as 2971 bytes
in 1 ms
14/03/21 09:54:38 INFO TaskSetManager: Finished TID 92 in 41 ms on
c8-data-store-4.srv.media.net (progress: 0/1)
14/03/21 09:54:38 INFO TaskSchedulerImpl: Remove TaskSet 32.0 from pool 



Also, on closer look, I got 

INFO SparkContext: Job finished: runJob at NetworkInputTracker.scala:182,
took 0.523621327 s
14/03/21 09:54:35 ERROR NetworkInputTracker: De-registered receiver for
network stream 0 with message org.jboss.netty.channel.ChannelException:
Failed to bind to: c8-data-store-1.srv.media.net/172.16.200.124:7781


I couldn't follow the NetworkInputTracker behaviour you mentioned. Can you
elaborate on that? 

My understanding is that the master picks any one of the worker nodes for
the connection and stays with it for as long as the program runs. Why is it not
using the host and port I am providing? Also, does the host and port
necessarily have to belong to a worker node? 




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-FlumeInputDStream-in-spark-cluster-tp1604p2987.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: How to use FlumeInputDStream in spark cluster?

2014-03-21 Thread anoldbrain
Hi,

This is my summary of the gap between expected behavior and actual behavior.

FlumeEventCount spark://spark_master_hostname:7077 address port

Expected: an 'agent' listening on (bound to) address:port. In the context
of Spark, this agent should be running on one of the slaves, specifically the
slave whose IP/hostname is address.

Observed: a random slave is chosen from the pool of available slaves.
Therefore, in a cluster environment, it is likely not the slave with the
actual address, which in turn causes the 'Failed to bind to ...' error. This
follows naturally, because the slave that runs the code to bind to
address:port has a different IP.
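
To make the mismatch concrete, here is roughly what the example sets up (a
sketch only, assuming the FlumeUtils API of the external flume module; the
host and port values are placeholders). The host passed to createStream is
where the receiver tries to bind, but the receiver task itself is placed on
whichever worker the scheduler picks:

    // Sketch: push-based Flume receiver. The bind only succeeds if the worker
    // that ends up running the receiver actually owns the "host" address below.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils

    val conf = new SparkConf().setAppName("FlumeEventCount")
    val ssc  = new StreamingContext(conf, Seconds(2))

    // The Flume avro sink must point at exactly this host:port pair.
    val stream = FlumeUtils.createStream(ssc, "worker_hostname", 7781)
    stream.count().map(cnt => "Received " + cnt + " flume events.").print()

    ssc.start()
    ssc.awaitTermination()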



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-FlumeInputDStream-in-spark-cluster-tp1604p2990.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Re: How to use FlumeInputDStream in spark cluster?

2014-03-21 Thread Ravi Hemnani
On 03/21/2014 06:17 PM, anoldbrain [via Apache Spark User List] wrote:
 the actual address, which in turn causes the 'Failed to bind to ...' 
 error. This comes naturally because the slave that is running the code 
 to bind to address:port has a different IP. 
So if we run the code on the slave to which we are sending the data with the 
Flume agent, it should work. Let me give this a shot and check what 
is happening.

Thank you for the immediate reply. I'll keep you posted.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-FlumeInputDStream-in-spark-cluster-tp1604p2992.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: How to use FlumeInputDStream in spark cluster?

2014-03-21 Thread Ravi Hemnani
On 03/21/2014 06:17 PM, anoldbrain [via Apache Spark User List] wrote:
 the actual address, which in turn causes the 'Failed to bind to ...' 
 error. This comes naturally because the slave that is running the code 
 to bind to address:port has a different IP. 
I ran sudo ./run-example 
org.apache.spark.streaming.examples.FlumeEventCount 
spark://spark_master_hostname:7077 worker_hostname 7781 on 
worker_hostname, and it still shows

14/03/21 13:12:12 ERROR scheduler.NetworkInputTracker: De-registered 
receiver for network stream 0 with message 
org.jboss.netty.channel.ChannelException: Failed to bind 
to:worker_hostname /worker_ipaddress:7781
14/03/21 13:12:12 INFO spark.SparkContext: Job finished: runJob at 
NetworkInputTracker.scala:182, took 0.530447982 s
14/03/21 13:12:14 INFO scheduler.NetworkInputTracker: Stream 0 received 
0 blocks

Weird issue. I need to set up Spark Streaming and get it running. I am 
thinking of switching to Kafka. I haven't checked it yet, but I don't see a 
workaround for this. Any help would be appreciated. I am making changes in 
flume.conf and trying different settings.

Thank you.





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-FlumeInputDStream-in-spark-cluster-tp1604p2993.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: How to use FlumeInputDStream in spark cluster?

2014-03-21 Thread anoldbrain
It is my understanding that there is no way to make FlumeInputDStream work in
a cluster environment with the current release. Switching to Kafka, if you can,
would be my suggestion, although I have not used KafkaInputDStream. There is
a big difference between the Kafka and Flume InputDStreams: KafkaInputDStreams
are consumers (clients), whereas FlumeInputDStream needs to listen on a
specific address:port so that other Flume agents can send messages to it. This
may also give Kafka an advantage in performance.
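
For anyone going that route, a minimal sketch of the consumer-style setup
(assuming the KafkaUtils API of the external kafka module; the ZooKeeper
quorum, group id, and topic name below are placeholders). Because the receiver
connects out to the brokers rather than opening a listening socket, it does
not matter which worker it lands on:

    // Sketch: pull-based Kafka receiver; no listening port is opened on the
    // worker, so receiver placement does not matter the way it does for Flume.
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("KafkaEventCount")
    val ssc  = new StreamingContext(conf, Seconds(2))

    // ZooKeeper quorum, group id, and topic map are placeholders.
    val stream = KafkaUtils.createStream(
      ssc, "zk_host:2181", "spark-consumer-group", Map("events" -> 1))
    stream.count().map(cnt => "Received " + cnt + " kafka events.").print()

    ssc.start()
    ssc.awaitTermination()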



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-use-FlumeInputDStream-in-spark-cluster-tp1604p2994.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.