Spark streaming on standalone cluster

2015-06-30 Thread Borja Garrido Bear
Hi all,

I'm running a Spark standalone cluster with one master and one slave
(different machines, both on version 1.4.0). The thing is, I have a Spark
Streaming job that gets data from Kafka and then just prints it.

To configure the cluster I just started the master and then the slaves
pointing to it; since everything appears in the web interface I assumed
everything was fine, but maybe I missed some configuration.

When I run it locally there is no problem; it works.
When I run it in the cluster the worker state appears as "loading":
 - If the job is a Scala one, when I stop it I receive all the output.
 - If the job is Python, when I stop it I receive a bunch of these
exceptions:

\\\

ERROR JobScheduler: Error running job streaming job 143567542 ms.0
py4j.Py4JException: An exception was raised by the Python Proxy. Return
Message: null
at py4j.Protocol.getReturnValue(Protocol.java:417)
at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:113)
at com.sun.proxy.$Proxy14.call(Unknown Source)
at
org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:63)
at
org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)
at
org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
at
org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at
org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34)
at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193)
at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at
org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

\\\

Is there any known issue with Spark Streaming and standalone mode, or with
Python?
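
For reference, the job is essentially a minimal "read from Kafka and print"
sketch like the one below (the master URL, ZooKeeper address, consumer group
and topic name here are placeholders, not the real ones):

\\\

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Placeholder master URL, ZooKeeper quorum, consumer group and topic name.
# The spark-streaming-kafka package also has to be on the classpath at submit time.
conf = SparkConf().setAppName("kafka-print").setMaster("spark://master-host:7077")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 10)  # 10-second batches

# Receiver-based Kafka stream: the receiver itself occupies one core.
stream = KafkaUtils.createStream(ssc, "zk-host:2181", "test-group", {"test-topic": 1})
stream.pprint()

ssc.start()
ssc.awaitTermination()

\\\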


Re: Spark streaming on standalone cluster

2015-06-30 Thread Tathagata Das
How many receivers do you have in the streaming program? You have to have
more cores reserved by your Spark application than the number of receivers.
That would explain why you only receive the output after stopping.
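
As an illustration (hypothetical app name and master URL), "reserving cores"
here just means capping what the application asks the standalone master for,
for example via spark.cores.max or spark-submit's --total-executor-cores:

\\\

from pyspark import SparkConf

# With one receiver, reserve at least 2 cores: one runs the receiver,
# the rest process the batches.
conf = (SparkConf()
        .setAppName("kafka-print")              # hypothetical app name
        .setMaster("spark://master-host:7077")  # placeholder master URL
        .set("spark.cores.max", "2"))           # >= number of receivers + 1

\\\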

TD

On Tue, Jun 30, 2015 at 7:59 AM, Borja Garrido Bear wrote:

> [original message and stack trace quoted in full; snipped]


RE: Spark streaming on standalone cluster

2015-07-01 Thread prajod.vettiyattil
Spark Streaming needs at least two threads on the worker/slave side. I have
seen this issue when (to test the behavior) I set the thread count for Spark
Streaming to 1. It should be at least 2: one for the receiver adapter (Kafka,
Flume, etc.) and the second for processing the data.

But I tested that in local mode: “--master local[2]”. The same issue could
happen on the worker as well. If you set “--master local[1]”, the streaming
worker/slave blocks due to starvation.
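
For example, a minimal local-mode sketch of that test (swap local[2] for
local[1] to reproduce the starvation; host and port are placeholders):

\\\

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# local[1]: the only thread runs the receiver, so no batch ever gets processed.
# local[2]: one thread for the receiver, one for the batch jobs.
conf = SparkConf().setAppName("starvation-test").setMaster("local[2]")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 5)  # 5-second batches

lines = ssc.socketTextStream("localhost", 9999)  # any receiver-based source behaves the same
lines.pprint()

ssc.start()
ssc.awaitTermination()

\\\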

Which conf parameter sets the worker thread count in cluster mode? Is it
spark.akka.threads?

From: Tathagata Das [mailto:t...@databricks.com]
Sent: 01 July 2015 01:32
To: Borja Garrido Bear
Cc: user
Subject: Re: Spark streaming on standalone cluster

[earlier messages quoted in full; snipped]



Re: Spark streaming on standalone cluster

2015-07-01 Thread Borja Garrido Bear
Hi all,

Thanks for the answers. Yes, my problem was that I was using just one worker
with one core, so it was starving and I never got the job to run; now it
seems to be working properly.

One question: is this information in the docs? (Maybe I misread it.)

On Wed, Jul 1, 2015 at 10:30 AM,  wrote:

> [quoted message snipped]

Re: Spark streaming on standalone cluster

2015-07-01 Thread Wojciech Pituła
Hi,
https://spark.apache.org/docs/latest/streaming-programming-guide.html

Points to remember

- When running a Spark Streaming program locally, do not use “local” or
  “local[1]” as the master URL. Either of these means that only one thread
  will be used for running tasks locally. If you are using an input DStream
  based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single
  thread will be used to run the receiver, leaving no thread for processing
  the received data. Hence, when running locally, always use “local[n]” as
  the master URL, where n > number of receivers to run (see Spark Properties
  <https://spark.apache.org/docs/latest/configuration.html#spark-properties>
  for information on how to set the master).

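Concretely, a minimal sketch following that rule (two receiver-based Kafka
streams, so the local master needs at least three threads; hostnames, group
and topic are placeholders):

\\\

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Two receivers pin two threads, so local[3] leaves one thread for processing.
conf = SparkConf().setAppName("multi-receiver").setMaster("local[3]")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 10)

streams = [KafkaUtils.createStream(ssc, "zk-host:2181", "group", {"test-topic": 1})
           for _ in range(2)]
ssc.union(*streams).pprint()  # merge the two receiver streams and print each batch

ssc.start()
ssc.awaitTermination()

\\\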

On Wed, 1 Jul 2015 at 11:25, Borja Garrido Bear wrote:

> [quoted message snipped]