Hi,
https://spark.apache.org/docs/latest/streaming-programming-guide.html

Points to remember

   - When running a Spark Streaming program locally, do not use “local” or
   “local[1]” as the master URL. Either of these means that only one thread
   will be used for running tasks locally. If you are using an input DStream
   based on a receiver (e.g. sockets, Kafka, Flume, etc.), then the single
   thread will be used to run the receiver, leaving no thread for processing
   the received data. Hence, when running locally, always use “local[n]” as
   the master URL, where n > number of receivers to run (see Spark Properties
   <https://spark.apache.org/docs/latest/configuration.html#spark-properties>
   for information on how to set the master).
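A minimal sketch of that rule, assuming a socket text source on
localhost:9999 (host, port, and names here are just illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LocalStreamingDemo {
  def main(args: Array[String]): Unit = {
    // "local[2]" leaves one thread for the socket receiver and one for
    // processing the received batches; "local[1]" would starve the job.
    val conf = new SparkConf().setMaster("local[2]").setAppName("LocalStreamingDemo")
    val ssc = new StreamingContext(conf, Seconds(1))

    // One receiver-based input DStream, so 2 threads (> 1 receiver) suffice.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}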


On Wed, Jul 1, 2015 at 11:25, Borja Garrido Bear <kazebo...@gmail.com>
wrote:

> Hi all,
>
> Thanks for the answers. Yes, my problem was that I was using just one
> worker with one core, so it was starving and the job never got to run; now
> it seems to be working properly.
>
> One question: is this information in the docs? (Maybe I misread it.)
>
> On Wed, Jul 1, 2015 at 10:30 AM, <prajod.vettiyat...@wipro.com> wrote:
>
>>  Spark Streaming needs at least two threads on the worker/slave side. I
>> have seen this issue when (to test the behavior) I set the thread count
>> for Spark Streaming to 1. It should be at least 2: one for the receiver
>> adapter (Kafka, Flume, etc.) and the second for processing the data.
>>
>>
>>
>> But I tested that in local mode: “--master local[2]”. The same issue
>> could happen on a worker too. If you set “--master local[1]”, the
>> streaming worker/slave blocks due to starvation.
>>
>>
>>
>> Which conf parameter sets the worker thread count in cluster mode? Is it
>> spark.akka.threads?
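As far as I know it is not spark.akka.threads (that only sizes Spark's
internal actor system); in standalone mode the cores an application
reserves are capped with spark.cores.max, or spark-submit's
--total-executor-cores. A minimal sketch, with a placeholder master URL:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Sketch: cap the cores this app reserves on a standalone cluster.
// With receiver-based streaming, keep the cap above the receiver count.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077") // placeholder master URL
  .setAppName("StreamingOnStandalone")
  .set("spark.cores.max", "2")           // 1 receiver + 1 core for processing

val ssc = new StreamingContext(conf, Seconds(1))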
>>
>>
>>
>> *From:* Tathagata Das [mailto:t...@databricks.com]
>> *Sent:* 01 July 2015 01:32
>> *To:* Borja Garrido Bear
>> *Cc:* user
>> *Subject:* Re: Spark streaming on standalone cluster
>>
>>
>>
>> How many receivers do you have in the streaming program? Your Spark
>> application has to have more cores reserved than the number of receivers.
>> That would explain why you only see the output after stopping.
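Concretely, each receiver occupies one core for its whole lifetime, so an
app with N receivers needs at least N + 1 cores before any batch can be
processed. A sketch using the receiver-based Kafka API; the quorum, group,
and topic names are made up:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaReceiversDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaReceiversDemo")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Two receivers below, so the app must reserve at least 3 cores.
    val numReceivers = 2
    val streams = (1 to numReceivers).map { _ =>
      // ZooKeeper quorum, consumer group, and topic map are placeholders.
      KafkaUtils.createStream(ssc, "zk-host:2181", "demo-group", Map("events" -> 1))
    }

    // Merge the per-receiver streams and keep only the message values.
    val lines = ssc.union(streams).map(_._2)
    lines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}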
>>
>>
>>
>> TD
>>
>>
>>
>> On Tue, Jun 30, 2015 at 7:59 AM, Borja Garrido Bear <kazebo...@gmail.com>
>> wrote:
>>
>>  Hi all,
>>
>>
>>
>> I'm running a Spark standalone cluster with one master and one slave
>> (different machines, both on version 1.4.0). The thing is, I have a Spark
>> streaming job that gets data from Kafka and then just prints it.
>>
>>
>>
>> To configure the cluster I just started the master and then the slaves
>> pointing to it; as everything appeared in the web interface I assumed
>> everything was fine, but maybe I missed some configuration.
>>
>>
>>
>> When I run it locally there is no problem, it works.
>>
>> When I run it in the cluster, the worker state appears as "loading":
>>
>>  - If the job is a Scala one, when I stop it I receive all the output
>>
>>  - If the job is Python, when I stop it I receive a bunch of these
>> exceptions
>>
>>
>>
>>
>> \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
>>
>>
>>
>> ERROR JobScheduler: Error running job streaming job 1435675420000 ms.0
>>
>> py4j.Py4JException: An exception was raised by the Python Proxy. Return
>> Message: null
>>
>> at py4j.Protocol.getReturnValue(Protocol.java:417)
>>
>> at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:113)
>>
>> at com.sun.proxy.$Proxy14.call(Unknown Source)
>>
>> at
>> org.apache.spark.streaming.api.python.TransformFunction.apply(PythonDStream.scala:63)
>>
>> at
>> org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)
>>
>> at
>> org.apache.spark.streaming.api.python.PythonDStream$$anonfun$callForeachRDD$1.apply(PythonDStream.scala:156)
>>
>> at
>> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)
>>
>> at
>> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
>>
>> at
>> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
>>
>> at
>> org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
>>
>> at
>> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40)
>>
>> at
>> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
>>
>> at
>> org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
>>
>> at scala.util.Try$.apply(Try.scala:161)
>>
>> at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34)
>>
>> at
>> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:193)
>>
>> at
>> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
>>
>> at
>> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:193)
>>
>> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>>
>> at
>> org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:192)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>
>> at java.lang.Thread.run(Thread.java:745)
>>
>>
>>
>>
>> \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
>>
>>
>>
>> Is there any known issue with Spark Streaming and standalone mode? Or
>> with Python?
>>
>>
>>
>
>
