Re: Advice on multiple streaming job

2018-05-08 Thread Peter Liu
Hi Dhaval,

I'm using the YARN scheduler (without needing to specify the port in the
submit), so I'm not sure why the port issue arises here.

Gerard seems to have a good point here about managing the multiple topics
within your application (to avoid the port issue). I'm not sure whether
you're using Spark Streaming or Spark Structured Streaming (see the
developer links for Spark 2.2.0 below; the same applies to the latest
version).

https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

I'm new to Spark Streaming, and I was curious what the reason is in your
case for having to run multiple Spark services. Is this because of the
"fact" (just my question) that each service can only maintain one DStream?

I'm reading the following part of the guide above (
https://spark.apache.org/docs/2.2.0/streaming-programming-guide.html ) and
was wondering whether splitting one topic into multiple topics would be
good practice to enable multiple DStreams, and thus better parallelism in
the data processing:

Quote: "Note that each input DStream creates a single receiver (running on
a worker machine) that receives a single stream of data."
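
To make the question concrete, here is a rough, untested sketch of what I
mean with the direct Kafka API from spark-streaming-kafka-0-10 (topic names,
broker address, group id and batch interval are all placeholders). One
caveat I'm aware of: the direct approach doesn't use receivers at all and
its parallelism follows the Kafka partitions, so the receiver note above
applies to receiver-based input streams.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val conf = new SparkConf().setAppName("multi-dstream-sketch")
val ssc = new StreamingContext(conf, Seconds(10))   // placeholder interval

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker:9092",             // placeholder
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "multi-dstream-sketch")

// One DStream per topic, all within the same StreamingContext (one driver,
// one UI port), then unioned into a single stream downstream.
val streams = Seq("topicA", "topicB").map { topic =>
  KafkaUtils.createDirectStream[String, String](
    ssc, PreferConsistent, Subscribe[String, String](Seq(topic), kafkaParams))
    .map(_.value())
}
ssc.union(streams).print()

ssc.start()
ssc.awaitTermination()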

Any comment from you guys would be much appreciated!

Cheers,

Peter


Re: Advice on multiple streaming job

2018-05-07 Thread Dhaval Modi
Hi Gerard,

Our source is Kafka, and we are using the standard streaming API (DStreams).

Our requirement: we have hundreds of Kafka topics, and each topic sends
different messages in (complex) JSON format. Topics are structured per
domain, so each topic is independent of the others. These JSON messages
need to be flattened and stored in Hive.

For these hundreds of topics, we currently have hundreds of jobs running
independently, each using a different UI port.
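
If we were to consolidate along those lines, I imagine it would look
roughly like this untested sketch (assuming a StreamingContext ssc and a
kafkaParams map are already set up; the topic pattern and the routing are
placeholders for our domain logic):

import java.util.regex.Pattern
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.SubscribePattern

// One DStream covering every domain topic instead of one job per topic.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent,
  SubscribePattern[String, String](Pattern.compile("domain\\..*"), kafkaParams))

stream.foreachRDD { rdd =>
  // record.topic() identifies the source topic, so each JSON payload can be
  // routed to the flattening logic and Hive table for its domain.
  val byTopic = rdd.map(record => (record.topic(), record.value()))
  // ... flatten JSON per domain and write to Hive here ...
}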



Regards,
Dhaval Modi
dhavalmod...@gmail.com


Re: Advice on multiple streaming job

2018-05-07 Thread Gerard Maas
Dhaval,

Which Streaming API are you using?
In Structured Streaming, you are able to start several streaming queries
within the same context.
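
A minimal, untested sketch of what I mean (topic names, broker address,
sink format and paths are placeholders); all the queries share one
SparkSession and therefore one driver and one UI port:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("multi-query-sketch").getOrCreate()

// One streaming query per topic, all running in the same session/driver.
val queries = Seq("topicA", "topicB").map { topic =>
  spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   // placeholder
    .option("subscribe", topic)
    .load()
    .selectExpr("CAST(value AS STRING) AS json")
    .writeStream
    .format("parquet")                                  // placeholder sink
    .option("path", s"/data/$topic")
    .option("checkpointLocation", s"/checkpoints/$topic")
    .start()
}

spark.streams.awaitAnyTermination()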

kind regards, Gerard.


Re: Advice on multiple streaming job

2018-05-06 Thread Dhaval Modi
Hi Susan,

Thanks for your response.

Will try the configuration as suggested.

But I am still looking for an answer: does Spark support running multiple
jobs on the same port?


Re: Advice on multiple streaming job

2018-05-06 Thread Dhaval Modi
Hi vincent,

Thanks for your response.

We are using YARN, and CNI may not be possible.

Thanks & Regards,
Dhaval


Re: Advice on multiple streaming job

2018-05-06 Thread Susan X. Huynh
Hi Dhaval,

Not sure if you have considered this: the port 4040 sounds like a driver UI
port. By default it will try up to 4056, but you can increase that number
with "spark.port.maxRetries". (
https://spark.apache.org/docs/latest/configuration.html) Try setting it to
"32". This would help if the only conflict is among the driver UI ports
(like if you have > 16 drivers running on the same host).
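
For example (the value 32 is just an illustration), either pass
--conf spark.port.maxRetries=32 to spark-submit, or set it programmatically
before creating the context:

import org.apache.spark.SparkConf

// With maxRetries=32 the driver UI would probe ports 4040 through 4072
// (the default of 16 retries gives 4040 through 4056) before giving up.
val conf = new SparkConf()
  .set("spark.port.maxRetries", "32")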

Susan

-- 
Susan X. Huynh
Software engineer, Data Agility
xhu...@mesosphere.com


Re: Advice on multiple streaming job

2018-05-06 Thread vincent gromakowski
Use a scheduler that abstracts the network away, for instance with a CNI,
or other mechanisms (Mesos, Kubernetes, YARN). The CNI will allow you to
always bind to the same ports, because each container will have its own IP.
Other solutions like Mesos with Marathon can work without a CNI, using host
IP binding, but will manage the ports for you, ensuring there isn't any
conflict.


Advice on multiple streaming job

2018-05-05 Thread Dhaval Modi
Hi All,

Need advice on executing multiple streaming jobs.

Problem: We have hundreds of streaming jobs, and every streaming job uses a
new port. Spark automatically tries ports 4040 through 4056 and fails after
that. One workaround is to provide the port explicitly.
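
For reference, pinning the port looks roughly like this (the port number is
just an example), either via --conf spark.ui.port=4140 on spark-submit or
in code:

import org.apache.spark.SparkConf

// Example only: pin this job's UI to a fixed port instead of probing 4040+.
val conf = new SparkConf()
  .set("spark.ui.port", "4140")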

Is there a way to tackle this situation, or am I missing anything?

Thank you in advance.

Regards,
Dhaval Modi
dhavalmod...@gmail.com