Re: spark streaming with kafka source, how many concurrent jobs?

2017-03-21 Thread shyla deshpande
Thanks TD.

On Tue, Mar 14, 2017 at 4:37 PM, Tathagata Das  wrote:

> This setting allows multiple Spark jobs generated through multiple
> foreachRDD calls to run concurrently, even if they are across batches. So
> output op2 from batch X can run concurrently with op1 of batch X+1.
> This is not safe because it breaks the checkpointing logic in subtle ways.
> Note that this was never documented in the Spark online docs.
>
> On Tue, Mar 14, 2017 at 2:29 PM, shyla deshpande  > wrote:
>
>> Thanks TD for the response. Can you please provide more explanation? I
>> have multiple streams in my Spark Streaming application (Spark 2.0.2
>> using DStreams). I know many people use this setting, so your
>> explanation will help a lot of people.
>>
>> Thanks
>>
>> On Fri, Mar 10, 2017 at 6:24 PM, Tathagata Das 
>> wrote:
>>
>>> That config is not safe. Please do not use it.
>>>
>>> On Mar 10, 2017 10:03 AM, "shyla deshpande" 
>>> wrote:
>>>
 I have a Spark Streaming application which processes 3 Kafka streams
 and has 5 output operations.

 I'm not sure what the setting for spark.streaming.concurrentJobs should be.

 1. If the concurrentJobs setting is 4, does that mean 2 output
 operations will run sequentially?

 2. If I had 6 cores, what would be an ideal setting for concurrentJobs
 in this situation?

 I appreciate your input. Thanks

>>>
>>
>


Re: spark streaming with kafka source, how many concurrent jobs?

2017-03-14 Thread Tathagata Das
This setting allows multiple Spark jobs generated through multiple
foreachRDD calls to run concurrently, even if they are across batches. So
output op2 from batch X can run concurrently with op1 of batch X+1.
This is not safe because it breaks the checkpointing logic in subtle ways.
Note that this was never documented in the Spark online docs.
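
To make the terminology concrete, here is a minimal, hypothetical PySpark sketch (the socket source and the trivial count actions are stand-ins, not taken from the thread). Each foreachRDD call below is one output operation, and the streaming scheduler turns each of them into one Spark job per batch:

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext

# With the default spark.streaming.concurrentJobs = 1, op1 and op2 of batch X
# run one after the other and batch X+1 waits for batch X. A value greater
# than 1 lets op2 of batch X overlap with op1 of batch X+1 -- the cross-batch
# concurrency described above, which is what interacts badly with checkpointing.
conf = SparkConf().setAppName("two-output-ops")
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 10)  # 10-second batches

lines = ssc.socketTextStream("localhost", 9999)  # stand-in for a Kafka stream

lines.foreachRDD(lambda rdd: rdd.count())             # output op1 -> one job per batch
lines.foreachRDD(lambda rdd: rdd.distinct().count())  # output op2 -> a second job per batch

ssc.start()
ssc.awaitTermination()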

On Tue, Mar 14, 2017 at 2:29 PM, shyla deshpande 
wrote:

> Thanks TD for the response. Can you please provide more explanation? I
> have multiple streams in my Spark Streaming application (Spark 2.0.2
> using DStreams). I know many people use this setting, so your
> explanation will help a lot of people.
>
> Thanks
>
> On Fri, Mar 10, 2017 at 6:24 PM, Tathagata Das 
> wrote:
>
>> That config is not safe. Please do not use it.
>>
>> On Mar 10, 2017 10:03 AM, "shyla deshpande" 
>> wrote:
>>
>>> I have a Spark Streaming application which processes 3 Kafka streams and
>>> has 5 output operations.
>>>
>>> I'm not sure what the setting for spark.streaming.concurrentJobs should be.
>>>
>>> 1. If the concurrentJobs setting is 4, does that mean 2 output operations
>>> will run sequentially?
>>>
>>> 2. If I had 6 cores, what would be an ideal setting for concurrentJobs in
>>> this situation?
>>>
>>> I appreciate your input. Thanks
>>>
>>
>


Re: spark streaming with kafka source, how many concurrent jobs?

2017-03-14 Thread shyla deshpande
Thanks TD for the response. Can you please provide more explanation? I
have multiple streams in my Spark Streaming application (Spark 2.0.2
using DStreams). I know many people use this setting, so your
explanation will help a lot of people.

Thanks

On Fri, Mar 10, 2017 at 6:24 PM, Tathagata Das  wrote:

> That config is not safe. Please do not use it.
>
> On Mar 10, 2017 10:03 AM, "shyla deshpande" 
> wrote:
>
>> I have a Spark Streaming application which processes 3 Kafka streams and
>> has 5 output operations.
>>
>> I'm not sure what the setting for spark.streaming.concurrentJobs should be.
>>
>> 1. If the concurrentJobs setting is 4, does that mean 2 output operations
>> will run sequentially?
>>
>> 2. If I had 6 cores, what would be an ideal setting for concurrentJobs in
>> this situation?
>>
>> I appreciate your input. Thanks
>>
>


Re: spark streaming with kafka source, how many concurrent jobs?

2017-03-10 Thread Tathagata Das
That config is not safe. Please do not use it.

On Mar 10, 2017 10:03 AM, "shyla deshpande" 
wrote:

> I have a Spark Streaming application which processes 3 Kafka streams and
> has 5 output operations.
>
> I'm not sure what the setting for spark.streaming.concurrentJobs should be.
>
> 1. If the concurrentJobs setting is 4, does that mean 2 output operations
> will run sequentially?
>
> 2. If I had 6 cores, what would be an ideal setting for concurrentJobs in
> this situation?
>
> I appreciate your input. Thanks
>


spark streaming with kafka source, how many concurrent jobs?

2017-03-10 Thread shyla deshpande
I have a Spark Streaming application which processes 3 Kafka streams and
has 5 output operations.

I'm not sure what the setting for spark.streaming.concurrentJobs should be.

1. If the concurrentJobs setting is 4, does that mean 2 output operations
will run sequentially?

2. If I had 6 cores, what would be an ideal setting for concurrentJobs in
this situation?

I appreciate your input. Thanks
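
As a rough sketch of the layout described above (the topic names, broker address, and the trivial count actions are made up for illustration, and it assumes the spark-streaming-kafka-0-8 Python integration available for Spark 2.0.x): three direct Kafka streams and five output operations, with spark.streaming.concurrentJobs left at its default of 1 since, as the replies above note, higher values are undocumented and unsafe with checkpointing.

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

conf = (SparkConf()
        .setAppName("three-streams-five-output-ops")
        # The setting under discussion, left at the default; values > 1 are
        # undocumented and break checkpointing in subtle ways (see above).
        .set("spark.streaming.concurrentJobs", "1"))
sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 30)  # 30-second batches

kafka_params = {"metadata.broker.list": "broker1:9092"}  # hypothetical broker
events = KafkaUtils.createDirectStream(ssc, ["events"], kafka_params)
clicks = KafkaUtils.createDirectStream(ssc, ["clicks"], kafka_params)
orders = KafkaUtils.createDirectStream(ssc, ["orders"], kafka_params)

# Five output operations -> five jobs per batch interval, executed one at a
# time by the streaming scheduler with the default setting.
events.foreachRDD(lambda rdd: rdd.count())                      # output op 1
events.foreachRDD(lambda rdd: rdd.values().distinct().count())  # output op 2
clicks.foreachRDD(lambda rdd: rdd.count())                      # output op 3
clicks.foreachRDD(lambda rdd: rdd.values().distinct().count())  # output op 4
orders.foreachRDD(lambda rdd: rdd.count())                      # output op 5

ssc.start()
ssc.awaitTermination()

As for matching the value to 6 cores: executor cores bound how many tasks run at once inside a job, while concurrentJobs bounds how many of these output-operation jobs the streaming scheduler runs at once, so the two knobs are largely independent.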


Re: PySpark concurrent jobs using single SparkContext

2015-08-21 Thread Hemant Bhanawat
It seems like you want simultaneous processing of multiple jobs but at the
same time serialization of a few tasks within those jobs. I don't know how
to achieve that in Spark.

But why would you worry about the interleaved processing when the data
being aggregated in different jobs is per customer, per day? Is it that
save_aggregate depends on the results of other customers and/or other
days?

I also don't understand how you would achieve that with YARN, because
interleaving of tasks from separately submitted jobs can happen with
dynamic executor allocation as well.

Hemant


On Thu, Aug 20, 2015 at 7:04 PM, Mike Sukmanowsky 
mike.sukmanow...@gmail.com wrote:

 Hi all,

 We're using Spark 1.3.0 via a small YARN cluster to do some log
 processing. The jobs are pretty simple: for a number of customers and a
 number of days, fetch some event log data, build aggregates, and store
 those aggregates into a data store.

 The way our script is written right now does something akin to:

 with SparkContext() as sc:
     for customer in customers:
         for day in days:
             logs = sc.textFile(get_logs(customer, day))
             aggregate = make_aggregate(logs)
             # This function contains the action saveAsNewAPIHadoopFile which
             # triggers a save
             save_aggregate(aggregate)

 So we have a Spark job per customer, per day.

 I tried doing some parallel job submission with something similar to:

 def make_and_save_aggregate(customer, day, spark_context):
     # Without a separate threading.Lock() here or better yet, one guarding the
     # Spark context, multiple customer/day transformations and actions could
     # be interweaved
     sc = spark_context
     logs = sc.textFile(get_logs(customer, day))
     aggregate = make_aggregate(logs)
     save_aggregate(aggregate)

 with SparkContext() as sc, futures.ThreadPoolExecutor(4) as executor:
     for customer in customers:
         for day in days:
             executor.submit(make_and_save_aggregate, customer, day, sc)

 The problem is, with no locks on a SparkContext except during
 initialization
 (https://github.com/apache/spark/blob/v1.3.0/python/pyspark/context.py#L214-L241)
 and shutdown
 (https://github.com/apache/spark/blob/v1.3.0/python/pyspark/context.py#L296-L307),
 operations on the context could (if I understand correctly) be interleaved,
 leading to a DAG which contains transformations out of order and from
 different customer/day periods.

 One solution is instead to launch multiple Spark jobs via spark-submit and
 let YARN/Spark's dynamic executor allocation take care of fair scheduling.
 In practice, this doesn't seem to yield very fast computation perhaps due
 to some additional overhead with YARN.

 Is there any safe way to launch concurrent jobs like this using a single
 PySpark context?

 --
 Mike Sukmanowsky
 Aspiring Digital Carpenter

 *e*: mike.sukmanow...@gmail.com

 LinkedIn http://www.linkedin.com/profile/view?id=10897143 | github
 https://github.com/msukmanowsky




PySpark concurrent jobs using single SparkContext

2015-08-20 Thread Mike Sukmanowsky
Hi all,

We're using Spark 1.3.0 via a small YARN cluster to do some log processing.
The jobs are pretty simple: for a number of customers and a number of days,
fetch some event log data, build aggregates, and store those aggregates
into a data store.

The way our script is written right now does something akin to:

with SparkContext() as sc:
    for customer in customers:
        for day in days:
            logs = sc.textFile(get_logs(customer, day))
            aggregate = make_aggregate(logs)
            # This function contains the action saveAsNewAPIHadoopFile which
            # triggers a save
            save_aggregate(aggregate)

So we have a Spark job per customer, per day.

I tried doing some parallel job submission with something similar to:

def make_and_save_aggregate(customer, day, spark_context):
    # Without a separate threading.Lock() here or better yet, one guarding the
    # Spark context, multiple customer/day transformations and actions could
    # be interweaved
    sc = spark_context
    logs = sc.textFile(get_logs(customer, day))
    aggregate = make_aggregate(logs)
    save_aggregate(aggregate)

with SparkContext() as sc, futures.ThreadPoolExecutor(4) as executor:
    for customer in customers:
        for day in days:
            executor.submit(make_and_save_aggregate, customer, day, sc)

The problem is, with no locks on a SparkContext except during initialization
(https://github.com/apache/spark/blob/v1.3.0/python/pyspark/context.py#L214-L241)
and shutdown
(https://github.com/apache/spark/blob/v1.3.0/python/pyspark/context.py#L296-L307),
operations on the context could (if I understand correctly) be interleaved,
leading to a DAG which contains transformations out of order and from
different customer/day periods.

One solution is instead to launch multiple Spark jobs via spark-submit and
let YARN/Spark's dynamic executor allocation take care of fair scheduling.
In practice, this doesn't seem to yield very fast computation perhaps due
to some additional overhead with YARN.

Is there any safe way to launch concurrent jobs like this using a single
PySpark context?

-- 
Mike Sukmanowsky
Aspiring Digital Carpenter

*e*: mike.sukmanow...@gmail.com

LinkedIn http://www.linkedin.com/profile/view?id=10897143 | github
https://github.com/msukmanowsky


concurrent jobs

2014-07-18 Thread Haopu Wang
Looking at the code of JobScheduler, I found the parameter below:

 

  private val numConcurrentJobs =
    ssc.conf.getInt("spark.streaming.concurrentJobs", 1)

  private val jobExecutor = Executors.newFixedThreadPool(numConcurrentJobs)

Does that mean each App can have only one active stage?

 

In my pseudo-code below:

 

 S1 = viewDStream.forEach( collect() )..

 S2 = viewDStream.forEach( collect() )..

 

There should be two “collect()” jobs for each batch interval, right? Are they 
running in parallel?

 

Thank you!



Re: launching concurrent jobs programmatically

2014-04-29 Thread ishaaq
Very interesting.

One of Spark's attractive features is being able to do stuff interactively
via spark-shell. Is something like that still available via Ooyala's job
server?

Or do you use spark-shell independently of that? If the latter, how do you
manage custom jars for spark-shell? Our app has a number of jars that I
don't particularly want to have to upload each time I want to run a small
ad-hoc spark-shell session.

Thanks,
Ishaaq





Re: launching concurrent jobs programmatically

2014-04-28 Thread Andrew Ash
For the second question, you can submit multiple jobs through the same
SparkContext via different threads and this is a supported way of
interacting with Spark.

From the documentation:

Second, *within* each Spark application, multiple “jobs” (Spark actions)
may be running concurrently if they were submitted by different threads.
This is common if your application is serving requests over the network;
for example, the Shark (http://shark.cs.berkeley.edu/) server works this
way. Spark includes a fair scheduler
(https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application)
to schedule resources within each SparkContext.

 https://spark.apache.org/docs/latest/job-scheduling.html
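
To make that concrete, here is a small sketch of the pattern (hypothetical job bodies, written in PySpark simply because that is the API used elsewhere in this archive): two driver-side threads share one SparkContext, and the actions they trigger run as concurrent jobs. Setting spark.scheduler.mode to FAIR is optional and only changes how resources are shared between those jobs.

import threading
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setAppName("multithreaded-job-submission")
        # Optional: share resources fairly between concurrently running jobs
        # instead of the default FIFO behaviour.
        .set("spark.scheduler.mode", "FAIR"))
sc = SparkContext(conf=conf)

def sum_job():
    # Each action submitted from this thread becomes its own Spark job.
    print("sum: %d" % sc.parallelize(range(100000)).sum())

def even_count_job():
    evens = sc.parallelize(range(100000)).filter(lambda x: x % 2 == 0)
    print("evens: %d" % evens.count())

threads = [threading.Thread(target=sum_job), threading.Thread(target=even_count_job)]
for t in threads:
    t.start()
for t in threads:
    t.join()

sc.stop()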


On Tue, Apr 29, 2014 at 1:39 AM, ishaaq ish...@gmail.com wrote:

 Hi all,

 I have a central app that currently kicks off old-style Hadoop M/R jobs
 either on-demand or via a scheduling mechanism.

 My intention is to gradually port this app over to using a Spark standalone
 cluster. The data will remain on HDFS.

 Couple of questions:

 1. Is there a way to get Spark jobs to load from jars that have been
 pre-distributed to HDFS? I need to run these jobs programmatically from
 said
 application.

 2. Is SparkContext meant to be used in multi-threaded use-cases? i.e. can
 multiple independent jobs run concurrently using the same SparkContext or
 should I create a new one each time my app needs to run a job?

 Thanks,
 Ishaaq






Re: launching concurrent jobs programmatically

2014-04-28 Thread Patrick Wendell
In general, as Andrew points out, it's possible to submit jobs from
multiple threads, and many Spark applications do this.

One thing to check out is the job server from Ooyala; it's an application
on top of Spark that has an automated submission API:
https://github.com/ooyala/spark-jobserver

You can also accomplish this by just having a separate service that submits
multiple jobs to a cluster, where those jobs, e.g., use different jars.

- Patrick


On Mon, Apr 28, 2014 at 4:44 PM, Andrew Ash and...@andrewash.com wrote:

 For the second question, you can submit multiple jobs through the same
 SparkContext via different threads and this is a supported way of
 interacting with Spark.

 From the documentation:

 Second, *within* each Spark application, multiple "jobs" (Spark actions)
 may be running concurrently if they were submitted by different threads.
 This is common if your application is serving requests over the network;
 for example, the Shark (http://shark.cs.berkeley.edu/) server works this
 way. Spark includes a fair scheduler
 (https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application)
 to schedule resources within each SparkContext.

  https://spark.apache.org/docs/latest/job-scheduling.html


 On Tue, Apr 29, 2014 at 1:39 AM, ishaaq ish...@gmail.com wrote:

 Hi all,

 I have a central app that currently kicks off old-style Hadoop M/R jobs
 either on-demand or via a scheduling mechanism.

 My intention is to gradually port this app over to using a Spark
 standalone
 cluster. The data will remain on HDFS.

 Couple of questions:

 1. Is there a way to get Spark jobs to load from jars that have been
 pre-distributed to HDFS? I need to run these jobs programmatically from
 said
 application.

 2. Is SparkContext meant to be used in multi-threaded use-cases? i.e. can
 multiple independent jobs run concurrently using the same SparkContext or
 should I create a new one each time my app needs to run a job?

 Thanks,
 Ishaaq


