Re: spark streaming with kafka source, how many concurrent jobs?
Thanks TD.
Re: spark streaming with kafka source, how many concurrent jobs?
This setting allows multiple Spark jobs generated through multiple foreachRDD calls to run concurrently, even if they are across batches. So output op2 from batch X can run concurrently with op1 of batch X+1.

This is not safe because it breaks the checkpointing logic in subtle ways. Note that this was never documented in the Spark online docs.
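For concreteness, a minimal PySpark sketch of the kind of app this setting affects (the app name and the socket source are hypothetical stand-ins for the Kafka streams in the question; the config is shown only to illustrate what it would overlap, not as a recommendation, given the caveat above):

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext

    conf = (SparkConf()
            .setAppName("concurrent-jobs-illustration")
            # Undocumented and unsafe per the note above; shown only to
            # illustrate the semantics. With the default of 1, output
            # operations run one at a time, in order, per batch.
            .set("spark.streaming.concurrentJobs", "2"))
    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 10)  # 10-second batches

    lines = ssc.socketTextStream("localhost", 9999)  # stand-in source

    # Two output operations => two Spark jobs per batch. With
    # concurrentJobs=2, op2 of batch X may run alongside op1 of batch X+1.
    lines.foreachRDD(lambda rdd: rdd.count())   # output op 1
    lines.foreachRDD(lambda rdd: rdd.take(5))   # output op 2

    ssc.start()
    ssc.awaitTermination()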
Re: spark streaming with kafka source, how many concurrent jobs?
Thanks TD for the response. Can you please provide more explanation? I have multiple streams in the Spark Streaming application (Spark 2.0.2 using DStreams). I know many people who use this setting, so your explanation will help a lot of people.

Thanks
Re: spark streaming with kafka source, how many concurrent jobs?
That config is not safe. Please do not use it.
spark streaming with kafka source, how many concurrent jobs?
I have a Spark Streaming application which processes 3 Kafka streams and has 5 output operations. I am not sure what the setting for spark.streaming.concurrentJobs should be.

1. If the concurrentJobs setting is 4, does that mean 2 output operations will be run sequentially?

2. If I had 6 cores, what would be an ideal setting for concurrentJobs in this situation?

I appreciate your input. Thanks
Re: PySpark concurrent jobs using single SparkContext
It seems like you want simultaneous processing of multiple jobs but, at the same time, serialization of a few tasks within those jobs. I don't know how to achieve that in Spark. But why would you worry about the interleaved processing when the data being aggregated in different jobs is per customer, per day? Is it that save_aggregate depends on the results of other customers and/or other days?

I also don't understand how you would achieve that with YARN, because interleaving of tasks from separately submitted jobs can happen with dynamic executor allocation as well.

Hemant
PySpark concurrent jobs using single SparkContext
Hi all,

We're using Spark 1.3.0 via a small YARN cluster to do some log processing. The jobs are pretty simple: for a number of customers and a number of days, fetch some event log data, build aggregates, and store those aggregates in a data store. The way our script is written right now does something akin to:

    with SparkContext() as sc:
        for customer in customers:
            for day in days:
                logs = sc.textFile(get_logs(customer, day))
                aggregate = make_aggregate(logs)
                # This function contains the action saveAsNewAPIHadoopFile,
                # which triggers a save
                save_aggregate(aggregate)

So we have a Spark job per customer, per day. I tried doing some parallel job submission with something similar to:

    def make_and_save_aggregate(customer, day, spark_context):
        # Without a separate threading.Lock() here or, better yet, one
        # guarding the Spark context, multiple customer/day transformations
        # and actions could be interleaved
        sc = spark_context
        logs = sc.textFile(get_logs(customer, day))
        aggregate = make_aggregate(logs)
        save_aggregate(aggregate)

    with SparkContext() as sc, futures.ThreadPoolExecutor(4) as executor:
        for customer in customers:
            for day in days:
                executor.submit(make_and_save_aggregate, customer, day, sc)

The problem is, with no locks on a SparkContext except during initialization (https://github.com/apache/spark/blob/v1.3.0/python/pyspark/context.py#L214-L241) and shutdown (https://github.com/apache/spark/blob/v1.3.0/python/pyspark/context.py#L296-L307), operations on the context could (if I understand correctly) be interleaved, leading to a DAG that contains transformations out of order and from different customer/day periods.

One solution is instead to launch multiple Spark jobs via spark-submit and let YARN/Spark's dynamic executor allocation take care of fair scheduling. In practice, this doesn't seem to yield very fast computation, perhaps due to some additional overhead with YARN.

Is there any safe way to launch concurrent jobs like this using a single PySpark context?

--
Mike Sukmanowsky
Aspiring Digital Carpenter
e: mike.sukmanow...@gmail.com
LinkedIn http://www.linkedin.com/profile/view?id=10897143 | github https://github.com/msukmanowsky
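For reference, a sketch of the lock-guarded variant the comment in the second snippet alludes to, assuming the same helpers (get_logs, make_aggregate, save_aggregate) and the same customers/days collections as in the script above. Note that holding the lock through the action serializes everything, which buys safety back by giving up nearly all of the intended concurrency:

    import threading
    from concurrent import futures
    from pyspark import SparkContext

    sc_lock = threading.Lock()

    def make_and_save_aggregate(customer, day, sc):
        # Serialize both driver-side DAG construction and the save action,
        # so transformations from different customer/day pairs cannot be
        # interleaved. Effectively one job at a time.
        with sc_lock:
            logs = sc.textFile(get_logs(customer, day))
            aggregate = make_aggregate(logs)
            save_aggregate(aggregate)

    with SparkContext() as sc, futures.ThreadPoolExecutor(4) as executor:
        for customer in customers:
            for day in days:
                executor.submit(make_and_save_aggregate, customer, day, sc)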
concurrent jobs
By looking at the code of JobScheduler, I found this parameter:

    private val numConcurrentJobs = ssc.conf.getInt("spark.streaming.concurrentJobs", 1)
    private val jobExecutor = Executors.newFixedThreadPool(numConcurrentJobs)

Does that mean each app can have only one active stage? In my pseudo-code below:

    S1 = viewDStream.forEach( collect() )..
    S2 = viewDStream.forEach( collect() )..

There should be two collect() jobs for each batch interval, right? Are they running in parallel? Thank you!
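A minimal sketch of how that config would be set, assuming a PySpark driver (it can equivalently be passed as spark-submit --conf spark.streaming.concurrentJobs=2). With the default of 1, the two collect() jobs of a batch run one after the other; with 2, they may run in parallel, subject to the safety caveats discussed in the thread above:

    from pyspark import SparkConf

    # Sizes the JobScheduler thread pool shown in the snippet above.
    conf = SparkConf().set("spark.streaming.concurrentJobs", "2")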
Re: launching concurrent jobs programmatically
Very interesting. One of Spark's attractive features is being able to do stuff interactively via spark-shell. Is something like that still available via Ooyala's job server? Or do you use spark-shell independently of that? If the latter, how do you manage custom jars for spark-shell? Our app has a number of jars that I don't particularly want to have to upload each time I want to run a small ad-hoc spark-shell session.

Thanks,
Ishaaq
Re: launching concurrent jobs programmatically
For the second question, you can submit multiple jobs through the same SparkContext via different threads, and this is a supported way of interacting with Spark. From the documentation (https://spark.apache.org/docs/latest/job-scheduling.html):

> Second, *within* each Spark application, multiple "jobs" (Spark actions) may be running concurrently if they were submitted by different threads. This is common if your application is serving requests over the network; for example, the Shark (http://shark.cs.berkeley.edu/) server works this way. Spark includes a fair scheduler (https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application) to schedule resources within each SparkContext.

On Tue, Apr 29, 2014 at 1:39 AM, ishaaq <ish...@gmail.com> wrote:
> Hi all,
> I have a central app that currently kicks off old-style Hadoop M/R jobs either on demand or via a scheduling mechanism. My intention is to gradually port this app over to using a Spark standalone cluster. The data will remain on HDFS.
>
> Couple of questions:
>
> 1. Is there a way to get Spark jobs to load from jars that have been pre-distributed to HDFS? I need to run these jobs programmatically from said application.
>
> 2. Is SparkContext meant to be used in multi-threaded use cases? i.e. can multiple independent jobs run concurrently using the same SparkContext, or should I create a new one each time my app needs to run a job?
>
> Thanks,
> Ishaaq
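A minimal PySpark sketch of the threaded pattern the documentation describes. The app name and pool names are made up; pools beyond the default generally also need a fair-scheduler allocation file, and how cleanly Python threads map onto the JVM-side local properties has varied across Spark versions, so treat this as a starting point rather than a recipe:

    import threading
    from pyspark import SparkConf, SparkContext

    # Threads share one SparkContext; each action they trigger becomes its
    # own Spark job, and the fair scheduler arbitrates between them.
    conf = (SparkConf()
            .setAppName("threaded-jobs")
            .set("spark.scheduler.mode", "FAIR"))
    sc = SparkContext(conf=conf)

    def run_job(pool_name, n):
        # Pool assignment is a per-thread local property.
        sc.setLocalProperty("spark.scheduler.pool", pool_name)
        print(sc.parallelize(range(n)).map(lambda x: x * x).sum())

    threads = [threading.Thread(target=run_job, args=("pool-%d" % i, 100000))
               for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    sc.stop()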
Re: launching concurrent jobs programmatically
In general, as Andrew points out, it's possible to submit jobs from multiple threads, and many Spark applications do this. One thing to check out is the job server from Ooyala; it is an application on top of Spark that has an automated submission API: https://github.com/ooyala/spark-jobserver

You can also accomplish this by just having a separate service that submits multiple jobs to a cluster, where those jobs e.g. use different jars.

- Patrick