Re: [Spark Streaming] Dynamic Broadcast Variable Update

2017-05-05 Thread Pierce Lamb
Hi Nipun,

To expand a bit, you might find this stackoverflow answer useful:

http://stackoverflow.com/a/39753976/3723346

Most Spark + database combinations can handle a use case like this.
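As an illustration of the general pattern (not any specific product's API): keep the model in an external store that both the trainer and the streaming job can reach, and have the job check for a newer version before each micro-batch. The class and method names below are hypothetical, and the in-process store stands in for a real database; this is a minimal Spark-free sketch, not a definitive implementation.

```python
import threading

class ModelStore:
    """Stand-in for an external database/key-value store holding
    versioned models. (Hypothetical helper for illustration only.)"""
    def __init__(self):
        self._lock = threading.Lock()
        self._version = 0
        self._model = None

    def publish(self, model):
        # The trainer calls this whenever a new model is ready.
        with self._lock:
            self._version += 1
            self._model = model

    def latest(self):
        with self._lock:
            return self._version, self._model

class RefreshingModel:
    """Checks the store before each micro-batch and swaps in a newer
    model if one has been published since the last check."""
    def __init__(self, store):
        self._store = store
        self._version, self._model = store.latest()

    def current(self):
        version, model = self._store.latest()
        if version != self._version:
            self._version, self._model = version, model
        return self._model

store = ModelStore()
store.publish({"threshold": 0.9})
ref = RefreshingModel(store)
print(ref.current())                 # {'threshold': 0.9}
store.publish({"threshold": 0.8})    # trainer publishes a new model
print(ref.current())                 # {'threshold': 0.8}
```

The same shape works with Redis, Cassandra, a JDBC table, etc., as long as the version check per batch is cheap.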

Hope this helps,

Pierce

On Thu, May 4, 2017 at 9:18 AM, Gene Pang  wrote:

> As Tim pointed out, Alluxio (renamed from Tachyon) may be able to help
> you. There is documentation on how to run Alluxio and Spark together, and
> a blog post on a Spark Streaming + Alluxio use case.
>
> Hope that helps,
> Gene
>
> On Tue, May 2, 2017 at 11:56 AM, Nipun Arora 
> wrote:
>
>> Hi All,
>>
>> To support our Spark Streaming based anomaly detection tool, we have made
>> a patch in Spark 1.6.2 to dynamically update broadcast variables.
>>
>> I'll first explain our use case, which I believe should be common to
>> many Spark Streaming applications. Broadcast variables are often used to
>> store values such as machine learning models, which can then be applied
>> to streaming data to produce the desired results (in our case,
>> anomalies). Unfortunately, in current Spark, broadcast variables are
>> final and can only be initialized once, before the streaming context is
>> initialized. Hence, if a new model is learned, the streaming system
>> cannot be updated without shutting the application down, broadcasting
>> again, and restarting it. Our goal was to re-broadcast variables without
>> requiring any downtime of the streaming service.
>>
>> The key to this implementation is a live re-broadcastVariable()
>> interface, which can be triggered between micro-batch executions without
>> any reboot of the streaming application. At a high level, the update
>> works by re-fetching the broadcast variable's contents from the Spark
>> driver and then re-distributing them to the workers. Micro-batch
>> execution is blocked while the update is made, by taking a lock on the
>> execution. We have already tested this in a prototype deployment of our
>> anomaly detection service and can successfully re-broadcast the
>> variables with no downtime.
>>
>> We would like to integrate these changes into Spark. Can anyone let me
>> know the process for submitting patches/new features to Spark? Also, I
>> understand that the current version of Spark is 2.1, while our changes
>> were made and tested on Spark 1.6.2; will this be a problem?
>>
>> Thanks
>> Nipun
>>
>
>
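The locking scheme Nipun describes (block micro-batch execution while the value is swapped) can be sketched in a few lines. This is a hedged illustration of the idea, not the actual patch: `Rebroadcastable` and its methods are invented names, and a plain lock-guarded reference stands in for Spark's broadcast machinery.

```python
import threading

class Rebroadcastable:
    """Sketch of the re-broadcastVariable() idea: a wrapper holding the
    broadcast value, where update() takes a lock so a micro-batch reading
    the value is never interleaved with a swap. Names are illustrative,
    not Spark's API."""
    def __init__(self, value):
        self._lock = threading.Lock()
        self._value = value

    def value(self):
        with self._lock:
            return self._value

    def update(self, new_value):
        # Blocks until any in-flight read under the lock completes,
        # mirroring "execution is blocked while the update is made".
        with self._lock:
            self._value = new_value

model = Rebroadcastable({"w": 1.0})

def run_micro_batch(records):
    m = model.value()            # read once per batch for consistency
    return [r * m["w"] for r in records]

print(run_micro_batch([1, 2]))   # [1.0, 2.0]
model.update({"w": 2.0})         # live update between batches
print(run_micro_batch([1, 2]))   # [2.0, 4.0]
```

Reading the value once at the start of each batch ensures a single batch never sees two different models.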


Re: [Spark Streaming] Dynamic Broadcast Variable Update

2017-05-04 Thread Gene Pang
As Tim pointed out, Alluxio (renamed from Tachyon) may be able to help you.
There is documentation on how to run Alluxio and Spark together, and a blog
post on a Spark Streaming + Alluxio use case.

Hope that helps,
Gene



Re: [Spark Streaming] Dynamic Broadcast Variable Update

2017-05-02 Thread Tim Smith
One, I think you should take this to the Spark developer list.

Two, I suspect broadcast variables aren't the best solution for the use
case you describe. Maybe an in-memory data/object/file store like Tachyon
is a better fit.
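One way Tim's suggestion could look in practice: keep the model in a shared store and have executors reload it only when it changes. Here a local file stands in for Alluxio/Tachyon, and a modification-time check serves as the cheap per-batch staleness test; the class, path, and JSON format are all assumptions for the sketch.

```python
import json
import os
import tempfile

class FileBackedModel:
    """Sketch of the shared-store alternative: reload the model from a
    shared path (standing in for Alluxio/Tachyon) only when the file's
    modification time changes. Illustrative, not a real Alluxio client."""
    def __init__(self, path):
        self._path = path
        self._mtime = None
        self._model = None

    def current(self):
        mtime = os.path.getmtime(self._path)
        if mtime != self._mtime:          # cheap staleness check per batch
            with open(self._path) as f:
                self._model = json.load(f)
            self._mtime = mtime
        return self._model

path = os.path.join(tempfile.mkdtemp(), "model.json")
with open(path, "w") as f:
    json.dump({"threshold": 0.9}, f)

m = FileBackedModel(path)
print(m.current())    # {'threshold': 0.9}
```

With a memory-speed store like Alluxio, the reload cost stays low enough to run every micro-batch.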

Thanks,

Tim




