Multi DC Spark

Jason Nerothin Tue, 19 Apr 2016 07:26:03 -0700
It the main concern uptime or disaster recovery?

> On Apr 19, 2016, at 9:12 AM, Cody Koeninger <c...@koeninger.org> wrote:
> 
> I think the bigger question is what happens to Kafka and your downstream data 
> store when DC2 crashes.
> 
> From a Spark point of view, starting up a post-crash job in a new data center 
> isn't really different from starting up a post-crash job in the original data 
> center.
> 
> On Tue, Apr 19, 2016 at 3:32 AM, Erwan ALLAIN <eallain.po...@gmail.com 
> <mailto:eallain.po...@gmail.com>> wrote:
> Thanks Jason and Cody. I'll try to explain a bit better the Multi DC case.
> 
> As I mentionned before, I'm planning to use one kafka cluster and 2 or more 
> spark cluster distinct. 
> 
> Let's say we have the following DCs configuration in a nominal case. 
> Kafka partitions are consumed uniformly by the 2 datacenters.
> 
> DataCenter    Spark   Kafka Consumer Group    Kafka partition (P1 to P4)
> DC 1  Master 1.1      
> 
> Worker 1.1    my_group        P1
> Worker 1.2    my_group        P2
> DC 2  Master 2.1      
> 
> Worker 2.1    my_group        P3
> Worker 2.2    my_group        P4
> 
> I would like, in case of DC crash, a rebalancing of partition on the healthy 
> DC, something as follow
> 
> DataCenter    Spark   Kafka Consumer Group    Kafka partition (P1 to P4)
> DC 1  Master 1.1      
> 
> Worker 1.1    my_group        P1, P3
> Worker 1.2    my_group        P2, P4
> DC 2  Master 2.1      
> 
> Worker 2.1    my_group        P3
> Worker 2.2    my_group        P4
> 
> I would like to know if it's possible:
> - using consumer group ?
> - using direct approach ? I prefer this one as I don't want to activate WAL.
> 
> Hope the explanation is better !
> 
> 
> On Mon, Apr 18, 2016 at 6:50 PM, Cody Koeninger <c...@koeninger.org 
> <mailto:c...@koeninger.org>> wrote:
> The current direct stream only handles exactly the partitions
> specified at startup.  You'd have to restart the job if you changed
> partitions.
> 
> https://issues.apache.org/jira/browse/SPARK-12177 
> <https://issues.apache.org/jira/browse/SPARK-12177> has the ongoing work
> towards using the kafka 0.10 consumer, which would allow for dynamic
> topicparittions
> 
> Regarding your multi-DC questions, I'm not really clear on what you're saying.
> 
> On Mon, Apr 18, 2016 at 11:18 AM, Erwan ALLAIN <eallain.po...@gmail.com 
> <mailto:eallain.po...@gmail.com>> wrote:
> > Hello,
> >
> > I'm currently designing a solution where 2 distinct clusters Spark (2
> > datacenters) share the same Kafka (Kafka rack aware or manual broker
> > repartition).
> > The aims are
> > - preventing DC crash: using kafka resiliency and consumer group mechanism
> > (or else ?)
> > - keeping consistent offset among replica (vs mirror maker,which does not
> > keep offset)
> >
> > I have several questions
> >
> > 1) Dynamic repartition (one or 2 DC)
> >
> > I'm using KafkaDirectStream which map one partition kafka with one spark. Is
> > it possible to handle new or removed partition ?
> > In the compute method, it looks like we are always using the currentOffset
> > map to query the next batch and therefore it's always the same number of
> > partition ? Can we request metadata at each batch ?
> >
> > 2) Multi DC Spark
> >
> > Using Direct approach, a way to achieve this would be
> > - to "assign" (kafka 0.9 term) all topics to the 2 sparks
> > - only one is reading the partition (Check every x interval, "lock" stored
> > in cassandra for instance)
> >
> > => not sure if it works just an idea
> >
> > Using Consumer Group
> > - CommitOffset manually at the end of the batch
> >
> > => Does spark handle partition rebalancing ?
> >
> > I'd appreciate any ideas ! Let me know if it's not clear.
> >
> > Erwan
> >
> >
> 
>
Re: Spark Streaming / Kafka Direct Approach: Dynamic Partitionning / Multi DC Spark

Reply via email to