First of all, if you are running batches of 15 minutes and you don't need
second-level latencies, it might just be easier to run batch jobs in a for
loop - you will have greater control over what is going on.
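Roughly something like this untested sketch, against the Spark 1.x
spark-streaming-kafka (0.8) API - the broker address, topic name, and the
offset helpers are placeholders for whatever bookkeeping you actually use
(ZooKeeper, a database, HDFS):

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

object KafkaBatchLoop {
  // Hypothetical offset bookkeeping - persist these yourself and commit
  // only after a batch succeeds, so a failed run is simply re-run.
  def loadCommittedOffset(): Long = 0L
  def fetchLatestOffset(): Long = 1000L
  def saveCommittedOffset(off: Long): Unit = ()

  def process(rdd: RDD[(String, String)]): Unit = rdd.foreach(println)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-loop"))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    while (true) {
      val from = loadCommittedOffset()
      val until = fetchLatestOffset()
      // One topic/partition shown for brevity
      val ranges = Array(OffsetRange.create("events", 0, from, until))

      // A plain batch RDD over an explicit offset range - no streaming
      // machinery, no checkpoint directory, and you decide when offsets
      // are committed.
      val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
        sc, kafkaParams, ranges)
      process(rdd)
      saveCommittedOffset(until)
      Thread.sleep(15 * 60 * 1000) // next run in 15 minutes
    }
  }
}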
And if you are using reduceByKeyAndWindow without the inverse reduce
function, then Spark has to re-reduce all of the data in the window on
every batch, instead of incrementally subtracting the batch that just slid
out of the window.
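For reference, a minimal sketch of the two forms, assuming a 15-minute
batch interval and a 60-minute window - the source and durations are
placeholders, and note the incremental variant requires a checkpoint
directory:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

object WindowDemo {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("window-demo"), Minutes(15))
    ssc.checkpoint("hdfs:///tmp/window-demo") // required by the incremental form

    // Placeholder source - substitute your Kafka direct stream here
    val pairs = ssc.socketTextStream("localhost", 9999).map(w => (w, 1L))

    // Non-incremental: the entire 60-minute window is re-reduced on
    // every 15-minute slide
    val full = pairs.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b, Minutes(60), Minutes(15))

    // Incremental: the batch leaving the window is subtracted via the
    // inverse function, so only the entering and leaving batches are touched
    val incr = pairs.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b, (a: Long, b: Long) => a - b,
      Minutes(60), Minutes(15))

    full.print(); incr.print()
    ssc.start(); ssc.awaitTermination()
  }
}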
Where are you calling checkpoint? Metadata checkpointing for a Kafka
direct stream should just be the offsets, not the data.
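The usual recovery pattern looks roughly like this (sketch against the
Spark 1.x API; the checkpoint path and broker are placeholders) - the
important part is that all DStream setup lives inside the creating
function so it can be rebuilt from the checkpoint:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamApp {
  val checkpointDir = "hdfs:///tmp/app-checkpoint" // placeholder path

  def createContext(): StreamingContext = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("direct-stream"), Minutes(15))
    ssc.checkpoint(checkpointDir) // metadata checkpoint: offsets, DAG, config
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("events"))
    stream.map(_._2).count().print() // stand-in for the real pipeline
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart this recovers the context - including Kafka offsets -
    // from the checkpoint instead of calling createContext again
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}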
TD can better speak to reduceByKeyAndWindow behavior when restoring from a
checkpoint, but ultimately the only available choice would be to replay the
prior window data.
Hi Tathagata/Cody,
I am facing a challenge in production with DAG behaviour during
checkpointing in Spark Streaming:
Step 1: Read data from Kafka every 15 minutes - call this KafkaStreamRDD
(~100 GB of data).
Step 2: Repartition KafkaStreamRDD from 5 to 100 partitions to parallelise
processing -