Re: Spark Streaming: Some issues (Could not compute split, block —— not found) and questions

2015-08-25 Thread Akhil Das
You hit "block not found" issues when your processing time exceeds the batch
duration (this happens with receiver-oriented streaming). If you are
consuming messages from Kafka, try to use the directStream, or you can
also set the StorageLevel to MEMORY_AND_DISK with the receiver-oriented consumer
(this might slow things down a bit, though).
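
Roughly, the two options look like this (just a sketch against the Spark 1.x
Kafka API; the broker, ZooKeeper, topic and consumer group names below are
placeholders):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("kafka-ingest"), Seconds(5))

// Option 1: receiverless direct stream; there are no receiver blocks that
// can be evicted before the batch that needs them has run.
val direct = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc,
  Map("metadata.broker.list" -> "broker1:9092"), // placeholder broker list
  Set("events"))                                 // placeholder topic

// Option 2: keep the receiver, but let blocks spill to disk instead of
// being dropped when memory runs short.
val receiverBased = KafkaUtils.createStream(
  ssc,
  "zk1:2181",          // placeholder ZooKeeper quorum
  "my-consumer-group", // placeholder consumer group
  Map("events" -> 1),  // placeholder topic -> receiver thread count
  StorageLevel.MEMORY_AND_DISK)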

Thanks
Best Regards

On Wed, Aug 19, 2015 at 8:21 PM, jlg jgri...@adzerk.com wrote:

 Some background on what we're trying to do:

 We have four Kinesis receivers with varying amounts of data coming through
 them. Ultimately we work on a unioned stream that is getting about 11
 MB/second of data. We use a batch size of 5 seconds.

 We create four distinct DStreams from this data that have different
 aggregation computations (various combinations of
 map/flatMap/reduceByKeyAndWindow and then finishing by serializing the
 records to JSON strings and writing them to S3). We want to do 30 minute
 windows of computations on this data, to get a better compression rate for
 the aggregates (there are a lot of repeated keys across this time frame, and
 we want to combine them all -- we do this using reduceByKeyAndWindow).

 But even when trying to do 5 minute windows, we have issues with "Could not
 compute split, block —— not found". This is being run on a YARN cluster and
 it seems like the executors are getting killed even though they should have
 plenty of memory.

 Also, it seems like no computation actually takes place until the end of the
 window duration. This seems inefficient if there is a lot of data that you
 know is going to be needed for the computation. Is there any good way around
 this?

 These are some of the configuration settings we are using for Spark:

 spark.executor.memory=26000M,\
 spark.executor.cores=4,\
 spark.executor.instances=5,\
 spark.driver.cores=4,\
 spark.driver.memory=24000M,\
 spark.default.parallelism=128,\
 spark.streaming.blockInterval=100ms,\
 spark.streaming.receiver.maxRate=2,\
 spark.akka.timeout=300,\
 spark.storage.memoryFraction=0.6,\
 spark.rdd.compress=true,\
 spark.executor.instances=16,\
 spark.serializer=org.apache.spark.serializer.KryoSerializer,\
 spark.kryoserializer.buffer.max=2047m,\


 Is this the correct way to do this, and how can I further debug to figure
 out this issue?







Spark Streaming: Some issues (Could not compute split, block —— not found) and questions

2015-08-19 Thread jlg
Some background on what we're trying to do:

We have four Kinesis receivers with varying amounts of data coming through
them. Ultimately we work on a unioned stream that is getting about 11
MB/second of data. We use a batch size of 5 seconds. 
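
For reference, roughly how the ingest side is wired up (a simplified sketch,
not the real code; the app name, stream names, endpoint and region are
placeholders, and the KinesisUtils.createStream signature assumed here is the
Spark 1.x one):

import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

val batchInterval = Seconds(5)
val ssc = new StreamingContext(new SparkConf().setAppName("kinesis-agg"), batchInterval)

// Four Kinesis receivers, unioned into a single DStream of raw records.
val receivers = Seq("stream-1", "stream-2", "stream-3", "stream-4").map { name =>
  KinesisUtils.createStream(
    ssc, "agg-app", name,
    "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
    InitialPositionInStream.LATEST, batchInterval,
    StorageLevel.MEMORY_AND_DISK_2)
}
val unioned = ssc.union(receivers)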

We create four distinct DStreams from this data that have different
aggregation computations (various combinations of
map/flatMap/reduceByKeyAndWindow and then finishing by serializing the
records to JSON strings and writing them to S3). We want to do 30 minute
windows of computations on this data, to get a better compression rate for
the aggregates (there are a lot of repeated keys across this time frame, and
we want to combine them all -- we do this using reduceByKeyAndWindow). 
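
Continuing that sketch, one of the four aggregations looks roughly like this
(the key extraction and reduce function are placeholders, not the real logic):

import org.apache.spark.streaming.Minutes

// Placeholder key extraction; the real job derives keys from the record payload.
def extractKey(record: Array[Byte]): String =
  new String(record, "UTF-8").takeWhile(_ != ',')

// Combine repeated keys across a 30-minute window that slides every 30 minutes,
// so each key is written out once per window.
val aggregated = unioned
  .map(record => (extractKey(record), 1L))
  .reduceByKeyAndWindow(_ + _, Minutes(30), Minutes(30))

aggregated.foreachRDD { rdd =>
  () // serialize to JSON strings and write to S3 here (omitted)
}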

But even when trying to do 5 minute windows, we have issues with "Could not
compute split, block —— not found". This is being run on a YARN cluster and
it seems like the executors are getting killed even though they should have
plenty of memory. 

Also, it seems like no computation actually takes place until the end of the
window duration. This seems inefficient if there is a lot of data that you
know is going to be needed for the computation. Is there any good way around
this?

These are some of the configuration settings we are using for Spark:

spark.executor.memory=26000M,\
spark.executor.cores=4,\
spark.executor.instances=5,\
spark.driver.cores=4,\
spark.driver.memory=24000M,\
spark.default.parallelism=128,\
spark.streaming.blockInterval=100ms,\
spark.streaming.receiver.maxRate=2,\
spark.akka.timeout=300,\
spark.storage.memoryFraction=0.6,\
spark.rdd.compress=true,\
spark.executor.instances=16,\
spark.serializer=org.apache.spark.serializer.KryoSerializer,\
spark.kryoserializer.buffer.max=2047m,\


Is this the correct way to do this, and how can I further debug to figure
out this issue? 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Some-issues-Could-not-compute-split-block-not-found-and-questions-tp24342.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org