You hit block not found issues when you processing time exceeds the batch
duration (this happens with receiver oriented streaming). If you are
consuming messages from Kafka then try to use the directStream or you can
also set StorageLevel to MEMORY_AND_DISK with receiver oriented consumer.
(This might slow things down a bit though).
Thanks
Best Regards
On Wed, Aug 19, 2015 at 8:21 PM, jlg jgri...@adzerk.com wrote:
Some background on what we're trying to do:
We have four Kinesis receivers with varying amounts of data coming through
them. Ultimately we work on a unioned stream that is getting about 11
MB/second of data. We use a batch size of 5 seconds.
We create four distinct DStreams from this data that have different
aggregation computations (various combinations of
map/flatMap/reduceByKeyAndWindow and then finishing by serializing the
records to JSON strings and writing them to S3). We want to do 30 minute
windows of computations on this data, to get a better compression rate for
the aggregates (there are a lot of repeated keys across this time frame,
and
we want to combine them all -- we do this using reduceByKeyAndWindow).
But even when trying to do 5 minute windows, we have issues with Could not
compute split, block —— not found. This is being run on a YARN cluster and
it seems like the executors are getting killed even though they should have
plenty of memory.
Also, it seems like no computation actually takes place until the end of
the
window duration. This seems inefficient if there is a lot of data that you
know is going to be needed for the computation. Is there any good way
around
this?
There are some of the configuration settings we are using for Spark:
spark.executor.memory=26000M,\
spark.executor.cores=4,\
spark.executor.instances=5,\
spark.driver.cores=4,\
spark.driver.memory=24000M,\
spark.default.parallelism=128,\
spark.streaming.blockInterval=100ms,\
spark.streaming.receiver.maxRate=2,\
spark.akka.timeout=300,\
spark.storage.memoryFraction=0.6,\
spark.rdd.compress=true,\
spark.executor.instances=16,\
spark.serializer=org.apache.spark.serializer.KryoSerializer,\
spark.kryoserializer.buffer.max=2047m,\
Is this the correct way to do this, and how can I further debug to figure
out this issue?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Some-issues-Could-not-compute-split-block-not-found-and-questions-tp24342.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org