Hi all,
I am running some benchmarks on a simple Spark application which consists of:
- textFileStream() to extract text records from HDFS files
- map() to parse records into JSON objects
- updateStateByKey() to calculate and store an in-memory state for each key.
The processing time per batch
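A minimal sketch of such a pipeline, for reference (the batch interval, the paths, and the comma-split parse standing in for the actual JSON parsing are illustrative assumptions, not the original code):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulPipeline {
  // Pure state-update function: folds the counts arriving in this batch
  // into the running total kept for each key by updateStateByKey().
  def updateFunc(newValues: Seq[Int], state: Option[Int]): Option[Int] =
    Some(newValues.sum + state.getOrElse(0))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("benchmark")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint("hdfs:///tmp/checkpoints") // required by updateStateByKey

    // Pick up new text files as they appear in the directory.
    val lines = ssc.textFileStream("hdfs:///data/incoming")

    // Simplified parse step, standing in for the JSON parsing:
    // extract a key from each record and count occurrences.
    val pairs = lines.map { line =>
      val fields = line.split(',')
      (fields(0), 1)
    }

    val state = pairs.updateStateByKey(updateFunc _)
    state.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```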
Hi all,
We are looking for a tool that would let us visualize the DAG generated by a
Spark application as a simple graph.
This graph would represent the Spark Job, its stages and the tasks inside
the stages, with the dependencies between them (either narrow or shuffle
dependencies).
The Spark
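In the meantime, the lineage of a single RDD (narrow vs. shuffle dependencies, indented at stage boundaries) can already be printed in text form with `toDebugString`; a minimal sketch, with an illustrative word count standing in for a real job:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lineage").setMaster("local[2]"))

    val counts = sc.parallelize(Seq("a b", "b c"))
      .flatMap(_.split(" ")) // narrow dependency
      .map(w => (w, 1))      // narrow dependency
      .reduceByKey(_ + _)    // shuffle dependency (stage boundary)

    // toDebugString renders the lineage of this RDD, one line per
    // ancestor, indented where a shuffle splits the job into stages.
    println(counts.toDebugString)
    sc.stop()
  }
}
```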
For anybody who's interested in this, here's a link to a PR that addresses this feature:
https://github.com/apache/spark/pull/2077
(thanks to Todd Nist for sending it to me)
Hi all,
I am trying to create RDDs from within rdd.foreachPartition() so I can save these RDDs to ElasticSearch on the fly:
stream.foreachRDD(rdd => {
  rdd.foreachPartition {
    iterator => {
      val sc = rdd.context
      iterator.foreach {
        case (cid,
Hi all,
Spark Streaming occasionally (not always) hangs indefinitely on my program
right after the first batch has been processed.
As you can see in the following screenshots of the Spark Streaming
monitoring UI, it hangs on the map stages that correspond (I assume) to the
second batch that is
Here's what we've tried so far as a first example of a custom Mongo receiver:
class MongoStreamReceiver(host: String)
  extends NetworkReceiver[String] {

  protected lazy val blocksGenerator: BlockGenerator =
    new BlockGenerator(StorageLevel.MEMORY_AND_DISK_SER_2)

  protected def onStart()