Hey Cody (et al.),
A few more questions related to this. It sounds like our missing-data issues
are fixed with this approach. Could you shed some light on a few things that
came up?
---------------------
Processing our data inside a single `foreachPartition` function appears to be
very different from the pattern shown in the programming guide. Does this
become problematic with additional, interleaved reduce/filter/map steps?
```
// typical, per the programming guide?
dstream
  .map { ... }
  .reduce { ... }
  .filter { ... }
  .reduce { ... }
  .foreachRDD { rdd => writeToDb(rdd) }

// with foreachPartition: the same steps run locally on the partition's
// iterator (schematic; Iterator.reduce collapses to a single value, so
// the later steps cannot chain exactly as written)
rdd.foreachPartition { iter =>
  iter
    .map { ... }
    .reduce { ... }
    .filter { ... }
    .reduce { ... }
}
```
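To make the question concrete, here is roughly the shape I have in mind for
the `foreachPartition` variant. This is a sketch only: `stream` is assumed to
come from `KafkaUtils.createDirectStream`, and `processRecord`,
`isInteresting`, and `writeToDb` are placeholders for our own logic.
```
// Sketch only: assumes a direct stream and placeholder functions
// processRecord / isInteresting / writeToDb.
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // Offset ranges are only available on the driver-side RDD, so grab them here...
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition { iter =>
    // ...then look up this partition's range inside the task.
    val range = offsetRanges(TaskContext.get.partitionId)

    // All map/filter/aggregate work stays on the local iterator.
    val results = iter.map(processRecord).filter(isInteresting)

    // Commit the data and the Kafka offsets in one transaction.
    writeToDb(results, range)
  }
}
```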
---------------------------------
Could the above be simplified by having one Kafka partition per DStream,
rather than one Kafka partition per RDD partition? That way, we wouldn't need
to do our processing inside each partition, as there would only be one set of
Kafka metadata to commit.
Presumably, one could `join` DStreams when topic-level aggregates were
needed.
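To make that concrete, here is a rough sketch of what I mean, using the
`fromOffsets` variant of `createDirectStream`. The topic name "events", the
partition count, and the starting offsets of 0 are all made-up placeholders.
```
// Sketch only: one direct stream per Kafka partition of an assumed
// topic "events", starting from offset 0 in each partition.
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val numPartitions = 4 // assumed; would really come from topic metadata
val streams = (0 until numPartitions).map { p =>
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
    ssc,
    kafkaParams,
    Map(TopicAndPartition("events", p) -> 0L),
    (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
  )
}

// Topic-level aggregates would then need a union/join across the streams.
val unioned = ssc.union(streams)
```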
It seems this was Michael Noll's approach in his blog post
(http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/),
although his primary motivation appears to be maintaining high throughput /
parallelism rather than tracking Kafka metadata.
---------------------------------
From the blog post:
"... there is no long-running receiver task that occupies a core per stream
regardless of what the message volume is."
Is this because data is retrieved by polling rather than by maintaining a
socket? Is it still the case that there is only one receiver process per
DStream? If so, maybe it is wise to keep DStreams and Kafka partitions 1:1,
or else discover the machine's NIC limit?
Can you think of a reason not to do this? Cluster utilization, or the like,
perhaps?
--------------------------------
And it may seem a silly question, but does `foreachPartition` guarantee that a
single worker will process the passed function? Or might two workers split
the work?
E.g., for foreachPartition(f):
Worker 1: f( Iterator[partition 1, records 1-50] )
Worker 2: f( Iterator[partition 1, records 51-100] )
It is unclear from the scaladocs
(https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD).
But you can imagine that, if it is critical for this data to be committed in a
single transaction, two workers would cause issues.
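For what it's worth, this is the experiment I was going to run to check: if
`foreachPartition` invokes the function exactly once per partition, the
accumulator below should equal the number of partitions.
```
// Sketch: count how many times foreachPartition invokes f, versus the
// number of partitions in the RDD.
import org.apache.spark.TaskContext

val invocations = sc.accumulator(0)
val rdd = sc.parallelize(1 to 100, numSlices = 4)

rdd.foreachPartition { iter =>
  invocations += 1
  println(s"partition ${TaskContext.get.partitionId} saw ${iter.size} records")
}

// If f runs exactly once per partition, this prints 4.
println(invocations.value)
```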
-- Will O