Hey Cody (et al.),
A few more questions related to this. It sounds like our missing-data issues
are fixed with this approach. Could you shed some light on a few things that
came up?
---------------------
Processing our data inside a single `foreachPartition` function appears to be
very different from the pattern shown in the programming guide. Does this
become problematic with additional, interleaved reduce/filter/map steps?
```
// typical, per the programming guide?
dstream
  .map { ... }
  .reduce { ... }
  .filter { ... }
  .reduce { ... }
  .foreachRDD { rdd => writeToDb(rdd) }

// with foreachPartition: the same steps run locally on the partition's
// iterator (schematic; Iterator.reduce collapses to a single value, so
// the later steps cannot chain exactly as written)
rdd.foreachPartition { iter =>
  iter
    .map { ... }
    .reduce { ... }
    .filter { ... }
    .reduce { ... }
}
```
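To make the question concrete, here is roughly the shape I have in mind for
the `foreachPartition` variant. This is a sketch only: `stream` is assumed to
come from `KafkaUtils.createDirectStream`, and `processRecord`,
`isInteresting`, and `writeToDb` are placeholders for our own logic.
```
// Sketch only: assumes a direct stream and placeholder functions
// processRecord / isInteresting / writeToDb.
import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // Offset ranges are only available on the driver-side RDD, so grab them here...
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition { iter =>
    // ...then look up this partition's range inside the task.
    val range = offsetRanges(TaskContext.get.partitionId)

    // All map/filter/aggregate work stays on the local iterator.
    val results = iter.map(processRecord).filter(isInteresting)

    // Commit the data and the Kafka offsets in one transaction.
    writeToDb(results, range)
  }
}
```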
---------------------------------
Could the above be simplified by having one Kafka partition per DStream,
rather than one Kafka partition per RDD partition? That way, we wouldn't need
to do our processing inside each partition, as there would only be one set of
Kafka metadata to commit.
Presumably, one could `join` DStreams when topic-level aggregates were
needed.
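To make that concrete, here is a rough sketch of what I mean, using the
`fromOffsets` variant of `createDirectStream`. The topic name "events", the
partition count, and the starting offsets of 0 are all made-up placeholders.
```
// Sketch only: one direct stream per Kafka partition of an assumed
// topic "events", starting from offset 0 in each partition.
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val numPartitions = 4 // assumed; would really come from topic metadata
val streams = (0 until numPartitions).map { p =>
  KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
    ssc,
    kafkaParams,
    Map(TopicAndPartition("events", p) -> 0L),
    (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)
  )
}

// Topic-level aggregates would then need a union/join across the streams.
val unioned = ssc.union(streams)
```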
It seems this was Michael Noll's approach in his blog post
(http://www.michael-noll.com/blog/2014/10/01/kafka-spark-streaming-integration-example-tutorial/),
although his primary motivation appears to be maintaining high throughput /
parallelism rather than tracking Kafka metadata.
---------------------------------
From the blog post:
"... there is no long-running receiver task that occupies a core per stream
regardless of what the message volume is."
Is this because data is retrieved by polling rather than by maintaining a
socket? Is it still the case that there is only one receiver process per
DStream? If so, maybe it is wise to keep DStreams and Kafka partitions 1:1,
or else discover the machine's NIC limit?
Can you think of a reason not to do this? Cluster utilization, or the like,
perhaps?
--------------------------------
And it may seem a silly question, but does `foreachPartition` guarantee that a
single worker will process the passed function? Or might two workers split
the work?
E.g., for foreachPartition(f):
Worker 1: f( Iterator[partition 1, records 1-50] )
Worker 2: f( Iterator[partition 1, records 51-100] )
It is unclear from the scaladocs
(https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD).
But you can imagine that, if it is critical for this data to be committed in a
single transaction, two workers would cause issues.
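For what it's worth, this is the experiment I was going to run to check: if
`foreachPartition` invokes the function exactly once per partition, the
accumulator below should equal the number of partitions.
```
// Sketch: count how many times foreachPartition invokes f, versus the
// number of partitions in the RDD.
import org.apache.spark.TaskContext

val invocations = sc.accumulator(0)
val rdd = sc.parallelize(1 to 100, numSlices = 4)

rdd.foreachPartition { iter =>
  invocations += 1
  println(s"partition ${TaskContext.get.partitionId} saw ${iter.size} records")
}

// If f runs exactly once per partition, this prints 4.
println(invocations.value)
```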
-- Will O