The exception is telling you precisely what is wrong. The Kafka source has a
schema of (topic, partition, offset, key, value, timestamp, timestampType).
Nothing about those columns makes sense as a tweet. You need to tell Spark
how to get from bytes to a tweet; it doesn't know how you serialized it.
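For example, if the tweets were written as JSON into the Kafka value, a
minimal sketch would look roughly like the following (the topic name,
broker address, and the id/text payload are all placeholders here; adjust
them to whatever you actually produced):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.from_json
  import org.apache.spark.sql.types._

  val spark = SparkSession.builder.appName("tweets").getOrCreate()
  import spark.implicits._

  // Hypothetical shape of the serialized tweet; match it to what you wrote.
  val tweetSchema = new StructType()
    .add("id", LongType)
    .add("text", StringType)

  val tweets = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
    .option("subscribe", "tweets")                        // placeholder
    .load()
    .selectExpr("CAST(value AS STRING) AS json")          // value is binary
    .select(from_json($"json", tweetSchema).as("tweet"))  // bytes -> tweet
    .select("tweet.*")

If you serialized with Avro, protobuf, etc., swap from_json for the matching
deserializer; the point is only that the mapping from value bytes to columns
has to be spelled out explicitly.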
The default spark.sql.streaming.commitProtocolClass is
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ManifestFileCommitProtocol.scala
which may or may not be best suited to every need.
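If you do need a different one, it is just a config key. A minimal sketch,
assuming an existing SparkSession named spark and a hypothetical
com.example.MyCommitProtocol that extends FileCommitProtocol:

  spark.conf.set(
    "spark.sql.streaming.commitProtocolClass",
    "com.example.MyCommitProtocol")

The same key can be passed at submit time via --conf instead of being set in
code.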
Code deploys could be improved by ensuring
1. Would it not be more natural to write the processed data to Kafka and
sink the processed data from Kafka to S3?
2a. addBatch is the time Sink#addBatch took as measured by StreamExecution.
2b. getBatch is the time Source#getBatch took as measured by
StreamExecution.
3. triggerExecution is effectively the end-to-end time for the whole trigger
(micro-batch); see the sketch after this list for where all three show up.
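All three durations are reported under durationMs in the query progress, so
you can watch them from a listener. A minimal sketch, assuming an existing
SparkSession named spark (the key names can vary slightly across Spark
versions):

  import org.apache.spark.sql.streaming.StreamingQueryListener
  import org.apache.spark.sql.streaming.StreamingQueryListener._

  spark.streams.addListener(new StreamingQueryListener {
    override def onQueryStarted(e: QueryStartedEvent): Unit = ()
    override def onQueryTerminated(e: QueryTerminatedEvent): Unit = ()
    override def onQueryProgress(e: QueryProgressEvent): Unit = {
      val d = e.progress.durationMs // map of phase name -> milliseconds
      println(s"getBatch=${d.get("getBatch")}ms " +
        s"addBatch=${d.get("addBatch")}ms " +
        s"triggerExecution=${d.get("triggerExecution")}ms")
    }
  })

The same numbers are available from query.lastProgress if you would rather
poll than listen.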
You can leverage dynamic resource allocation with Structured Streaming.
Certainly there's an argument that trivial jobs won't benefit, and certainly
there's an argument that important jobs should have fixed resources for
stable end-to-end latency.
A few scenarios where it does help come to mind:
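If you do turn it on, it is the standard set of knobs, nothing
streaming-specific. A minimal sketch (the executor bounds are placeholders,
and an external shuffle service is assumed):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder
    .appName("streaming-with-dra")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.shuffle.service.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")  // placeholder
    .config("spark.dynamicAllocation.maxExecutors", "20") // placeholder
    .getOrCreate()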
- I want my