Re: unable to stream kafka messages

2017-08-24 Thread cbowden
The exception is telling you precisely what is wrong. The kafka source has a schema of (topic, partition, offset, key, value, timestamp, timestampType). Nothing about those columns makes sense as a tweet. You need to inform spark how to get from bytes to tweet, it doesn't know how you serialized

Re: [Spark Structured Streaming]: truncated Parquet after driver crash or kill

2017-08-24 Thread cbowden
The default spark.sql.streaming.commitProtocolClass is https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ManifestFileCommitProtocol.scala which may or may not be the best suited for all needs. Code deploys could be improved by ensuring

Re: Structured Streaming: multiple sinks

2017-08-24 Thread cbowden
1. would it not be more natural to write processed to kafka and sink processed from kafka to s3? 2a. addBatch is the time Sink#addBatch took as measured by StreamExecution. 2b. getBatch is the time Source#getBatch took as measured by StreamExecution. 3. triggerExecution is effectively end-to-end

Re: [Streaming][Structured Streaming] Understanding dynamic allocation in streaming jobs

2017-08-24 Thread cbowden
You can leverage dynamic resource allocation with structured streaming. Certainly there's an argument trivial jobs won't benefit. Certainly there's an argument important jobs should have fixed resources for stable end to end latency. Few scenarios come to mind with benefits: - I want my