I have a streaming application that reads from Kafka (direct stream) and then writes Parquet files. It is a pretty simple app that gets a Kafka direct stream (8 partitions) and then calls `stream.foreachRDD`, storing to Parquet via a DataFrame. Batch intervals are set to 10 seconds. During the write I use `partitionBy` so the data is laid out by time and client.
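For reference, the relevant part of the app looks roughly like this (simplified; the broker address, topic name, schema, and output path are placeholders, not the real ones):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.sql.{SQLContext, SaveMode}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaToParquet {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("kafka-to-parquet")
    // 10-second batch interval, as described above
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker:9092") // placeholder broker
    val topics = Set("events")                                     // placeholder topic

    // Direct stream over the 8-partition topic
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.foreachRDD { rdd =>
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._

      // Toy schema standing in for the real event parsing: (client, eventTime, payload)
      val df = rdd.map(_._2.split(","))
        .map(a => (a(0), a(1), a(2)))
        .toDF("client", "eventTime", "payload")

      // partitionBy lays the Parquet files out by time and client
      df.write
        .mode(SaveMode.Append)
        .partitionBy("eventTime", "client")
        .parquet("gs://my-bucket/events") // or a local FS path
    }

    ssc.start()
    ssc.awaitTermination()
  }
}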
When running the app and storing the data on a local FS, the performance is somewhat acceptable (~1,800 events in 3 seconds). If I point the destination at Google Cloud Storage using the GCS connector, the same number of records takes about 4 minutes. All other things are equal except for the file destination. Has anyone tried streaming directly to GCS or S3 and overcome this unacceptable performance? It can never keep up.

Thanks,
Ivan