I have a streaming application that reads from Kafka (direct stream) and then
writes Parquet files. It is a pretty simple app: it gets a Kafka direct
stream (8 partitions), calls `stream.foreachRDD`, and stores the data to
Parquet using a DataFrame. The batch interval is set to 10 seconds. During the
write I use `partitionBy` so the data is laid out by time and client.
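
Roughly, the write path looks like the sketch below (simplified; the topic
name, brokers, record schema, and output path are placeholders, not the
actual app):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object StreamToParquet {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("KafkaToParquet")
        val ssc = new StreamingContext(conf, Seconds(10))   // 10-second batches
        val kafkaParams = Map("metadata.broker.list" -> "broker:9092")
        val topics = Set("events")                          // 8-partition topic

        val stream = KafkaUtils.createDirectStream[
          String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)

        stream.foreachRDD { rdd =>
          val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
          import sqlContext.implicits._
          // Hypothetical record layout: "client,time,payload"
          val df = rdd.map(_._2.split(","))
                      .map(f => (f(0), f(1), f(2)))
                      .toDF("client", "time", "payload")
          df.write
            .partitionBy("time", "client")   // partition output by time and client
            .mode("append")
            .parquet("gs://my-bucket/events/")   // or a local / s3a:// path
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }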

When running the app and storing the data to a local FS, the performance is
somewhat acceptable (~1800 events in 3 seconds). If I point the destination
at Google Cloud Storage using the GCS connector, the same number of records
takes about 4 minutes. Everything else is equal except for the file
destination.

Has anyone tried streaming directly to GCS or S3 and managed to overcome this
unacceptable performance? The app can never keep up.

Thanks,

Ivan 


