We are currently using Camus for our Kafka-to-HDFS pipeline, storing the data as SequenceFiles, but I understand Spark Streaming can be used to save it as Parquet instead. From what I've read about Parquet, its layout is optimized for queries against large files. Are there any options in Spark to specify the block size to help with this, or is it dependent on the specified time window?
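For reference, here is a rough sketch of the kind of pipeline I have in mind, in case it helps frame the question. It assumes Spark 1.4+ DataFrame APIs and a receiver-based Kafka stream; the Event(key, value) schema, ZooKeeper address, group id, topic name, and output path are all placeholders of mine. If I understand correctly, the Parquet row-group size is a Hadoop/Parquet property (parquet.block.size) rather than a Spark Streaming setting, which is why I'm unsure how it interacts with the batch window:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Hypothetical record schema, just for illustration
case class Event(key: String, value: String)

object KafkaToParquet {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("KafkaToParquet")
    val ssc  = new StreamingContext(conf, Seconds(300)) // 5-minute batch window

    // parquet.block.size sets the Parquet row-group size in bytes; it is
    // a Hadoop-side property, so presumably independent of the window.
    ssc.sparkContext.hadoopConfiguration
      .setInt("parquet.block.size", 256 * 1024 * 1024)

    val sqlContext = new SQLContext(ssc.sparkContext)
    import sqlContext.implicits._

    // Receiver-based Kafka stream; "zk:2181", "etl-group" and the topic
    // map are placeholder values.
    val stream = KafkaUtils.createStream(
      ssc, "zk:2181", "etl-group", Map("events" -> 1))

    // Write each batch out as Parquet under a per-batch directory.
    stream.foreachRDD { (rdd, time) =>
      rdd.map { case (k, v) => Event(k, v) }
        .toDF()
        .write
        .parquet(s"hdfs:///data/events/batch-${time.milliseconds}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

In a sketch like this, my assumption is that parquet.block.size would cap the row-group size within each file, while the total data per file would still depend on the batch window and partitioning; I'd appreciate confirmation or correction on that.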
Thanks!