My understanding is that snappy is a block compression scheme. When using the HDFS sink with snappy, I am wondering whether one batch of events corresponds to one compressed chunk in the snappy file?
This is interesting in the face of HDFS failures. If the sink is in the middle of writing a batch when the HDFS connection has an error, then we have a partially written snappy file. If each Flume sink batch corresponds to one snappy chunk, then only the last chunk in the snappy file will be unreadable, and that's OK, since that last batch will be redelivered to another file. However, if multiple batches end up in a single snappy chunk, then the last few batches will be unrecoverable from the snappy file, leading to data loss. -roshan
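For context, here is a sketch of the kind of sink configuration the question assumes (the `hdfs.fileType`, `hdfs.codeC`, and `hdfs.batchSize` property names are from the Flume HDFS sink; the agent, sink, channel names, and path are placeholders):

```properties
# Hypothetical agent "a1" with sink "k1" writing snappy-compressed output
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = snappy
# The sink flushes once per batch; the question above is whether each
# such flush produces exactly one compressed chunk in the output file
a1.sinks.k1.hdfs.batchSize = 1000
```

With this setup, a transaction of up to `batchSize` events is taken from the channel and written before a flush, so the failure scenario described above hinges on how the compressed output stream aligns snappy blocks with those flushes.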
