I would like to know how people design their storage directories for data
that arrives in the system continuously. For example, for a given feature
we receive a continuous data feed that is written to a SequenceFile, then
converted to a more structured format by a MapReduce job and stored in
tab-separated files. For such a continuous feed, what is the best way to
organize the directories and name them? Should the layout be based only on
the timestamp, or is there something better that helps in organizing the
data?
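
To make the first question concrete, here is a minimal Python sketch of the
kind of timestamp-based layout I have in mind. The root path `/data`, the
feed name `clickstream`, the stage name `raw` and the hourly granularity are
all made-up placeholders for illustration, not an established convention.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def partition_dir(root, feed, stage, ts):
    """Build an hourly partition directory for one feed and one stage.

    All path components are hypothetical; change the granularity as needed.
    """
    # Normalize to UTC so the partition does not depend on the local time zone.
    ts = ts.astimezone(timezone.utc)
    return str(
        PurePosixPath(root) / feed / stage
        / ts.strftime("%Y") / ts.strftime("%m")
        / ts.strftime("%d") / ts.strftime("%H")
    )


example = partition_dir(
    "/data", "clickstream", "raw",
    datetime(2013, 5, 7, 14, 30, tzinfo=timezone.utc),
)
print(example)  # /data/clickstream/raw/2013/05/07/14
```

With zero-padded components the directories sort chronologically by name,
and a job could pick up one day of input with a glob such as
`/data/clickstream/raw/2013/05/07/*`. Whether this is better than a flat
timestamp in the file name is exactly what I am unsure about.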

The second part of the question: is it better to store the output in
SequenceFiles so that we can take advantage of per-record compression? This
seems to be required, since a file compressed as a whole with gzip or
Snappy cannot be split and would therefore launch only one map task.

And the last question: when compressing a flat file, should it first be
split into multiple files so that we get multiple mappers if we need to run
another job on it? LZO is another alternative, but it requires additional
configuration. Is it preferred?
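
For reference, the pre-splitting I am describing would look roughly like
the following Python sketch. The function name, the `part-00000.gz` naming
and the line-count threshold are my own placeholders; in practice one would
probably size the parts by bytes (for example near the HDFS block size)
rather than by line count.

```python
import gzip
import os


def split_and_gzip(src_path, out_dir, lines_per_part):
    """Write src_path as several gzip files of at most lines_per_part lines.

    Splits only on line boundaries, so no record is cut in half.
    Returns the list of part paths in the order they were written.
    """
    if lines_per_part < 1:
        raise ValueError("lines_per_part must be at least 1")
    os.makedirs(out_dir, exist_ok=True)
    parts = []
    out = None
    try:
        with open(src_path, "rb") as src:
            for i, line in enumerate(src):
                if i % lines_per_part == 0:
                    # Start a new part: close the previous one first.
                    if out is not None:
                        out.close()
                    path = os.path.join(out_dir, "part-%05d.gz" % len(parts))
                    parts.append(path)
                    out = gzip.open(path, "wb")
                out.write(line)
    finally:
        if out is not None:
            out.close()
    return parts
```

My assumption is that, with the default text input format, each resulting
`.gz` file would then be read by its own map task, because a single gzip
file is not splittable.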

Any articles or suggestions would be very helpful.
