Hello,

The file format topic is still confusing me, and I would appreciate it if you
could share your thoughts and experience with me.

From reading different books/articles/websites I understand that
- Sequence files (used frequently, but not only, for binary data),
- Avro,
- RC (developed to work best with Hive - columnar storage) and
- ORC (a successor of RC that gives Hive another performance boost - Stinger
initiative)
are all container file formats that solve the "small files problem", and all
support compression and splitting.
Additionally, each file format was developed with specific features/benefits
in mind.
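
To illustrate what I mean by a "container" format, here is a toy Python
sketch. It is my own simplification and NOT the on-disk layout of sequence
files or any other format listed above (the real formats add headers, sync
markers for splitting, and compression). The names pack and unpack are
hypothetical. It only shows the basic idea: many small files become
key/value records (file name, content) inside one large file.

```python
import io
import struct


def pack(files, out):
    """Append (name, payload) pairs to one stream as length-prefixed records."""
    for name, payload in files:
        key = name.encode("utf-8")
        # Toy record header: 4-byte key length + 4-byte value length (big-endian).
        out.write(struct.pack(">II", len(key), len(payload)))
        out.write(key)
        out.write(payload)


def unpack(stream):
    """Yield the (name, payload) pairs back from the stream, in order."""
    while True:
        header = stream.read(8)
        if not header:
            return  # clean end of stream
        if len(header) < 8:
            raise ValueError("truncated record header")
        key_len, value_len = struct.unpack(">II", header)
        name = stream.read(key_len).decode("utf-8")
        payload = stream.read(value_len)
        yield name, payload


# Two tiny stand-ins for the XML documents (hypothetical content).
small_files = [("a.xml", b"<doc>1</doc>"), ("b.xml", b"<doc>2</doc>")]

buffer = io.BytesIO()
pack(small_files, buffer)      # many small inputs -> one container stream
buffer.seek(0)
restored = list(unpack(buffer))
```

The point of the sketch is that HDFS would then only have to track one large
file instead of one entry per document.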

Imagine I have the following text source data:
- 1 TB of XML documents (several million small files)
- 1 TB of JSON documents (several hundred thousand medium-sized files)
- 1 TB of Apache log files (several thousand larger files)

How should I store this data in HDFS to process it using Java MapReduce,
Pig and Hive?
I want to use the best tool for my specific problem - with "best"
performance of course - i.e. maybe one problem on the apache log data can be
best solved using Java MapReduce, another one using Hive or Pig.

Should I simply put the data into HDFS as it arrives - i.e. as
plain text files?
Or should I convert all my data to a container file format like sequence
files, Avro, RC or ORC?

Based on this example, I believe
- the XML documents will need to be converted to a container file format
to overcome the "small files problem".
- the JSON documents should probably not be affected by the "small files
problem".
- the Apache log files should definitely not be affected by the "small files
problem", so they could be stored as plain text files.
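
To check my reasoning, here is a rough back-of-the-envelope calculation in
Python. The file counts are hypothetical (I only know the rough orders of
magnitude), and the 128 MiB block size is an assumption that depends on how
the cluster is configured. The helper name summarize is mine as well.

```python
import math

BLOCK_SIZE = 128 * 1024**2  # ASSUMED HDFS block size (128 MiB); cluster-specific
TIB = 1024**4               # each data set is roughly 1 TB

# HYPOTHETICAL file counts, chosen only to match the rough orders of magnitude.
datasets = {
    "xml": 5_000_000,
    "json": 300_000,
    "logs": 5_000,
}


def summarize(total_bytes, file_count, block_size=BLOCK_SIZE):
    """Average file size and a rough count of blocks the NameNode must track."""
    avg = total_bytes / file_count
    # Every file needs at least one block entry, however small the file is.
    blocks_per_file = max(1, math.ceil(avg / block_size))
    return {
        "avg_mib": avg / 1024**2,
        "blocks": file_count * blocks_per_file,
        "small": avg < block_size,  # file is smaller than a single block
    }


report = {name: summarize(TIB, count) for name, count in datasets.items()}
for name, row in report.items():
    print(f"{name}: avg {row['avg_mib']:.1f} MiB, "
          f"~{row['blocks']} blocks, smaller than a block: {row['small']}")
```

Under these assumptions the XML files average well below 1 MiB and the log
files average more than one block each. The JSON files would average only a
few MiB, which makes me less sure about my second point.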

So, some source data needs to be converted to a container file format,
others not necessarily.
But what is really advisable?

Is it advisable to store all data (XML, JSON, Apache logs) in one specific
container file format in the cluster - let's say you decide to use sequence
files?
Having only one file format in HDFS is of course a benefit in terms of
managing the files and writing Java MapReduce/Pig/Hive code against it.
Sequence files are certainly not a bad idea in this case, but Hive queries
would probably benefit more from, let's say, RC/ORC.

Therefore, is it better to use a mix of plain text files and/or one or more
container file formats simultaneously?

I know that there will be no crystal-clear answer here as it always
"depends", but what approach should be taken here, or what is usually used
in the community out there?

I welcome any feedback and any experiences you have had.

Thanks
