I've been experimenting with MapReduce jobs using the CSV and Avro formats.
What I find strange is that the Avro output is larger than the CSV input.

For example, I exported some data as CSV, which is about 1.6GB. I then
wrote a schema and a MapReduce job that takes that CSV, serializes it to
Avro, and writes the output back to HDFS.

When I checked the file size of the output, it was 2.4GB. I assumed that
the size would be smaller because the data is converted into binary, but I
was wrong. Do you know what the reason is, and can you refer me to some
documentation on this?

I've checked the .avro file and I could see that the header contains the
schema and the rest is data blocks.
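
For reference, here is a minimal sketch of how that header can be inspected
without any Avro library. It assumes the object container file layout as I
understand it from the Avro specification: a 4-byte magic, a metadata map
holding avro.schema and (optionally) avro.codec, then a 16-byte sync marker.
The helper names (read_long, read_bytes, read_header, describe) are my own
and are not part of any Avro API.

```python
import io
import json

MAGIC = b"Obj\x01"  # 4-byte magic at the start of an Avro container file


def read_long(stream):
    """Decode one Avro long (zigzag-encoded, variable-length integer)."""
    shift = 0
    raw = 0
    while True:
        byte = stream.read(1)
        if not byte:
            raise EOFError("unexpected end of stream while reading a long")
        raw |= (byte[0] & 0x7F) << shift
        if not byte[0] & 0x80:  # high bit clear means this is the last byte
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Read a length-prefixed byte string (Avro 'bytes' and 'string')."""
    length = read_long(stream)
    data = stream.read(length)
    if len(data) != length:
        raise EOFError("unexpected end of stream while reading bytes")
    return data


def read_header(stream):
    """Return (metadata dict, 16-byte sync marker) from a container file."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count terminates the map
            break
        if count < 0:  # negative count is followed by the block size in bytes
            read_long(stream)
            count = -count
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    sync = stream.read(16)
    return meta, sync


def describe(path):
    """Return (codec name, parsed schema) for the file at a local path."""
    with open(path, "rb") as f:
        meta, _ = read_header(f)
    # Assumption from the spec: a missing avro.codec entry means "null",
    # i.e. the data blocks are stored uncompressed.
    codec = meta.get("avro.codec", b"null").decode("utf-8")
    schema = json.loads(meta["avro.schema"])
    return codec, schema
```

Calling describe() on a local copy of one output file (for example one
fetched with hdfs dfs -get) returns the codec recorded in the header together
with the schema, which shows whether the data blocks were written compressed.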
