I have a Hive table that is populated from an RDBMS on a daily basis.
After the MapReduce job, each mapper writes its data into the Hive table,
which is partitioned at month level.
The issue is that the job runs daily, fetches the previous day's data, and
each mapper writes its output to a separate file. Shall I merge those files in
In general, it is recommended to have millions of large files rather than
billions of small files in Hadoop.
Please describe your issue in detail. For example:
- How are you planning to consume the data stored in this partitioned table?
- Are you looking for storage and performance optimizations?
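On the small-files point above, here is a rough sketch of how Hive can be asked to merge small output files. It assumes the daily load is written through a Hive INSERT, so that Hive's own merge step applies. The table name user_events and the partition value '2015-06' are hypothetical, and the size thresholds are only illustrative.

```sql
-- Merge small files produced by map-only and map-reduce jobs.
SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
-- Run the merge step when the average output file is smaller than ~128 MB (illustrative value).
SET hive.merge.smallfiles.avgsize=134217728;
-- Aim for merged files of roughly 256 MB (illustrative value).
SET hive.merge.size.per.task=268435456;

-- For tables stored as RCFile or ORC, an existing partition can also be
-- compacted in place. Table name and partition value are hypothetical.
ALTER TABLE user_events PARTITION (month='2015-06') CONCATENATE;
```

If the mappers write files straight into the partition directory instead of going through Hive, these settings do not take effect, and the files would need to be merged by a separate step.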
It's for performance optimisation. There are 2 requirements:
1. I am going to consume the data on a daily basis: run a query on the Hive
table to fetch today's incremental data, which I got from the RDBMS, and query on that.
2. I am going to run a cumulative distinct-user query on the whole data set.
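As a rough sketch, the two requirements might look like this in HiveQL. The table user_events, the partition column month, the columns event_date and user_id, and the date values are all hypothetical names chosen for illustration.

```sql
-- Requirement 1: fetch one day's incremental data.
-- Filtering on the partition column (month) lets Hive read only that partition;
-- the day itself is filtered from within the month partition.
SELECT *
FROM user_events
WHERE month = '2015-06'
  AND event_date = '2015-06-14';

-- Requirement 2: cumulative distinct users over the whole table.
-- This scans every partition, so the number of files matters most here.
SELECT COUNT(DISTINCT user_id) AS distinct_users
FROM user_events;
```

With month-level partitions, the daily query still has to scan the whole month's files, so fewer and larger files should help both queries.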
Shall I merge output of each