large small files vs one big file in hive table

2014-05-05 Thread Shushant Arora

I have a Hive table in which data is populated from an RDBMS on a daily basis. After MapReduce, each mapper writes its data to the Hive table, which is partitioned at month level. The issue is that when the job runs each day, it fetches the previous day's data and each mapper writes its output to a separate file. Shall I merge those files in

Re: large small files vs one big file in hive table

2014-05-05 Thread Db-Blog

In general, it is recommended to have millions of large files rather than billions of small files in Hadoop. Please describe your issue in more detail. For example: - How are you planning to consume the data stored in this partitioned table? - Are you looking for storage and performance optimizations?
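To make the small-files concern concrete, here is a rough sketch of how files accumulate in one month-level partition when every daily run adds one file per mapper. The function names and the numbers (20 mappers, 30 runs, one block per file) are hypothetical and not taken from the thread. The only assumption about HDFS is that the NameNode tracks every file and every block as a separate in-memory object.

```python
# Rough sketch: file build-up in one month-level partition when each
# daily run adds one output file per mapper. All numbers are hypothetical.

def files_per_partition(mappers_per_run: int, runs_per_month: int) -> int:
    """Each run adds one output file per mapper to the same partition."""
    return mappers_per_run * runs_per_month


def namenode_objects(file_count: int, blocks_per_file: int = 1) -> int:
    """Count one NameNode object per file plus one per block of that file."""
    return file_count * (1 + blocks_per_file)


# Unmerged: 20 mappers each write a file, once a day, for 30 days.
unmerged = files_per_partition(mappers_per_run=20, runs_per_month=30)

# Merged daily: each day's mapper outputs are combined into a single file.
merged_daily = files_per_partition(mappers_per_run=1, runs_per_month=30)

print("unmerged files:", unmerged, "objects:", namenode_objects(unmerged))
print("merged files:", merged_daily, "objects:", namenode_objects(merged_daily))
```

Under these assumed numbers, merging each day's output cuts the per-partition file count by a factor equal to the mapper count, while the data volume stays the same.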

Re: large small files vs one big file in hive table

2014-05-05 Thread Shushant Arora

It's for performance optimisation. There are 2 requirements: 1. I am going to consume the data on a daily basis: run a query on the Hive table to fetch today's incremental data, which I got from the RDBMS, and query on that. 2. I am going to run a cumulative distinct-user query on the whole data set. Shall I merge output of each