Putting each document into a separate file is unlikely to work well: HDFS and MapReduce handle a small number of large files far better than millions of small ones.
On the other hand, putting them all into one file may not be what you want either. The best approach is probably a middle ground: create files that each hold many documents and are each a few gigabytes in size.

On Fri, Mar 29, 2013 at 1:15 PM, <pathu...@yahoo.com> wrote:

> If there are 1 million docs in an enterprise and we need to perform a word
> count computation on all the docs, what is the first step to be done? Is it
> to extract all the text of all the docs into a single file and then put it
> into HDFS, or to put each one separately in HDFS?
> Thanks
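
P.S. A minimal sketch of the middle ground described in the reply, assuming the text has already been extracted from each document into a local plain-text file. The function name pack_documents, the batch-NNNNN.txt naming, and the one-document-per-line layout are illustrative choices, not anything prescribed by Hadoop. Folding each document onto a single line is harmless for word counting, since only the words matter.

```python
import os


def pack_documents(doc_paths, out_dir, target_bytes=2 * 1024**3):
    """Concatenate many small text documents into a few large batch files.

    Each document is written as one line (runs of whitespace, including
    newlines, are collapsed to single spaces). A new batch file is started
    once the current one has reached target_bytes.
    """
    os.makedirs(out_dir, exist_ok=True)
    batch_paths = []
    out = None
    written = 0
    for path in doc_paths:
        # Roll over to a new batch file when the current one is full.
        if out is None or written >= target_bytes:
            if out is not None:
                out.close()
            batch_path = os.path.join(out_dir, f"batch-{len(batch_paths):05d}.txt")
            out = open(batch_path, "wb")
            batch_paths.append(batch_path)
            written = 0
        with open(path, "rb") as f:
            text = f.read()
        # One document per line: collapse all whitespace to single spaces.
        line = b" ".join(text.split()) + b"\n"
        out.write(line)
        written += len(line)
    if out is not None:
        out.close()
    return batch_paths
```

The resulting batch files could then be copied into HDFS (for example with hadoop fs -put) and used as the input directory for the word count job. The default target_bytes of about 2 GB is only a guess at "a few gigabytes"; tune it to your cluster.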