Re: merging small files in HDFS
Hi

Here are a couple more alternatives.

If the goal is *writing the least amount of code*, I'd look into using Hive. Create an external table over the dir with lots of small data files, and another external table over the dir where you want the compacted data files. Select * from one table and insert it into the other. Hive will use CombineFileInputFormat, and you don't have to subclass it to supply the record reader.

For *best performance*, I'd go for a map-only job, with an input format like NLineInputFormat, and a custom Mapper. The general idea is to have each mapper receive a number of data file *names*, and "cat" those data files explicitly. (If they're text files, you can stream the bytes raw; otherwise use an inner input format/record reader.)

Here are some details:

* List all the data files' names into a text file.
  o This is the input to the map-only job.
  o hadoop fs -ls > file-list.txt
* InputFormat:
  o You want to get as many splits as the desired number of output files.
    + The number is a tradeoff between how few files you want and how fast you want this step to be.
    + If you want 1 file, then skip to "Mapper" below.
  o If the data file sizes don't vary wildly:
    + have each split consist of k lines (where k = #input files / #output files).
  o If the data file sizes are very different, you need to override getSplits() to implement some simple bin-packing approximation algorithm to group the files such that the total size in each group is roughly the same. For instance, see https://en.wikipedia.org/wiki/Partition_problem#The_greedy_algorithm (the generalized version).
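The grouping that getSplits() would do can be sketched with the greedy heuristic from the linked article. This is an illustrative, self-contained Python version, not Hadoop code; the file names and sizes are invented for the example:

```python
import heapq

def group_files(file_sizes, k):
    """Greedily assign files to k groups so the groups' byte totals stay balanced.

    file_sizes: list of (name, size) pairs; k: desired number of output files.
    Classic greedy heuristic for the multiway partition problem: visit files
    in decreasing size order, always placing the next file into the group
    that is currently smallest.
    """
    # Min-heap of (total_size, group_index); the smallest group is always on top.
    heap = [(0, i) for i in range(k)]
    heapq.heapify(heap)
    groups = [[] for _ in range(k)]
    for name, size in sorted(file_sizes, key=lambda p: p[1], reverse=True):
        total, i = heapq.heappop(heap)
        groups[i].append(name)
        heapq.heappush(heap, (total + size, i))
    return groups

files = [("a", 700), ("b", 500), ("c", 400), ("d", 300), ("e", 100)]
print(group_files(files, 2))  # → [['a', 'd'], ['b', 'c', 'e']] (both groups total 1000)
```

In the real job, each resulting group would become one input split, so each mapper's output file ends up roughly the same size.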
* Mapper
  o The input values: Text, each the name of a data file.
  o If the data files are text files:
    + create the output file;
    + for each input value, open the data file with that name and stream it into the output file;
    + (you may need to add \n after each data file not ending in \n);
    + close the output file in Map::cleanup().
  o For arbitrary data formats:
    + you need to explicitly handle an inner input format/record reader to read from each data file;
    + for each input value (a data file name):
      # make a new conf, set mapred input dir to the data file's name;
      # have the inner input format give you a split;
      # have the inner input format give you a record reader for that split;
      # iterate over the record reader's k-v pairs, outputting them into the mapper's output;
      # (you need to set the output format appropriately).

my 2c
Gabriel Balan
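The text-file branch of that Mapper boils down to the following. This is a local-filesystem sketch in Python; a real mapper would read and write through the HDFS FileSystem API, and the file names in the usage comment are hypothetical:

```python
def cat_files(input_paths, output_path):
    """Concatenate text files, mirroring the mapper logic above: stream each
    data file into the single output file, and append a newline after any
    file that does not end in one, so records from adjacent files cannot run
    together."""
    with open(output_path, "wb") as out:
        for path in input_paths:
            with open(path, "rb") as f:
                data = f.read()  # real code would stream in chunks
            out.write(data)
            if data and not data.endswith(b"\n"):
                out.write(b"\n")

# Hypothetical usage: cat_files(["part-0.txt", "part-1.txt"], "merged.txt")
```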
Re: merging small files in HDFS
Hello Piyush,

I would typically accomplish this sort of thing by using CombineFileInputFormat, which is capable of combining multiple small files into a single input split.

http://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html

This prevents launching a huge number of map tasks, with each one performing just a little bit of work to process each small file. The job could use the standard pass-through IdentityMapper, so that output records are identical to the input records.

http://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/mapred/lib/IdentityMapper.html

The same data will be placed into a smaller number of files at the destination. The number of files can be controlled by setting the job's number of reducers. This is something you can tune toward your targeted trade-off of number of files vs. size of each file. Then, you can adjust this pattern if you have additional data preparation requirements, such as compressing the output.

I hope this helps.

--Chris
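The reducer-count knob works because each reducer writes exactly one part file. The routing of records to reducers can be illustrated like this (a Python stand-in for the idea behind Hadoop's default HashPartitioner, using CRC32 rather than Java's hashCode; the key strings are made up):

```python
import zlib

def partition(key, num_reducers):
    """Route a record's key to a reducer. Each reducer writes exactly one
    part-r-NNNNN file, so with N reducers the job emits exactly N output
    files, no matter how many small input files it read. (Illustrative
    stand-in for Hadoop's default HashPartitioner, using CRC32 instead of
    Java's hashCode.)"""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# With 3 reducers, every record lands in one of 3 output files:
keys = ["dir1/part-0", "dir1/part-1", "dir2/part-0"]
assert all(0 <= partition(k, 3) < 3 for k in keys)
```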
-- Chris Nauroth
Re: merging small files in HDFS
Hi,

Thanks for the suggestion. "hadoop fs -getmerge" is a good and simple solution for a one-time activity on a few directories. But it may have problems at scale, as this solution copies the data from HDFS to the local filesystem and then puts it back into HDFS. Also, here we have to take care of compressing and decompressing separately. We need to run this merge every hour for thousands of directories.
Re: merging small files in HDFS
Hi,

You could write a map method that just parses the input files and passes the records through, and use only one reducer, so that all the maps' output goes to that one reducer and a single file gets created, which is the merge of the input files.
Re: merging small files in HDFS
Will a key-value-based sequence file format work for you? You can keep the KEY as the name of your small file and the VALUE as its content. Sequence files can be passed as input to other jobs too. [0] is a code reference which converts many small files into a big sequence file in mapreduce fashion. [1] is a good blog post about it.

getmerge will work too, just that it will merge on the local fs and you will have to copy the result back to HDFS. It's best though if it's a one-time activity, the file count isn't huge, and you want to merge file content without caring where one file ends and the other starts.

[0] - Code snippet - https://github.com/USCDataScience/hadoop-pot/blob/master/hadoop-pot-core/src/main/java/org/pooledtimeseries/seqfile/TextVectorsToSequenceFile.java
[1] - Blog for handling small files - http://blog.cloudera.com/blog/2009/02/the-small-files-problem/

Cheers!
--
Madhav Sharan
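The filename-as-key layout is easy to prototype without Hadoop. The sketch below packs small files into a single blob of length-prefixed key/value records; it illustrates the idea only, since the real SequenceFile on-disk format and its Writer/Reader API are different, and the file names used are invented:

```python
import struct

def pack(files):
    """Pack a {name: bytes} mapping into one blob of length-prefixed
    key/value records, the way many small files can be stored as
    (filename, content) pairs in a single container file."""
    blob = b""
    for name, content in files.items():
        key = name.encode("utf-8")
        # 4-byte big-endian key length, 4-byte value length, then the bytes.
        blob += struct.pack(">II", len(key), len(content)) + key + content
    return blob

def unpack(blob):
    """Recover the {name: bytes} mapping from a packed blob."""
    files, off = {}, 0
    while off < len(blob):
        klen, vlen = struct.unpack_from(">II", blob, off)
        off += 8
        name = blob[off:off + klen].decode("utf-8")
        off += klen
        files[name] = blob[off:off + vlen]
        off += vlen
    return files

small = {"2016-11-03/part-0": b"alpha", "2016-11-03/part-1": b"beta"}
assert unpack(pack(small)) == small  # lossless round trip
```

A downstream job reading such a container gets each small file back as one (key, value) record instead of opening thousands of tiny files.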
RE: merging small files in HDFS
Can't we use getmerge here? If your requirement is to merge some files in a particular directory to a single file:

hadoop fs -getmerge

--Senthil

-
To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
For additional commands, e-mail: user-h...@hadoop.apache.org
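For reference, the full syntax is `hadoop fs -getmerge <src-dir> <local-dst>`. Its effect can be sketched on the local filesystem like this (an illustrative Python version, not the real implementation, which reads the source files from HDFS):

```python
import os

def getmerge(src_dir, local_dst):
    """Sketch of `hadoop fs -getmerge`: concatenate every file found in
    src_dir, in sorted name order, into the single local file local_dst."""
    with open(local_dst, "wb") as out:
        for name in sorted(os.listdir(src_dir)):
            path = os.path.join(src_dir, name)
            if os.path.isfile(path):
                with open(path, "rb") as f:
                    out.write(f.read())
```

As Piyush notes elsewhere in the thread, the real command pulls the data to the local machine, so the merged result has to be copied back into HDFS afterwards.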
Re: merging small files in HDFS
Hi,

If I correctly understand your request, you need only to merge some data resulting from an HDFS write operation. In this case, I suppose that your best option is to use Hadoop Streaming with the 'cat' command.

Take a look here:
https://hadoop.apache.org/docs/r1.2.1/streaming.html

Regards
merging small files in HDFS
Hi,

I want to merge multiple files in one HDFS dir into one file. I am planning to write a map-only job using an input format which will create only one InputSplit per dir. This way my job doesn't need to do any shuffle/sort (only read and write back to disk). Is there any such input format already implemented? Or is there a better solution for the problem?

thanks.