Re: merging small files in HDFS

2017-01-09 Thread Gabriel Balan

Hi,

Here are a couple more alternatives.

If the goal is writing the least amount of code, I'd look into using Hive.
Create an external table over the dir with lots of small data files, and
another external table over the dir where I want the compacted data files.
Then select * from one table and insert it into the other.

   Hive will use CombineFileInputFormat, and you don't have to subclass it to
supply the record reader.
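
As a minimal sketch, here is that route driven through the Hive JDBC interface
(the HiveServer2 URL, table names, locations, and the one-column schema are
illustrative placeholders, not something from this thread); the same three
statements can equally be run from the hive or beeline shell:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CompactWithHive {
        public static void main(String[] args) throws Exception {
            // Assumes the hive-jdbc driver is on the classpath and HiveServer2
            // is reachable at this (made-up) address.
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                         "jdbc:hive2://hiveserver2-host:10000/default", "hive", "");
                 Statement stmt = conn.createStatement()) {

                // External table over the directory full of small files.
                stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS small_files (line STRING) "
                        + "STORED AS TEXTFILE LOCATION '/data/incoming/small'");

                // External table over the directory that should hold the compacted files.
                stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS compacted (line STRING) "
                        + "STORED AS TEXTFILE LOCATION '/data/compacted'");

                // Hive rewrites the same data into far fewer, larger files.
                stmt.execute("INSERT OVERWRITE TABLE compacted SELECT * FROM small_files");
            }
        }
    }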


For best performance, I'd go for a map-only job, with an input format like
NLineInputFormat and a custom Mapper. The general idea is to have each mapper
receive a number of data file *names* and "cat" those data files explicitly
(if they're text files, you can stream the bytes raw; otherwise use an inner
input format/record reader).

Here are some details:

 * List all the data files' names into a text file.
   o This is the input to the map-only job:
     hadoop fs -ls <input dir> > file-list.txt

 * InputFormat:
   o You want to get as many splits as the desired number of output files.
     + The number is a tradeoff between how few files you want and how fast
       you want this step to be.
     + If you want 1 file, then skip to "Mapper" below.
   o If the data file sizes don't vary wildly,
     + have each split consist of k lines (where k = #input files / #output
       files).
   o If the data file sizes are very different, you need to override
     getSplits() to implement a simple bin-packing approximation algorithm
     that groups the files so that the total size in each group is roughly
     the same (see the first sketch after this list). For instance, see
     https://en.wikipedia.org/wiki/Partition_problem#The_greedy_algorithm
     (the generalized version).

 * Mapper
   o The input values are Text, each the name of a data file.
   o If the data files are text files (see the second sketch after this list):
     + create the output file;
     + for each input value, open the data file with that name and stream it
       into the output file;
     + (you may need to add \n after each data file not ending in \n);
     + close the output file in Mapper::cleanup().
   o For arbitrary data formats:
     + you need to explicitly handle an inner input format/record reader to
       read from each data file;
     + for each input value (a data file name),
       # make a new conf and set the mapred input dir to the data file's name,
       # have the inner input format give you a split,
       # have the inner input format give you a record reader for that split,
       # iterate over the record reader's k-v pairs, outputting them into the
         mapper's output,
       # (you need to set the output format appropriately).
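
First sketch: the greedy grouping as a standalone helper (the class and method
names are made up, and it is not tied to any particular InputFormat; a custom
getSplits() could call something like this to turn file-name/size pairs into
split groups):

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.Map;
    import java.util.PriorityQueue;

    final class GreedyFileGrouper {

        /** A group of file names plus its running total size in bytes. */
        static final class Group {
            final List<String> files = new ArrayList<>();
            long totalBytes = 0;
        }

        /**
         * Greedy approximation of the partition problem ("largest item first"):
         * sort files by size descending and always add the next file to the
         * currently smallest group, so group totals end up roughly equal.
         */
        static List<Group> group(Map<String, Long> fileSizes, int numGroups) {
            List<Map.Entry<String, Long>> bySizeDesc = new ArrayList<>(fileSizes.entrySet());
            bySizeDesc.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));

            PriorityQueue<Group> smallestFirst =
                    new PriorityQueue<>(Comparator.comparingLong((Group g) -> g.totalBytes));
            for (int i = 0; i < numGroups; i++) {
                smallestFirst.add(new Group());
            }
            for (Map.Entry<String, Long> file : bySizeDesc) {
                Group g = smallestFirst.poll();   // currently smallest group
                g.files.add(file.getKey());
                g.totalBytes += file.getValue();
                smallestFirst.add(g);             // re-insert with its new total
            }
            return new ArrayList<>(smallestFirst);
        }
    }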

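Second sketch: the map-only "cat" job for the text-file case, assuming
file-list.txt holds one file path per line (the merge.output.dir property, the
output file naming, and the lines-per-split value are illustrative; each map
task writes one merged file straight to HDFS and the job itself emits nothing):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class CatFilesJob {

        /** Each input value is the path of one small file; cat it into this task's merged file. */
        public static class CatFilesMapper
                extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

            private FSDataOutputStream out;

            @Override
            protected void setup(Context context) throws IOException {
                Configuration conf = context.getConfiguration();
                Path outFile = new Path(conf.get("merge.output.dir"),   // made-up property
                        "merged-" + context.getTaskAttemptID().getTaskID().getId());
                out = FileSystem.get(conf).create(outFile);
            }

            @Override
            protected void map(LongWritable offset, Text fileName, Context context)
                    throws IOException {
                Path src = new Path(fileName.toString().trim());
                FileSystem fs = src.getFileSystem(context.getConfiguration());
                try (FSDataInputStream in = fs.open(src)) {
                    IOUtils.copyBytes(in, out, 4096, false);   // stream the raw bytes
                }
                out.write('\n');   // simplification: always add \n between files
            }

            @Override
            protected void cleanup(Context context) throws IOException {
                out.close();
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("merge.output.dir", args[1]);
            Job job = Job.getInstance(conf, "cat-small-files");
            job.setJarByClass(CatFilesJob.class);

            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.addInputPath(job, new Path(args[0]));   // file-list.txt from above
            NLineInputFormat.setNumLinesPerSplit(job, 1000);         // k file names per mapper

            job.setMapperClass(CatFilesMapper.class);
            job.setNumReduceTasks(0);                                // map-only
            job.setOutputFormatClass(NullOutputFormat.class);        // mappers write their own files

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }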

my 2c

Gabriel Balan



Re: merging small files in HDFS

2016-12-30 Thread Chris Nauroth
Hello Piyush,

I would typically accomplish this sort of thing by using
CombineFileInputFormat, which is capable of combining multiple small files
into a single input split.

http://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/mapreduce/lib/input/CombineFileInputFormat.html

This prevents launching a huge number of map tasks with each one performing
just a little bit of work to process each small file.  The job could use
the standard pass-through IdentityMapper, so that output records are
identical to the input records.

http://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/mapred/lib/IdentityMapper.html

The same data will be placed into a smaller number of files at the
destination.  The number of files can be controlled by setting the job's
number of reducers.  This is something you can tune toward your targeted
trade-off of number of files vs. size of each file.

Then, you can adjust this pattern if you have additional data preparation
requirements such as compressing the output.
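
A minimal sketch of that pattern, with one adaptation: instead of the old-API
IdentityMapper, a tiny mapper emits each line as the key so the byte-offset
keys are dropped from the output (paths, split size, and reducer count are
illustrative; note the shuffle will reorder lines):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MergeSmallFiles {

        /** Emits each line as the key so the byte-offset key is dropped. */
        public static class LineMapper
                extends Mapper<LongWritable, Text, Text, NullWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(line, NullWritable.get());
            }
        }

        /** Pass-through reducer; duplicate lines are preserved. */
        public static class LineReducer
                extends Reducer<Text, NullWritable, Text, NullWritable> {
            @Override
            protected void reduce(Text line, Iterable<NullWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                for (NullWritable v : values) {
                    ctx.write(line, v);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "merge-small-files");
            job.setJarByClass(MergeSmallFiles.class);

            // Combine many small files into each input split, ~256 MB apiece.
            job.setInputFormatClass(CombineTextInputFormat.class);
            CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
            CombineTextInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.setMapperClass(LineMapper.class);
            job.setReducerClass(LineReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);
            job.setNumReduceTasks(4);   // number of output files vs. size of each file

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }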

I hope this helps.

--Chris



-- 
Chris Nauroth


Re: merging small files in HDFS

2016-11-03 Thread Piyush Mukati
Hi,
Thanks for the suggestion.
"hadoop fs -getmerge" is a good and simple solution for a one-time activity
on a few directories.
But it may have problems at scale, as this solution copies the data from HDFS
to the local filesystem and then puts it back to HDFS.
Also, here we have to take care of compressing and decompressing separately.
We need to run this merge every hour for thousands of directories.





Re: merging small files in HDFS

2016-11-03 Thread dileep kumar
Hi,

You could write a map method that just parses the input files and passes the
records to the reducer, and use only one reducer, so that all map output goes
to a single reducer and one file gets created, which is the merge of the
input files.
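
A minimal sketch of that, assuming plain text input (paths are placeholders;
the stock Reducer class already passes records through, and the single reducer
funnels everything into exactly one output file, with lines coming out sorted):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SingleReducerMerge {

        /** Forwards each line as the key, dropping the byte-offset key. */
        public static class PassThroughMapper
                extends Mapper<LongWritable, Text, Text, NullWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                ctx.write(line, NullWritable.get());
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "single-reducer-merge");
            job.setJarByClass(SingleReducerMerge.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.setMapperClass(PassThroughMapper.class);
            // No setReducerClass: the default Reducer is a pass-through,
            // and one reduce task means exactly one output file.
            job.setNumReduceTasks(1);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(NullWritable.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }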



Re: merging small files in HDFS

2016-11-03 Thread Madhav Sharan
Will a key-value based SequenceFile format work for you? You can keep the KEY
as the name of your small file and the VALUE as its content. Sequence files
can be passed as input to other jobs too.
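
A minimal non-MapReduce sketch of that layout (paths are placeholders): read
every file under one HDFS directory and append it to a SequenceFile keyed by
file name:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackIntoSequenceFile {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path inputDir = new Path("/data/small-files");       // placeholder
            Path outFile = new Path("/data/packed/files.seq");   // placeholder

            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(outFile),
                    SequenceFile.Writer.keyClass(Text.class),
                    SequenceFile.Writer.valueClass(BytesWritable.class))) {
                for (FileStatus status : fs.listStatus(inputDir)) {
                    if (status.isDirectory()) {
                        continue;                                 // skip subdirectories
                    }
                    byte[] content = new byte[(int) status.getLen()];
                    try (FSDataInputStream in = fs.open(status.getPath())) {
                        IOUtils.readFully(in, content, 0, content.length);
                    }
                    // Key = file name, value = the file's raw bytes.
                    writer.append(new Text(status.getPath().getName()),
                                  new BytesWritable(content));
                }
            }
        }
    }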

[0] is a code reference which converts many small files into a big sequence
file in MapReduce fashion. [1] is a good blog post about it.

getmerge will work too, just that it will merge on the local FS and you will
have to copy the result back to HDFS. It's best suited though if it's a
one-time activity, the file count isn't huge, and you want to merge file
content without needing to know where one file ends and the other starts.

[0] - Code snippet -
https://github.com/USCDataScience/hadoop-pot/blob/master/hadoop-pot-core/src/main/java/org/pooledtimeseries/seqfile/TextVectorsToSequenceFile.java

[1] - Blog for handling small files -
http://blog.cloudera.com/blog/2009/02/the-small-files-problem/

Cheers!

--
Madhav Sharan




RE: merging small files in HDFS

2016-11-03 Thread kumar, Senthil(AWF)
Can't we use getmerge here? If your requirement is to merge some files in a
particular directory into a single file:

hadoop fs -getmerge <src> <localdst>

--Senthil



Re: merging small files in HDFS

2016-11-03 Thread Giovanni Mascari

Hi,
if I understand your request correctly, you only need to merge some data
resulting from an HDFS write operation.
In that case, I suppose your best option is to use Hadoop Streaming with the
'cat' command.

Take a look here:
https://hadoop.apache.org/docs/r1.2.1/streaming.html

Regards




merging small files in HDFS

2016-11-03 Thread Piyush Mukati
Hi,
I want to merge multiple files in one HDFS dir into one file. I am planning
to write a map-only job using an input format which will create only one
InputSplit per dir.
This way my job doesn't need to do any shuffle/sort (only read and write back
to disk).
Is there any such input format already implemented?
Or is there a better solution for the problem?

Thanks.