The issue described here can probably be solved by specifying an appropriate number of map tasks and providing custom input splits.
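
To make that concrete, a custom InputFormat could pack many small files into each split. Below is a rough sketch of just the packing step; the class name and the packing policy are made up for illustration, and the RecordReader plumbing is omitted.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: group small files so that each group can back one
// input split (and hence one map task) instead of one split per file.
public class PackedSplitter {
  public static List<List<Path>> pack(FileSystem fs, Path dir, long maxBytes)
      throws IOException {
    List<List<Path>> groups = new ArrayList<List<Path>>();
    List<Path> current = new ArrayList<Path>();
    long currentBytes = 0;
    for (FileStatus stat : fs.listStatus(dir)) {
      if (stat.isDir()) continue;                  // only plain files
      // Close the current group once this file would push it past the limit.
      if (!current.isEmpty() && currentBytes + stat.getLen() > maxBytes) {
        groups.add(current);
        current = new ArrayList<Path>();
        currentBytes = 0;
      }
      current.add(stat.getPath());
      currentBytes += stat.getLen();
    }
    if (!current.isEmpty()) {
      groups.add(current);
    }
    return groups;
  }
}

Each group would then back one split, and hence one map task, so thousands of 4-5 KB files would cost a handful of tasks instead of thousands.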

However, I'd suggest implementing a tool that supports the following operation on DFS files:
concatenate several DFS files into a single one.
An option would specify whether this is done
-- destructively (the blocks of the files do not change, they are just re-linked into a single file), or
-- non-destructively (the data is copied into a new file, possibly with a different block size).
Applied to a single file, this operation can be used to change the block size. Applied to a whole directory, it can turn the output of a map-reduce job into a single file without running another job.
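
For the non-destructive case, a minimal sketch is just a client-side copy through the FileSystem API; the tool name and argument handling below are invented for illustration.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical tool: concatenate every file under args[0] into the single
// file args[1], copying the bytes (the non-destructive option).
public class DfsConcat {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path srcDir = new Path(args[0]);   // e.g. a map-reduce output directory
    Path dstFile = new Path(args[1]);  // merged result; gets the default
                                       // block size of the file system
    FSDataOutputStream out = fs.create(dstFile);
    byte[] buf = new byte[64 * 1024];
    for (FileStatus stat : fs.listStatus(srcDir)) {
      if (stat.isDir()) continue;      // skip subdirectories
      FSDataInputStream in = fs.open(stat.getPath());
      int n;
      while ((n = in.read(buf)) > 0) {
        out.write(buf, 0, n);          // append this file's bytes
      }
      in.close();
    }
    out.close();
  }
}

The destructive variant (re-linking existing blocks into one file) would need support inside the namenode, so it cannot be sketched from the client side.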

The latter is quite a common operation. I usually do a DFS -getmerge followed by a DFS -put. Quite ugly.

On Mar 1, 2007, at 10:22 AM, Johan Oskarson (JIRA) wrote:

Add more than one input file per map?
-------------------------------------

                 Key: HADOOP-1054
                 URL: https://issues.apache.org/jira/browse/HADOOP-1054
             Project: Hadoop
          Issue Type: Improvement
          Components: mapred
    Affects Versions: 0.11.2
            Reporter: Johan Oskarson
            Priority: Trivial


I've got a problem with map-reduce overhead when it comes to small input files.

Roughly 100 MB comes into the DFS every few hours, and data related to that batch might be added on for another few weeks afterwards. The problem is that this data is roughly 4-5 KB per file, so for every reasonably big file we might have 4-5 small ones.

As far as I understand it, each small file will get assigned a map task of its own. This causes performance issues, since the per-task overhead for such small files is pretty big.

Would it be possible to have Hadoop assign multiple files to a single map task, up to a configurable limit?
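
For illustration, such a limit could be a job configuration property that a packing InputFormat (like the one sketched at the top of this message) honors; the property name below is made up.

import org.apache.hadoop.mapred.JobConf;

// Fragment: a hypothetical knob capping the total bytes packed into one split.
JobConf job = new JobConf();
job.setLong("mapred.pack.split.max.bytes", 64L * 1024 * 1024);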


