Hi.

I used a MapFile as suggested, and it is a lot more efficient and clean. One
minor open question is which separator to use between the fields; I used
ctrl-a (ASCII 001) in the test case and that worked very well.
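In case it helps anyone searching the archives, the separator handling itself is trivial in plain Java. A minimal sketch (the field values are just made up for illustration) joining and splitting on ctrl-a, which almost never appears in real field data (it is also Hive's default field delimiter):

```java
public class CtrlASeparator {
    // ASCII 001 (ctrl-a) as a one-character string.
    private static final String SEP = "\u0001";

    // Join fields into a single record string.
    public static String join(String... fields) {
        return String.join(SEP, fields);
    }

    // Split a record back into fields; the -1 limit keeps
    // trailing empty fields instead of silently dropping them.
    public static String[] split(String record) {
        return record.split(SEP, -1);
    }

    public static void main(String[] args) {
        String record = join("url", "title", "2009-07-10");
        String[] fields = split(record);
        System.out.println(fields.length); // 3
    }
}
```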

Cheers

//Marcus


On Fri, Jul 10, 2009 at 7:06 PM, Ted Dunning <[email protected]> wrote:

> On Fri, Jul 10, 2009 at 1:16 AM, Marcus Herou <[email protected]
> >wrote:
>
> >
> > However I am sure that we have more keys than that in our production data
> > so
> > I guess hadoop will throw the "Too many open files" exception then.
>
>
> Generally having lots of small files is very bad for performance.  It
> sounds
> like you are headed that direction.

Very true, and probably especially true on HDFS, but performance was not what
I was after here; it was just the grouping of the key/values in some cases,
which a MapFile can solve.

>
>
> Consider spilling your data into a Mapfile, hbase or Voldemort.  That would
> allow you to access your data by key much as you would use a file name with
> multiple output files.  Make sure you try hbase 0.20 for performance
> reasons.

MapFile seems like the best choice, actually, now that I think of it. I am
only going to store this kind of data as an intermediate format for later
sequential processing, which makes a MapFile even more sane.
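For the archives, a rough sketch of the write/read cycle I have in mind, against the 0.20-era API (the path and key/value contents are placeholders, and I have not run this exact snippet):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        String dir = "/tmp/grouped-data"; // placeholder path

        // Keys MUST be appended in sorted order; the writer
        // throws an IOException otherwise.
        MapFile.Writer writer =
            new MapFile.Writer(conf, fs, dir, Text.class, Text.class);
        writer.append(new Text("key1"), new Text("a\u0001b\u0001c"));
        writer.append(new Text("key2"), new Text("d\u0001e"));
        writer.close();

        // Later, during the sequential pass, values can be
        // looked up by key without holding many files open.
        MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
        Text value = new Text();
        reader.get(new Text("key2"), value);
        reader.close();
    }
}
```

The sorted-key requirement falls out naturally here since the reducer already sees its keys in sorted order.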

>
>
>
> > I guess it is due to open/close stream efficiency that all streams are
> held
> > open but I think that one can be tweaked to be more flexible.
>
>
> This is also done because of the limitations on semantics that HDFS
> imposes.  Files can only be written once.  Append is still in the future.

Ah, of course. I have seen that feature request a few times on the mailing list.

>
>
> But aren't you grouping by your key in your reduce?  If so, you can close
> each file as you finish processing the reduce group.

Yep I am, but I think I'll just try a MapFile.

>
>
> If you aren't grouping by your key, why not?  Run another step of MR and
> the
> problem of too many open files will disappear completely.  That won't fix
> the architectural problem of storing your data in lots of little files,
> though.
>
>
> > Input ? Perhaps point me in the right direction and I can submit a
> "patch"
> > writing this myself.
>
>
> I think that this is the wrong approach because it will give you a
> non-scalable system and is going to be difficult to do well because your
> can't re-open files.  HDFS file names are not a good substitute for a
> database because file lookup cannot be parallelized.

Yep, I also think it is the wrong approach, at least in terms of performance.
The HDFS files were only an intermediate format which another job later
parsed and inserted into a DB.

>
>
> BUT ... if you think you can make the change in a way useful to others, the
> process is very simple.  File an issue on JIRA, then attach a patch.
>  People
> will comment on the patch and the automated test system will help you think
> about how to make it better.  If you can convince the committers of the
> utility of the patch, you are in.  Convincing them that contributions are
> useful and safe is easier if you put your changes into the contrib rather
> than trying to make the changes in core.



>
>
> See here for more info:  http://wiki.apache.org/hadoop/HowToContribute
>
> Be aware that Hadoop just splintered into several sub-projects due to the
> rate of contributions and discussion.

Haha yes



-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
[email protected]
http://www.tailsweep.com/
