On Fri, Jul 10, 2009 at 1:16 AM, Marcus Herou <[email protected]> wrote:
> However I am sure that we have more keys than that in our production data
> so I guess hadoop will throw the "Too many open files" exception then.

Generally, having lots of small files is very bad for performance, and it
sounds like you are headed in that direction. Consider spilling your data
into a MapFile, HBase, or Voldemort instead. That would let you access your
data by key much as you would use a file name with multiple output files.
Make sure you try HBase 0.20 for performance reasons.

> I guess it is due to open/close stream efficiency that all streams are held
> open but I think that one can be tweaked to be more flexible.

This is also done because of the limitations on semantics that HDFS imposes.
Files can only be written once; append is still in the future.

But aren't you grouping by your key in your reduce? If so, you can close
each file as you finish processing the reduce group. If you aren't grouping
by your key, why not? Run another step of MR and the problem of too many
open files will disappear completely. That won't fix the architectural
problem of storing your data in lots of little files, though.

> Input? Perhaps point me in the right direction and I can submit a "patch"
> writing this myself.

I think that this is the wrong approach because it will give you a
non-scalable system, and it is going to be difficult to do well because you
can't re-open files. HDFS file names are not a good substitute for a
database because file lookup cannot be parallelized.

BUT ... if you think you can make the change in a way that is useful to
others, the process is very simple: file an issue on JIRA, then attach a
patch. People will comment on the patch, and the automated test system will
help you think about how to make it better. If you can convince the
committers of the utility of the patch, you are in. Convincing them that
contributions are useful and safe is easier if you put your changes into
contrib rather than trying to make the changes in core.
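To illustrate the "close each file as you finish the reduce group" idea:
since reduce input arrives sorted by key, you only ever need one output
stream open at a time. The sketch below is hypothetical and uses plain
java.io rather than the Hadoop API so it stands alone; a real reducer
using the same pattern would swap the BufferedWriter for whatever output
stream it writes per key.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class GroupedWriter {
    // Write one file per key. Input must be sorted by key, as reduce
    // input is, so a group boundary is just a change of key.
    public static List<Path> writeGroups(Path outDir,
                                         List<Map.Entry<String, String>> sorted)
            throws IOException {
        Files.createDirectories(outDir);
        List<Path> written = new ArrayList<>();
        BufferedWriter writer = null;
        String currentKey = null;
        for (Map.Entry<String, String> e : sorted) {
            if (!e.getKey().equals(currentKey)) {
                if (writer != null) {
                    writer.close();           // close the previous group's file
                }
                currentKey = e.getKey();
                Path p = outDir.resolve(currentKey + ".out");
                writer = Files.newBufferedWriter(p);
                written.add(p);
            }
            writer.write(e.getValue());
            writer.newLine();
        }
        if (writer != null) {
            writer.close();                   // close the last group's file
        }
        return written;
    }
}
```

At any moment exactly one file handle is open, no matter how many keys
there are, which is why grouping by key makes the "Too many open files"
problem go away.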
See here for more info: http://wiki.apache.org/hadoop/HowToContribute Be aware that Hadoop just splintered into several sub-projects due to the rate of contributions and discussion.
