I am assuming the first job outputs multiple files and that the second (a
map-reduce job, I presume) will assign the output intended for a single
file to a single reducer (in some cases multiple output files might be
supported, one per reducer). One issue is how to allow the reducer to
write to some 'external file system', i.e. not HDFS or the instance's
local file system, but S3 on Amazon or some mounted NFS system on a
stand-alone cluster.
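Something along these lines could serve as that second pass. This is only
a minimal sketch against the new mapreduce API, assuming the first job
wrote tab-separated text; the class name ConsolidateJob and the s3n://
destination are illustrative, not anything tested here:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ConsolidateJob {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "consolidate");
    job.setJarByClass(ConsolidateJob.class);

    // The base Mapper and Reducer classes are identity pass-throughs,
    // so the job only reshuffles the existing records.
    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);

    // One reduce task means exactly one output file (part-r-00000).
    job.setNumReduceTasks(1);

    // Reads key TAB value lines, as TextOutputFormat wrote them.
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // args[0]: the first job's output directory (the part-r-* files).
    FileInputFormat.addInputPath(job, new Path(args[0]));
    // args[1]: any file system the cluster can reach by URI, e.g.
    // s3n://bucket/merged on EC2, if the credentials are configured
    // (an assumption, not something verified here).
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note the single reducer re-sorts the records by key and funnels all of
the data through one task, so this only makes sense when the output is,
as noted below, much smaller than the input.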
On Oct 23, 2010, at 3:32 PM, "M. C. Srivas" <mcsri...@gmail.com> wrote:

> Not with HDFS, since only one process may write to a single file (and
> it's not visible until the file is closed). In fact, it's worse than
> that: the same process that's writing that file cannot see it or read
> it until after it's done.
>
> If you have multiple reducers, you are simply out of luck and will have
> to run a separate "job" to copy the data out.
>
> On Sat, Oct 23, 2010 at 3:08 PM, Steve Lewis <lordjoe2...@gmail.com> wrote:
>
>> Once I run a map-reduce job I get output in the form of
>> part-r-00000, part-r-00001, ...
>>
>> In many cases the output is significantly smaller than the original
>> input - take the classic word count.
>>
>> In most cases I want to combine the output into a single file that may
>> well not live on HDFS but on a more accessible file system.
>>
>> Are there standard libraries or approaches for consolidating reducer
>> output?
>>
>> A second map-reduce job taking the output directory as an input is an
>> OK start, but as output there needs to be a single reducer that writes
>> a real file and not reduce output.
>>
>> Are there standard libraries or approaches to this?
>>
>> --
>> Steven M. Lewis PhD
>> 4221 105th Ave Ne
>> Kirkland, WA 98033
>> 206-384-1340 (cell)
>> Institute for Systems Biology
>> Seattle WA

--
Steven M. Lewis PhD
4221 105th Ave Ne
Kirkland, WA 98033
206-384-1340 (cell)
Institute for Systems Biology
Seattle WA
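For the simpler case where the merged result only needs to land on the
local (or NFS-mounted) file system, the stock shell command

  hadoop fs -getmerge <job-output-dir> <local-file>

already concatenates the part files, and the same operation is exposed
programmatically as FileUtil.copyMerge. A minimal sketch, with the class
name MergeParts and the URI-typed second argument as illustrative
assumptions:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Concatenates every part-r-* file under args[0] (on the default file
// system, normally HDFS) into the single file named by args[1], whose
// scheme selects the destination file system: file:/// for a local or
// NFS mount, s3n:// on Amazon if credentials are configured (again an
// assumption, not verified here).
public class MergeParts {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem srcFs = FileSystem.get(conf);
    FileSystem dstFs = FileSystem.get(URI.create(args[1]), conf);
    FileUtil.copyMerge(srcFs, new Path(args[0]),
                       dstFs, new Path(args[1]),
                       false,   // keep the source part files
                       conf,
                       null);   // no separator written between parts
  }
}

On HDFS the directory listing comes back in name order, so the merged
file should preserve the part-r-00000, part-r-00001, ... ordering.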