The problem isn't too many duplicate edges but the fact that, as the algorithm progresses, the number of edges adjacent to the minimum vertexes increases. In the worst case the graph is completely connected and has only a single component. So far I haven't found an efficient way to partition the edges of a single vertex while still passing the minimum vertex on to each partition.
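To make the idea concrete, here is a minimal Python sketch (not the actual Hadoop job) of one iteration of min-vertex propagation plus sort-based duplicate removal; the helper names map_phase/reduce_phase/dedupe are illustrative, and edges are assumed to be undirected (u, v) pairs:

```python
from collections import defaultdict

def map_phase(edges):
    """Emit each edge under both endpoints so every vertex sees its neighbours."""
    for u, v in edges:
        yield u, v
        yield v, u

def reduce_phase(grouped):
    """Label each vertex with the minimum vertex in its neighbourhood (itself included)."""
    return {vertex: min([vertex] + neighbours)
            for vertex, neighbours in grouped.items()}

def iterate(edges):
    """One map/shuffle/reduce round of min-vertex propagation."""
    grouped = defaultdict(list)
    for key, value in map_phase(edges):
        grouped[key].append(value)
    return reduce_phase(grouped)

def dedupe(edges):
    """Canonicalize and sort the edge list; duplicates collapse trivially."""
    return sorted(set(tuple(sorted(e)) for e in edges))

edges = [(4, 2), (2, 4), (2, 1), (3, 4)]
print(dedupe(edges))           # -> [(1, 2), (2, 4), (3, 4)]
print(iterate(dedupe(edges)))  # each vertex mapped to the min vertex it has seen
```

The sketch also shows the difficulty described above: a vertex's whole neighbour list lands on one reducer key, so a very high-degree minimum vertex cannot be split across partitions without extra machinery.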
The twitter dataset has approximately 44 million vertexes. Even with a graph this large, the time needed to process 44 million edges isn't too long. One advantage of the sorting approach is that duplicate edges become trivial to detect. Once we are able to process the twitter dataset we can work on further optimizations.

Thanks,

Neal

On Thu, Jun 10, 2010 at 5:31 PM, Ted Dunning <[email protected]> wrote:

> Would a combiner help?
>
> On Thu, Jun 10, 2010 at 5:27 PM, Neal Clark <[email protected]> wrote:
>
> > Using this approach it should be possible to do the bulk of the
> > algorithm in the Mapper and then use the SecondarySort Reducer to sort the
> > output.
> >
