[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components
[ https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16035065#comment-16035065 ] ASF GitHub Bot commented on GIRAPH-1148: Github user asfgit closed the pull request at: https://github.com/apache/giraph/pull/39 > Connected components - make calculate sizes work with large number of > components > > > Key: GIRAPH-1148 > URL: https://issues.apache.org/jira/browse/GIRAPH-1148 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo > > Currently if we have a graph with large number of connected components, > calculating connected components sizes fails because reducer becomes too > large. Use array of handles instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components
[ https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033823#comment-16033823 ] ASF GitHub Bot commented on GIRAPH-1148: Github user dlogothetis commented on a diff in the pull request: https://github.com/apache/giraph/pull/39#discussion_r119745185 --- Diff: giraph-block-app-8/src/main/java/org/apache/giraph/block_app/library/prepare_graph/UndirectedConnectedComponents.java --- @@ -352,10 +352,15 @@ Block calculateConnectedComponentSizes( PaircomponentToReducePair = Pair.of( new LongWritable(), new LongWritable(1)); LongWritable reusableLong = new LongWritable(); -return Pieces.reduceAndBroadcast( -"CalcConnectedComponentSizes", +// This reduce operation is stateless so we can use a single instance +BasicMapReduce reduceOperation = new BasicMapReduce<>( -LongTypeOps.INSTANCE, LongTypeOps.INSTANCE, SumReduce.LONG), +LongTypeOps.INSTANCE, LongTypeOps.INSTANCE, SumReduce.LONG); +return Pieces.reduceAndBroadcastWithArrayOfHandles( +"CalcConnectedComponentSizes", +3137, /* Just using some large prime number */ --- End diff -- Sounds good. Looks good then. > Connected components - make calculate sizes work with large number of > components > > > Key: GIRAPH-1148 > URL: https://issues.apache.org/jira/browse/GIRAPH-1148 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo > > Currently if we have a graph with large number of connected components, > calculating connected components sizes fails because reducer becomes too > large. Use array of handles instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components
[ https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033817#comment-16033817 ] ASF GitHub Bot commented on GIRAPH-1148: Github user majakabiljo commented on a diff in the pull request: https://github.com/apache/giraph/pull/39#discussion_r119744463 --- Diff: giraph-block-app-8/src/main/java/org/apache/giraph/block_app/library/prepare_graph/UndirectedConnectedComponents.java --- @@ -352,10 +352,15 @@ Block calculateConnectedComponentSizes( PaircomponentToReducePair = Pair.of( new LongWritable(), new LongWritable(1)); LongWritable reusableLong = new LongWritable(); -return Pieces.reduceAndBroadcast( -"CalcConnectedComponentSizes", +// This reduce operation is stateless so we can use a single instance +BasicMapReduce reduceOperation = new BasicMapReduce<>( -LongTypeOps.INSTANCE, LongTypeOps.INSTANCE, SumReduce.LONG), +LongTypeOps.INSTANCE, LongTypeOps.INSTANCE, SumReduce.LONG); +return Pieces.reduceAndBroadcastWithArrayOfHandles( +"CalcConnectedComponentSizes", +3137, /* Just using some large prime number */ --- End diff -- I can't come up with a reason why someone would want to change it. This can start having problems only at trillion components which wouldn't work for many other reasons, for tiny ones this few reducers won't add any overhead, and for larger ones which were currently working this is still improvement since reducers are processed on many machines now. > Connected components - make calculate sizes work with large number of > components > > > Key: GIRAPH-1148 > URL: https://issues.apache.org/jira/browse/GIRAPH-1148 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo > > Currently if we have a graph with large number of connected components, > calculating connected components sizes fails because reducer becomes too > large. Use array of handles instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components
[ https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033539#comment-16033539 ] ASF GitHub Bot commented on GIRAPH-1148: Github user dlogothetis commented on a diff in the pull request: https://github.com/apache/giraph/pull/39#discussion_r119708329 --- Diff: giraph-block-app-8/src/main/java/org/apache/giraph/block_app/library/prepare_graph/UndirectedConnectedComponents.java --- @@ -352,10 +352,15 @@ Block calculateConnectedComponentSizes( PaircomponentToReducePair = Pair.of( new LongWritable(), new LongWritable(1)); LongWritable reusableLong = new LongWritable(); -return Pieces.reduceAndBroadcast( -"CalcConnectedComponentSizes", +// This reduce operation is stateless so we can use a single instance +BasicMapReduce reduceOperation = new BasicMapReduce<>( -LongTypeOps.INSTANCE, LongTypeOps.INSTANCE, SumReduce.LONG), +LongTypeOps.INSTANCE, LongTypeOps.INSTANCE, SumReduce.LONG); +return Pieces.reduceAndBroadcastWithArrayOfHandles( +"CalcConnectedComponentSizes", +3137, /* Just using some large prime number */ --- End diff -- Should this be configurable? > Connected components - make calculate sizes work with large number of > components > > > Key: GIRAPH-1148 > URL: https://issues.apache.org/jira/browse/GIRAPH-1148 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo > > Currently if we have a graph with large number of connected components, > calculating connected components sizes fails because reducer becomes too > large. Use array of handles instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components
[ https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033404#comment-16033404 ] ASF GitHub Bot commented on GIRAPH-1148: Github user majakabiljo commented on a diff in the pull request: https://github.com/apache/giraph/pull/39#discussion_r119686989 --- Diff: giraph-block-app/src/main/java/org/apache/giraph/block_app/library/Pieces.java --- @@ -320,6 +325,91 @@ public String toString() { } /** + * Like reduceAndBroadcast, but uses array of handles for reducers and + * broadcasts, to make it feasible and performant when values are large. + * Each supplied value to reduce will be reduced in the handle defined by + * handleHashSupplier%numHandles + * + * @param Single value type, objects passed on workers + * @param Reduced value type + * @param Vertex id type + * @param Vertex value type + * @param Edge value type + */ + public static ++ Piece reduceAndBroadcastWithArrayOfHandles( + final String name, + final int numHandles, + final ReduceOperationreduceOp, + final SupplierFromVertex handleHashSupplier, + final SupplierFromVertex valueSupplier, + final ConsumerWithVertex reducedValueConsumer) { +return new Piece() { + protected ArrayOfHandles.ArrayOfReducersreducers; + protected BroadcastArrayHandle broadcasts; + + private int getHandleIndex(Vertex vertex) { +return (int) Math.abs(handleHashSupplier.get(vertex) % numHandles); + } + + @Override + public void registerReducers( + final CreateReducersApi reduceApi, Object executionStage) { +reducers = new ArrayOfHandles.ArrayOfReducers<>( +numHandles, +new Supplier>() { + @Override + public ReducerHandle get() { +return reduceApi.createLocalReducer(reduceOp); --- End diff -- Good catch, it didn't occur to me. I'll fix it not to reuse the same ReduceOperation object. > Connected components - make calculate sizes work with large number of > components > > > Key: GIRAPH-1148 > URL: https://issues.apache.org/jira/browse/GIRAPH-1148 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo > > Currently if we have a graph with large number of connected components, > calculating connected components sizes fails because reducer becomes too > large. Use array of handles instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components
[ https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033358#comment-16033358 ] ASF GitHub Bot commented on GIRAPH-1148: Github user dlogothetis commented on a diff in the pull request: https://github.com/apache/giraph/pull/39#discussion_r119681042 --- Diff: giraph-block-app/src/main/java/org/apache/giraph/block_app/library/Pieces.java --- @@ -320,6 +325,91 @@ public String toString() { } /** + * Like reduceAndBroadcast, but uses array of handles for reducers and + * broadcasts, to make it feasible and performant when values are large. + * Each supplied value to reduce will be reduced in the handle defined by + * handleHashSupplier%numHandles + * + * @param Single value type, objects passed on workers + * @param Reduced value type + * @param Vertex id type + * @param Vertex value type + * @param Edge value type + */ + public static ++ Piece reduceAndBroadcastWithArrayOfHandles( + final String name, + final int numHandles, + final ReduceOperationreduceOp, + final SupplierFromVertex handleHashSupplier, + final SupplierFromVertex valueSupplier, + final ConsumerWithVertex reducedValueConsumer) { +return new Piece() { + protected ArrayOfHandles.ArrayOfReducersreducers; + protected BroadcastArrayHandle broadcasts; + + private int getHandleIndex(Vertex vertex) { +return (int) Math.abs(handleHashSupplier.get(vertex) % numHandles); + } + + @Override + public void registerReducers( + final CreateReducersApi reduceApi, Object executionStage) { +reducers = new ArrayOfHandles.ArrayOfReducers<>( +numHandles, +new Supplier>() { + @Override + public ReducerHandle get() { +return reduceApi.createLocalReducer(reduceOp); --- End diff -- Actually, because you're using an ArrayOfHandles, this is going to be serialized-deserialized as a whole. This should be ok. Though, this also assumes that reduce operations are stateless. In general, they are but there's nothing that prevents from writing a stateful one. Maybe add a comment about this. > Connected components - make calculate sizes work with large number of > components > > > Key: GIRAPH-1148 > URL: https://issues.apache.org/jira/browse/GIRAPH-1148 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo > > Currently if we have a graph with large number of connected components, > calculating connected components sizes fails because reducer becomes too > large. Use array of handles instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components
[ https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16033356#comment-16033356 ] ASF GitHub Bot commented on GIRAPH-1148: Github user dlogothetis commented on a diff in the pull request: https://github.com/apache/giraph/pull/39#discussion_r119680240 --- Diff: giraph-block-app/src/main/java/org/apache/giraph/block_app/library/Pieces.java --- @@ -320,6 +325,91 @@ public String toString() { } /** + * Like reduceAndBroadcast, but uses array of handles for reducers and + * broadcasts, to make it feasible and performant when values are large. + * Each supplied value to reduce will be reduced in the handle defined by + * handleHashSupplier%numHandles + * + * @param Single value type, objects passed on workers + * @param Reduced value type + * @param Vertex id type + * @param Vertex value type + * @param Edge value type + */ + public static ++ Piece reduceAndBroadcastWithArrayOfHandles( + final String name, + final int numHandles, + final ReduceOperationreduceOp, + final SupplierFromVertex handleHashSupplier, + final SupplierFromVertex valueSupplier, + final ConsumerWithVertex reducedValueConsumer) { +return new Piece() { + protected ArrayOfHandles.ArrayOfReducersreducers; + protected BroadcastArrayHandle broadcasts; + + private int getHandleIndex(Vertex vertex) { +return (int) Math.abs(handleHashSupplier.get(vertex) % numHandles); + } + + @Override + public void registerReducers( + final CreateReducersApi reduceApi, Object executionStage) { +reducers = new ArrayOfHandles.ArrayOfReducers<>( +numHandles, +new Supplier>() { + @Override + public ReducerHandle get() { +return reduceApi.createLocalReducer(reduceOp); --- End diff -- This means that the same ReduceOperation instance is going to be shared across the different handles. Not entirely sure how the (de)serialization will work here. > Connected components - make calculate sizes work with large number of > components > > > Key: GIRAPH-1148 > URL: https://issues.apache.org/jira/browse/GIRAPH-1148 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo > > Currently if we have a graph with large number of connected components, > calculating connected components sizes fails because reducer becomes too > large. Use array of handles instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components
[ https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16029759#comment-16029759 ] ASF GitHub Bot commented on GIRAPH-1148: GitHub user majakabiljo opened a pull request: https://github.com/apache/giraph/pull/39 [GIRAPH-1148] Connected components - make calculate sizes work with l… …arge number of components Summary: Currently if we have a graph with large number of connected components, calculating connected components sizes fails because reducer becomes too large. Use array of handles instead. Test Plan: Successfully ran the job which was failing without this change You can merge this pull request into a Git repository by running: $ git pull https://github.com/majakabiljo/giraph cc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/39.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #39 commit 0b9ef9415a67737122e6fb6caeca5307df9632a1 Author: Maja KabiljoDate: 2017-05-30T17:10:17Z [GIRAPH-1148] Connected components - make calculate sizes work with large number of components Summary: Currently if we have a graph with large number of connected components, calculating connected components sizes fails because reducer becomes too large. Use array of handles instead. Test Plan: Successfully ran the job which was failing without this change > Connected components - make calculate sizes work with large number of > components > > > Key: GIRAPH-1148 > URL: https://issues.apache.org/jira/browse/GIRAPH-1148 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo > > Currently if we have a graph with large number of connected components, > calculating connected components sizes fails because reducer becomes too > large. Use array of handles instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346)