[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core
[ https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723824#comment-15723824 ] ASF GitHub Bot commented on GIRAPH-1125: GitHub user heslami opened a pull request: https://github.com/apache/giraph/pull/12 [GIRAPH-1125] Memory estimation mechanism for more efficient OOC execution **Thanks to Dionysios Logothetis for helping a lot with this diff.** The new out-of-core mechanism is designed with the adaptivity goal in mind, meaning that we wanted the out-of-core mechanism to kick in only when it is necessary. In other words, when the amount of data (graph, messages, and mutations) all fit in memory, we want to take advantage of the entire memory. And, when in a stage the memory is short, only enough (minimal) amount of data goes out of core (to disk). This ensures a good performance for the out-of-core mechanism. To satisfy the adaptiveness goal, we need to know how much memory is used at each point of time. The default out-of-core mechanism (ThresholdBasedOracle) get memory information based on JVM's internal methods (Runtime's freeMemory()). This method is inaccurate (and pessimistic), meaning that it does not account for garbage data that has not been purged by GC. Using JVM's default methods, OOC behaves pessimistically and move data out of core even if it is not necessary. For instance, consider the case where there are a lot of garbage on the heap, but GC has not happened for a while. In this case, the default OOC pushes data on disk and immediately after a major GC it brings back the data to memory. This causes inefficiency in the default out of core mechanism. If out-of-core is used, but the data can entirely fit in memory, the job goes out of core even though going out of core is not necessary. To address this issue, we need to have a mechanism to more accurately know how much of heap is filled with non-garbage data. Consequently, we need to change the Oracle (OOC policy) to take advantage of a more accurate memory usage estimation. In this diff, we introduce a mechanism to estimate the amount of memory used by non-garbage data on the heap at each point of time. This estimation is based on the fact that Giraph is a data-parallel system in its essence, meaning that several types of threads exist, each type doing the same computation on various data. More specifically, we have compute/input threads, communication (receiving) threads, and OOC-IO threads. In a normal uniform execution, each type of threads behave similarly and contribute similarly to each other on the memory footprint (meaning that different compute threads contribute similarly to each other on the memory footprint). In the proposed approach, we use a measure of progress for each type of thread and use linear regression to estimate the amount of memory. The measure of progress for compute threads is the total number of vertices they have collectively processed in a superstep at each point, the measure of progress for communication threads is the total number of bytes received by a worker up to each point, and the measure of progress for IO threads is the amount of data read/written to disk up to each point during a superstep. These measures are restarted at the beginning of each superstep. We use these measures at the point where full GC happens (when we have the accurate estimation of non-garbage data on the heap) and devise the linear model of used memory. We then use the linear model to estimate the amount of memory at each time based on the above progress measures. You can merge this pull request into a Git repository by running: $ git pull https://github.com/heslami/giraph ooc-memory-estimation Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/12.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #12 commit 4d2b7c96da94aa26cddf75655dd6cb7d93bf9b1d Author: Hassan Eslami Date: 2016-12-05T23:37:29Z [GIRAPH-1125] Memory estimation mechanism for more efficient OOC execution The new out-of-core mechanism is designed with the adaptivity goal in mind, meaning that we wanted out-of-core mechanism to kick in only when it is necessary. In other words, when the amount of data (graph, messages, and mutations) all fit in memory, we want to take advantage of the entire memory. And, when in a stage the memory is short, only enough (minimal) amount of data goes out of core (to disk). This ensures a good performance for the out-of-core mechanism. To satisfy the adaptiveness goal, we need to know how much memory is used at each point of time. The default out-of-core mechanism (ThresholdBasedOracle) ge
[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core
[ https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15749458#comment-15749458 ] ASF GitHub Bot commented on GIRAPH-1125: Github user edunov commented on a diff in the pull request: https://github.com/apache/giraph/pull/12#discussion_r92487131 --- Diff: giraph-core/src/main/java/org/apache/giraph/utils/ThreadLocalProgressCounter.java --- @@ -0,0 +1,67 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.giraph.utils; + +import java.util.ArrayList; +import java.util.List; + +/** + * Makes a list of {@link ProgressCounter} accessible through + * a {@link ThreadLocal}. + */ +public class ThreadLocalProgressCounter extends ThreadLocal { + /** + * List of counters. + */ + private final List counters = new ArrayList<>(); + + /** + * Initializes a new counter, adds it to the list of counters + * and returns it. + * @return Progress counter. + */ + @Override + protected ProgressCounter initialValue() { +ProgressCounter threadCounter = new ProgressCounter(); +synchronized (counters) { + counters.add(threadCounter); +} +return threadCounter; + } + + /** + * Sums the progress of all counters. + * @return Sum of all counters + */ + public long getProgress() { +long progress = 0; +synchronized (counters) { + for (ProgressCounter entry : counters) { +progress += entry.getValue(); + } +} +return progress; + } + + /** + * Removes all counters. + */ + public void reset() { --- End diff -- What is the purpose of this function and how do you use it? > Add memory estimation mechanism to out-of-core > -- > > Key: GIRAPH-1125 > URL: https://issues.apache.org/jira/browse/GIRAPH-1125 > Project: Giraph > Issue Type: Improvement >Reporter: Hassan Eslami >Assignee: Hassan Eslami > > The new out-of-core mechanism is designed with the adaptivity goal in mind, > meaning that we wanted out-of-core mechanism to kick in only when it is > necessary. In other words, when the amount of data (graph, messages, and > mutations) all fit in memory, we want to take advantage of the entire memory. > And, when in a stage the memory is short, only enough (minimal) amount of > data goes out of core (to disk). This ensures a good performance for the > out-of-core mechanism. > To satisfy the adaptiveness goal, we need to know how much memory is used at > each point of time. The default out-of-core mechanism (ThresholdBasedOracle) > get memory information based on JVM's internal methods (Runtime's > freeMemory()). This method is inaccurate (and pessimistic), meaning that it > does not account for garbage data that has not been purged by GC. Using JVM's > default methods, OOC behaves pessimistically and move data out of core even > if it is not necessary. For instance, consider the case where there are a lot > of garbage on the heap, but GC has not happened for a while. In this case, > the default OOC pushes data on disk and immediately after a major GC it > brings back the data to memory. This causes inefficiency in the default out > of core mechanism. If out-of-core is used but the data can entirely fit in > memory, the job goes out of core even though going out of core is not > necessary. > To address this issue, we need to have a mechanism to more accurately know > how much of heap is filled with non-garbage data. Consequently, we need to > change the Oracle (OOC policy) to take advantage of a more accurate memory > usage estimation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core
[ https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15755780#comment-15755780 ] ASF GitHub Bot commented on GIRAPH-1125: Github user edunov commented on a diff in the pull request: https://github.com/apache/giraph/pull/12#discussion_r92903138 --- Diff: giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java --- @@ -0,0 +1,851 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.giraph.ooc.policy; + +import com.sun.management.GarbageCollectionNotificationInfo; +import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression; +import org.apache.giraph.comm.NetworkMetrics; +import org.apache.giraph.conf.FloatConfOption; +import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration; +import org.apache.giraph.conf.LongConfOption; +import org.apache.giraph.edge.AbstractEdgeStore; +import org.apache.giraph.ooc.OutOfCoreEngine; +import org.apache.giraph.ooc.command.IOCommand; +import org.apache.giraph.ooc.command.LoadPartitionIOCommand; +import org.apache.giraph.ooc.command.WaitIOCommand; +import org.apache.giraph.worker.EdgeInputSplitsCallable; +import org.apache.giraph.worker.VertexInputSplitsCallable; +import org.apache.giraph.worker.WorkerProgress; +import org.apache.log4j.Logger; + +import java.lang.management.ManagementFactory; +import java.lang.management.MemoryPoolMXBean; +import java.lang.management.MemoryUsage; +import java.util.List; +import java.util.Map; +import java.util.Vector; +import java.util.concurrent.atomic.AtomicLong; +import java.util.concurrent.locks.Lock; +import java.util.concurrent.locks.ReentrantLock; + +import static com.google.common.base.Preconditions.checkState; + +/** + * Implementation of {@link OutOfCoreOracle} that uses a linear regression model + * to estimate actual memory usage based on the current state of computation. + * The model takes into consideration 5 parameters: + * + * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5 + * + * y: memory usage + * x1: edges loaded + * x2: vertices loaded + * x3: vertices processed + * x4: bytes received due to messages + * x5: bytes loaded/stored from/to disk due to OOC. + * + */ +public class MemoryEstimatorOracle implements OutOfCoreOracle { + /** Memory check interval in msec */ + public static final LongConfOption CHECK_MEMORY_INTERVAL = +new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000, +"The interval where memory checker thread wakes up and " + +"monitors memory footprint (in milliseconds)"); + /** + * If mem-usage is above this threshold and no Full GC has been called, + * we call it manually + */ + public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE = +new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f, +"The threshold above which GC is called manually if Full GC has not " + +"happened in a while"); + /** Used to detect a high memory pressure situation */ + public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION = +new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f, +"Minimum percentage of memory we expect to be reclaimed after a Full " + +"GC. If less than this amount is reclaimed, it is sage to say " + +"we are in a high memory situation and the estimation mechanism " + +"has not recognized it yet!"); + /** If mem-usage is above this threshold, active threads are set to 0 */ + public static final FloatConfOption AM_HIGH_THRESHOLD = +new FloatConfOption("giraph.amHighThreshold", 0.95f, +"If mem-usage is above this threshold, all active threads " + +"(compute/input) are paused."); + /** If me
[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core
[ https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15755794#comment-15755794 ] ASF GitHub Bot commented on GIRAPH-1125: Github user edunov commented on a diff in the pull request: https://github.com/apache/giraph/pull/12#discussion_r92903902 --- Diff: giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java --- @@ -0,0 +1,851 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.giraph.ooc.policy; + +import com.sun.management.GarbageCollectionNotificationInfo; +import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression; +import org.apache.giraph.comm.NetworkMetrics; +import org.apache.giraph.conf.FloatConfOption; +import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration; +import org.apache.giraph.conf.LongConfOption; +import org.apache.giraph.edge.AbstractEdgeStore; +import org.apache.giraph.ooc.OutOfCoreEngine; +import org.apache.giraph.ooc.command.IOCommand; +import org.apache.giraph.ooc.command.LoadPartitionIOCommand; +import org.apache.giraph.ooc.command.WaitIOCommand; +import org.apache.giraph.worker.EdgeInputSplitsCallable; +import org.apache.giraph.worker.VertexInputSplitsCallable; +import org.apache.giraph.worker.WorkerProgress; +import org.apache.log4j.Logger; + +import java.lang.management.ManagementFactory; +import java.lang.management.MemoryPoolMXBean; +import java.lang.management.MemoryUsage; +import java.util.List; +import java.util.Map; +import java.util.Vector; +import java.util.concurrent.atomic.AtomicLong; +import java.util.concurrent.locks.Lock; +import java.util.concurrent.locks.ReentrantLock; + +import static com.google.common.base.Preconditions.checkState; + +/** + * Implementation of {@link OutOfCoreOracle} that uses a linear regression model + * to estimate actual memory usage based on the current state of computation. + * The model takes into consideration 5 parameters: + * + * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5 + * + * y: memory usage + * x1: edges loaded + * x2: vertices loaded + * x3: vertices processed + * x4: bytes received due to messages + * x5: bytes loaded/stored from/to disk due to OOC. + * + */ +public class MemoryEstimatorOracle implements OutOfCoreOracle { + /** Memory check interval in msec */ + public static final LongConfOption CHECK_MEMORY_INTERVAL = +new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000, +"The interval where memory checker thread wakes up and " + +"monitors memory footprint (in milliseconds)"); + /** + * If mem-usage is above this threshold and no Full GC has been called, + * we call it manually + */ + public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE = +new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f, +"The threshold above which GC is called manually if Full GC has not " + +"happened in a while"); + /** Used to detect a high memory pressure situation */ + public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION = +new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f, +"Minimum percentage of memory we expect to be reclaimed after a Full " + +"GC. If less than this amount is reclaimed, it is sage to say " + +"we are in a high memory situation and the estimation mechanism " + +"has not recognized it yet!"); + /** If mem-usage is above this threshold, active threads are set to 0 */ + public static final FloatConfOption AM_HIGH_THRESHOLD = +new FloatConfOption("giraph.amHighThreshold", 0.95f, +"If mem-usage is above this threshold, all active threads " + +"(compute/input) are paused."); + /** If me
[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core
[ https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15755824#comment-15755824 ] ASF GitHub Bot commented on GIRAPH-1125: Github user edunov commented on a diff in the pull request: https://github.com/apache/giraph/pull/12#discussion_r92904852 --- Diff: giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java --- @@ -0,0 +1,851 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.giraph.ooc.policy; + +import com.sun.management.GarbageCollectionNotificationInfo; +import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression; +import org.apache.giraph.comm.NetworkMetrics; +import org.apache.giraph.conf.FloatConfOption; +import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration; +import org.apache.giraph.conf.LongConfOption; +import org.apache.giraph.edge.AbstractEdgeStore; +import org.apache.giraph.ooc.OutOfCoreEngine; +import org.apache.giraph.ooc.command.IOCommand; +import org.apache.giraph.ooc.command.LoadPartitionIOCommand; +import org.apache.giraph.ooc.command.WaitIOCommand; +import org.apache.giraph.worker.EdgeInputSplitsCallable; +import org.apache.giraph.worker.VertexInputSplitsCallable; +import org.apache.giraph.worker.WorkerProgress; +import org.apache.log4j.Logger; + +import java.lang.management.ManagementFactory; +import java.lang.management.MemoryPoolMXBean; +import java.lang.management.MemoryUsage; +import java.util.List; +import java.util.Map; +import java.util.Vector; +import java.util.concurrent.atomic.AtomicLong; +import java.util.concurrent.locks.Lock; +import java.util.concurrent.locks.ReentrantLock; + +import static com.google.common.base.Preconditions.checkState; + +/** + * Implementation of {@link OutOfCoreOracle} that uses a linear regression model + * to estimate actual memory usage based on the current state of computation. + * The model takes into consideration 5 parameters: + * + * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5 + * + * y: memory usage + * x1: edges loaded + * x2: vertices loaded + * x3: vertices processed + * x4: bytes received due to messages + * x5: bytes loaded/stored from/to disk due to OOC. + * + */ +public class MemoryEstimatorOracle implements OutOfCoreOracle { + /** Memory check interval in msec */ + public static final LongConfOption CHECK_MEMORY_INTERVAL = +new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000, +"The interval where memory checker thread wakes up and " + +"monitors memory footprint (in milliseconds)"); + /** + * If mem-usage is above this threshold and no Full GC has been called, + * we call it manually + */ + public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE = +new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f, +"The threshold above which GC is called manually if Full GC has not " + +"happened in a while"); + /** Used to detect a high memory pressure situation */ + public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION = +new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f, +"Minimum percentage of memory we expect to be reclaimed after a Full " + +"GC. If less than this amount is reclaimed, it is sage to say " + +"we are in a high memory situation and the estimation mechanism " + +"has not recognized it yet!"); + /** If mem-usage is above this threshold, active threads are set to 0 */ + public static final FloatConfOption AM_HIGH_THRESHOLD = +new FloatConfOption("giraph.amHighThreshold", 0.95f, +"If mem-usage is above this threshold, all active threads " + +"(compute/input) are paused."); + /** If me
[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core
[ https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15755842#comment-15755842 ] ASF GitHub Bot commented on GIRAPH-1125: Github user edunov commented on a diff in the pull request: https://github.com/apache/giraph/pull/12#discussion_r92905726 --- Diff: giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java --- @@ -0,0 +1,851 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.giraph.ooc.policy; + +import com.sun.management.GarbageCollectionNotificationInfo; +import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression; +import org.apache.giraph.comm.NetworkMetrics; +import org.apache.giraph.conf.FloatConfOption; +import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration; +import org.apache.giraph.conf.LongConfOption; +import org.apache.giraph.edge.AbstractEdgeStore; +import org.apache.giraph.ooc.OutOfCoreEngine; +import org.apache.giraph.ooc.command.IOCommand; +import org.apache.giraph.ooc.command.LoadPartitionIOCommand; +import org.apache.giraph.ooc.command.WaitIOCommand; +import org.apache.giraph.worker.EdgeInputSplitsCallable; +import org.apache.giraph.worker.VertexInputSplitsCallable; +import org.apache.giraph.worker.WorkerProgress; +import org.apache.log4j.Logger; + +import java.lang.management.ManagementFactory; +import java.lang.management.MemoryPoolMXBean; +import java.lang.management.MemoryUsage; +import java.util.List; +import java.util.Map; +import java.util.Vector; +import java.util.concurrent.atomic.AtomicLong; +import java.util.concurrent.locks.Lock; +import java.util.concurrent.locks.ReentrantLock; + +import static com.google.common.base.Preconditions.checkState; + +/** + * Implementation of {@link OutOfCoreOracle} that uses a linear regression model + * to estimate actual memory usage based on the current state of computation. + * The model takes into consideration 5 parameters: + * + * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5 + * + * y: memory usage + * x1: edges loaded + * x2: vertices loaded + * x3: vertices processed + * x4: bytes received due to messages + * x5: bytes loaded/stored from/to disk due to OOC. + * + */ +public class MemoryEstimatorOracle implements OutOfCoreOracle { + /** Memory check interval in msec */ + public static final LongConfOption CHECK_MEMORY_INTERVAL = +new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000, +"The interval where memory checker thread wakes up and " + +"monitors memory footprint (in milliseconds)"); + /** + * If mem-usage is above this threshold and no Full GC has been called, + * we call it manually + */ + public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE = +new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f, +"The threshold above which GC is called manually if Full GC has not " + +"happened in a while"); + /** Used to detect a high memory pressure situation */ + public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION = +new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f, +"Minimum percentage of memory we expect to be reclaimed after a Full " + +"GC. If less than this amount is reclaimed, it is sage to say " + +"we are in a high memory situation and the estimation mechanism " + +"has not recognized it yet!"); + /** If mem-usage is above this threshold, active threads are set to 0 */ + public static final FloatConfOption AM_HIGH_THRESHOLD = +new FloatConfOption("giraph.amHighThreshold", 0.95f, +"If mem-usage is above this threshold, all active threads " + +"(compute/input) are paused."); + /** If me
[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core
[ https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15755874#comment-15755874 ] ASF GitHub Bot commented on GIRAPH-1125: Github user edunov commented on a diff in the pull request: https://github.com/apache/giraph/pull/12#discussion_r92906967 --- Diff: giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java --- @@ -0,0 +1,851 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.giraph.ooc.policy; + +import com.sun.management.GarbageCollectionNotificationInfo; +import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression; +import org.apache.giraph.comm.NetworkMetrics; +import org.apache.giraph.conf.FloatConfOption; +import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration; +import org.apache.giraph.conf.LongConfOption; +import org.apache.giraph.edge.AbstractEdgeStore; +import org.apache.giraph.ooc.OutOfCoreEngine; +import org.apache.giraph.ooc.command.IOCommand; +import org.apache.giraph.ooc.command.LoadPartitionIOCommand; +import org.apache.giraph.ooc.command.WaitIOCommand; +import org.apache.giraph.worker.EdgeInputSplitsCallable; +import org.apache.giraph.worker.VertexInputSplitsCallable; +import org.apache.giraph.worker.WorkerProgress; +import org.apache.log4j.Logger; + +import java.lang.management.ManagementFactory; +import java.lang.management.MemoryPoolMXBean; +import java.lang.management.MemoryUsage; +import java.util.List; +import java.util.Map; +import java.util.Vector; +import java.util.concurrent.atomic.AtomicLong; +import java.util.concurrent.locks.Lock; +import java.util.concurrent.locks.ReentrantLock; + +import static com.google.common.base.Preconditions.checkState; + +/** + * Implementation of {@link OutOfCoreOracle} that uses a linear regression model + * to estimate actual memory usage based on the current state of computation. + * The model takes into consideration 5 parameters: + * + * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5 + * + * y: memory usage + * x1: edges loaded + * x2: vertices loaded + * x3: vertices processed + * x4: bytes received due to messages + * x5: bytes loaded/stored from/to disk due to OOC. + * + */ +public class MemoryEstimatorOracle implements OutOfCoreOracle { + /** Memory check interval in msec */ + public static final LongConfOption CHECK_MEMORY_INTERVAL = +new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000, +"The interval where memory checker thread wakes up and " + +"monitors memory footprint (in milliseconds)"); + /** + * If mem-usage is above this threshold and no Full GC has been called, + * we call it manually + */ + public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE = +new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f, +"The threshold above which GC is called manually if Full GC has not " + +"happened in a while"); + /** Used to detect a high memory pressure situation */ + public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION = +new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f, +"Minimum percentage of memory we expect to be reclaimed after a Full " + +"GC. If less than this amount is reclaimed, it is sage to say " + +"we are in a high memory situation and the estimation mechanism " + +"has not recognized it yet!"); + /** If mem-usage is above this threshold, active threads are set to 0 */ + public static final FloatConfOption AM_HIGH_THRESHOLD = +new FloatConfOption("giraph.amHighThreshold", 0.95f, +"If mem-usage is above this threshold, all active threads " + +"(compute/input) are paused."); + /** If me
[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core
[ https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15755878#comment-15755878 ] ASF GitHub Bot commented on GIRAPH-1125: Github user edunov commented on a diff in the pull request: https://github.com/apache/giraph/pull/12#discussion_r92907194 --- Diff: giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java --- @@ -0,0 +1,851 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.giraph.ooc.policy; + +import com.sun.management.GarbageCollectionNotificationInfo; +import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression; +import org.apache.giraph.comm.NetworkMetrics; +import org.apache.giraph.conf.FloatConfOption; +import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration; +import org.apache.giraph.conf.LongConfOption; +import org.apache.giraph.edge.AbstractEdgeStore; +import org.apache.giraph.ooc.OutOfCoreEngine; +import org.apache.giraph.ooc.command.IOCommand; +import org.apache.giraph.ooc.command.LoadPartitionIOCommand; +import org.apache.giraph.ooc.command.WaitIOCommand; +import org.apache.giraph.worker.EdgeInputSplitsCallable; +import org.apache.giraph.worker.VertexInputSplitsCallable; +import org.apache.giraph.worker.WorkerProgress; +import org.apache.log4j.Logger; + +import java.lang.management.ManagementFactory; +import java.lang.management.MemoryPoolMXBean; +import java.lang.management.MemoryUsage; +import java.util.List; +import java.util.Map; +import java.util.Vector; +import java.util.concurrent.atomic.AtomicLong; +import java.util.concurrent.locks.Lock; +import java.util.concurrent.locks.ReentrantLock; + +import static com.google.common.base.Preconditions.checkState; + +/** + * Implementation of {@link OutOfCoreOracle} that uses a linear regression model + * to estimate actual memory usage based on the current state of computation. + * The model takes into consideration 5 parameters: + * + * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5 + * + * y: memory usage + * x1: edges loaded + * x2: vertices loaded + * x3: vertices processed + * x4: bytes received due to messages + * x5: bytes loaded/stored from/to disk due to OOC. + * + */ +public class MemoryEstimatorOracle implements OutOfCoreOracle { + /** Memory check interval in msec */ + public static final LongConfOption CHECK_MEMORY_INTERVAL = +new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000, +"The interval where memory checker thread wakes up and " + +"monitors memory footprint (in milliseconds)"); + /** + * If mem-usage is above this threshold and no Full GC has been called, + * we call it manually + */ + public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE = +new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f, +"The threshold above which GC is called manually if Full GC has not " + +"happened in a while"); + /** Used to detect a high memory pressure situation */ + public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION = +new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f, +"Minimum percentage of memory we expect to be reclaimed after a Full " + +"GC. If less than this amount is reclaimed, it is sage to say " + +"we are in a high memory situation and the estimation mechanism " + +"has not recognized it yet!"); + /** If mem-usage is above this threshold, active threads are set to 0 */ + public static final FloatConfOption AM_HIGH_THRESHOLD = +new FloatConfOption("giraph.amHighThreshold", 0.95f, +"If mem-usage is above this threshold, all active threads " + +"(compute/input) are paused."); + /** If me
[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core
[ https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762437#comment-15762437 ] ASF GitHub Bot commented on GIRAPH-1125: Github user heslami commented on a diff in the pull request: https://github.com/apache/giraph/pull/12#discussion_r93129286 --- Diff: giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java --- @@ -0,0 +1,851 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.giraph.ooc.policy; + +import com.sun.management.GarbageCollectionNotificationInfo; +import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression; +import org.apache.giraph.comm.NetworkMetrics; +import org.apache.giraph.conf.FloatConfOption; +import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration; +import org.apache.giraph.conf.LongConfOption; +import org.apache.giraph.edge.AbstractEdgeStore; +import org.apache.giraph.ooc.OutOfCoreEngine; +import org.apache.giraph.ooc.command.IOCommand; +import org.apache.giraph.ooc.command.LoadPartitionIOCommand; +import org.apache.giraph.ooc.command.WaitIOCommand; +import org.apache.giraph.worker.EdgeInputSplitsCallable; +import org.apache.giraph.worker.VertexInputSplitsCallable; +import org.apache.giraph.worker.WorkerProgress; +import org.apache.log4j.Logger; + +import java.lang.management.ManagementFactory; +import java.lang.management.MemoryPoolMXBean; +import java.lang.management.MemoryUsage; +import java.util.List; +import java.util.Map; +import java.util.Vector; +import java.util.concurrent.atomic.AtomicLong; +import java.util.concurrent.locks.Lock; +import java.util.concurrent.locks.ReentrantLock; + +import static com.google.common.base.Preconditions.checkState; + +/** + * Implementation of {@link OutOfCoreOracle} that uses a linear regression model + * to estimate actual memory usage based on the current state of computation. + * The model takes into consideration 5 parameters: + * + * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5 + * + * y: memory usage + * x1: edges loaded + * x2: vertices loaded + * x3: vertices processed + * x4: bytes received due to messages + * x5: bytes loaded/stored from/to disk due to OOC. + * + */ +public class MemoryEstimatorOracle implements OutOfCoreOracle { + /** Memory check interval in msec */ + public static final LongConfOption CHECK_MEMORY_INTERVAL = +new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000, +"The interval where memory checker thread wakes up and " + +"monitors memory footprint (in milliseconds)"); + /** + * If mem-usage is above this threshold and no Full GC has been called, + * we call it manually + */ + public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE = +new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f, +"The threshold above which GC is called manually if Full GC has not " + +"happened in a while"); + /** Used to detect a high memory pressure situation */ + public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION = +new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f, +"Minimum percentage of memory we expect to be reclaimed after a Full " + +"GC. If less than this amount is reclaimed, it is sage to say " + +"we are in a high memory situation and the estimation mechanism " + +"has not recognized it yet!"); + /** If mem-usage is above this threshold, active threads are set to 0 */ + public static final FloatConfOption AM_HIGH_THRESHOLD = +new FloatConfOption("giraph.amHighThreshold", 0.95f, +"If mem-usage is above this threshold, all active threads " + +"(compute/input) are paused."); + /** If m
[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core
[ https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762464#comment-15762464 ] ASF GitHub Bot commented on GIRAPH-1125: Github user heslami commented on a diff in the pull request: https://github.com/apache/giraph/pull/12#discussion_r93131260 --- Diff: giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java --- @@ -0,0 +1,851 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.giraph.ooc.policy; + +import com.sun.management.GarbageCollectionNotificationInfo; +import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression; +import org.apache.giraph.comm.NetworkMetrics; +import org.apache.giraph.conf.FloatConfOption; +import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration; +import org.apache.giraph.conf.LongConfOption; +import org.apache.giraph.edge.AbstractEdgeStore; +import org.apache.giraph.ooc.OutOfCoreEngine; +import org.apache.giraph.ooc.command.IOCommand; +import org.apache.giraph.ooc.command.LoadPartitionIOCommand; +import org.apache.giraph.ooc.command.WaitIOCommand; +import org.apache.giraph.worker.EdgeInputSplitsCallable; +import org.apache.giraph.worker.VertexInputSplitsCallable; +import org.apache.giraph.worker.WorkerProgress; +import org.apache.log4j.Logger; + +import java.lang.management.ManagementFactory; +import java.lang.management.MemoryPoolMXBean; +import java.lang.management.MemoryUsage; +import java.util.List; +import java.util.Map; +import java.util.Vector; +import java.util.concurrent.atomic.AtomicLong; +import java.util.concurrent.locks.Lock; +import java.util.concurrent.locks.ReentrantLock; + +import static com.google.common.base.Preconditions.checkState; + +/** + * Implementation of {@link OutOfCoreOracle} that uses a linear regression model + * to estimate actual memory usage based on the current state of computation. + * The model takes into consideration 5 parameters: + * + * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5 + * + * y: memory usage + * x1: edges loaded + * x2: vertices loaded + * x3: vertices processed + * x4: bytes received due to messages + * x5: bytes loaded/stored from/to disk due to OOC. + * + */ +public class MemoryEstimatorOracle implements OutOfCoreOracle { + /** Memory check interval in msec */ + public static final LongConfOption CHECK_MEMORY_INTERVAL = +new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000, +"The interval where memory checker thread wakes up and " + +"monitors memory footprint (in milliseconds)"); + /** + * If mem-usage is above this threshold and no Full GC has been called, + * we call it manually + */ + public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE = +new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f, +"The threshold above which GC is called manually if Full GC has not " + +"happened in a while"); + /** Used to detect a high memory pressure situation */ + public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION = +new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f, +"Minimum percentage of memory we expect to be reclaimed after a Full " + +"GC. If less than this amount is reclaimed, it is sage to say " + +"we are in a high memory situation and the estimation mechanism " + +"has not recognized it yet!"); + /** If mem-usage is above this threshold, active threads are set to 0 */ + public static final FloatConfOption AM_HIGH_THRESHOLD = +new FloatConfOption("giraph.amHighThreshold", 0.95f, +"If mem-usage is above this threshold, all active threads " + +"(compute/input) are paused."); + /** If m
[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core
[ https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762554#comment-15762554 ] ASF GitHub Bot commented on GIRAPH-1125: Github user heslami commented on a diff in the pull request: https://github.com/apache/giraph/pull/12#discussion_r93137918 --- Diff: giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java --- @@ -0,0 +1,851 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.giraph.ooc.policy; + +import com.sun.management.GarbageCollectionNotificationInfo; +import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression; +import org.apache.giraph.comm.NetworkMetrics; +import org.apache.giraph.conf.FloatConfOption; +import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration; +import org.apache.giraph.conf.LongConfOption; +import org.apache.giraph.edge.AbstractEdgeStore; +import org.apache.giraph.ooc.OutOfCoreEngine; +import org.apache.giraph.ooc.command.IOCommand; +import org.apache.giraph.ooc.command.LoadPartitionIOCommand; +import org.apache.giraph.ooc.command.WaitIOCommand; +import org.apache.giraph.worker.EdgeInputSplitsCallable; +import org.apache.giraph.worker.VertexInputSplitsCallable; +import org.apache.giraph.worker.WorkerProgress; +import org.apache.log4j.Logger; + +import java.lang.management.ManagementFactory; +import java.lang.management.MemoryPoolMXBean; +import java.lang.management.MemoryUsage; +import java.util.List; +import java.util.Map; +import java.util.Vector; +import java.util.concurrent.atomic.AtomicLong; +import java.util.concurrent.locks.Lock; +import java.util.concurrent.locks.ReentrantLock; + +import static com.google.common.base.Preconditions.checkState; + +/** + * Implementation of {@link OutOfCoreOracle} that uses a linear regression model + * to estimate actual memory usage based on the current state of computation. + * The model takes into consideration 5 parameters: + * + * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5 + * + * y: memory usage + * x1: edges loaded + * x2: vertices loaded + * x3: vertices processed + * x4: bytes received due to messages + * x5: bytes loaded/stored from/to disk due to OOC. + * + */ +public class MemoryEstimatorOracle implements OutOfCoreOracle { + /** Memory check interval in msec */ + public static final LongConfOption CHECK_MEMORY_INTERVAL = +new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000, +"The interval where memory checker thread wakes up and " + +"monitors memory footprint (in milliseconds)"); + /** + * If mem-usage is above this threshold and no Full GC has been called, + * we call it manually + */ + public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE = +new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f, +"The threshold above which GC is called manually if Full GC has not " + +"happened in a while"); + /** Used to detect a high memory pressure situation */ + public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION = +new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f, +"Minimum percentage of memory we expect to be reclaimed after a Full " + +"GC. If less than this amount is reclaimed, it is sage to say " + +"we are in a high memory situation and the estimation mechanism " + +"has not recognized it yet!"); + /** If mem-usage is above this threshold, active threads are set to 0 */ + public static final FloatConfOption AM_HIGH_THRESHOLD = +new FloatConfOption("giraph.amHighThreshold", 0.95f, +"If mem-usage is above this threshold, all active threads " + +"(compute/input) are paused."); + /** If m
[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core
[ https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762637#comment-15762637 ] ASF GitHub Bot commented on GIRAPH-1125: Github user heslami commented on a diff in the pull request: https://github.com/apache/giraph/pull/12#discussion_r93143346 --- Diff: giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java --- @@ -0,0 +1,851 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.giraph.ooc.policy; + +import com.sun.management.GarbageCollectionNotificationInfo; +import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression; +import org.apache.giraph.comm.NetworkMetrics; +import org.apache.giraph.conf.FloatConfOption; +import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration; +import org.apache.giraph.conf.LongConfOption; +import org.apache.giraph.edge.AbstractEdgeStore; +import org.apache.giraph.ooc.OutOfCoreEngine; +import org.apache.giraph.ooc.command.IOCommand; +import org.apache.giraph.ooc.command.LoadPartitionIOCommand; +import org.apache.giraph.ooc.command.WaitIOCommand; +import org.apache.giraph.worker.EdgeInputSplitsCallable; +import org.apache.giraph.worker.VertexInputSplitsCallable; +import org.apache.giraph.worker.WorkerProgress; +import org.apache.log4j.Logger; + +import java.lang.management.ManagementFactory; +import java.lang.management.MemoryPoolMXBean; +import java.lang.management.MemoryUsage; +import java.util.List; +import java.util.Map; +import java.util.Vector; +import java.util.concurrent.atomic.AtomicLong; +import java.util.concurrent.locks.Lock; +import java.util.concurrent.locks.ReentrantLock; + +import static com.google.common.base.Preconditions.checkState; + +/** + * Implementation of {@link OutOfCoreOracle} that uses a linear regression model + * to estimate actual memory usage based on the current state of computation. + * The model takes into consideration 5 parameters: + * + * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5 + * + * y: memory usage + * x1: edges loaded + * x2: vertices loaded + * x3: vertices processed + * x4: bytes received due to messages + * x5: bytes loaded/stored from/to disk due to OOC. + * + */ +public class MemoryEstimatorOracle implements OutOfCoreOracle { + /** Memory check interval in msec */ + public static final LongConfOption CHECK_MEMORY_INTERVAL = +new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000, +"The interval where memory checker thread wakes up and " + +"monitors memory footprint (in milliseconds)"); + /** + * If mem-usage is above this threshold and no Full GC has been called, + * we call it manually + */ + public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE = +new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f, +"The threshold above which GC is called manually if Full GC has not " + +"happened in a while"); + /** Used to detect a high memory pressure situation */ + public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION = +new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f, +"Minimum percentage of memory we expect to be reclaimed after a Full " + +"GC. If less than this amount is reclaimed, it is sage to say " + +"we are in a high memory situation and the estimation mechanism " + +"has not recognized it yet!"); + /** If mem-usage is above this threshold, active threads are set to 0 */ + public static final FloatConfOption AM_HIGH_THRESHOLD = +new FloatConfOption("giraph.amHighThreshold", 0.95f, +"If mem-usage is above this threshold, all active threads " + +"(compute/input) are paused."); + /** If m
[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core
[ https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762795#comment-15762795 ] ASF GitHub Bot commented on GIRAPH-1125: Github user heslami commented on a diff in the pull request: https://github.com/apache/giraph/pull/12#discussion_r93152281 --- Diff: giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java --- @@ -0,0 +1,851 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.giraph.ooc.policy; + +import com.sun.management.GarbageCollectionNotificationInfo; +import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression; +import org.apache.giraph.comm.NetworkMetrics; +import org.apache.giraph.conf.FloatConfOption; +import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration; +import org.apache.giraph.conf.LongConfOption; +import org.apache.giraph.edge.AbstractEdgeStore; +import org.apache.giraph.ooc.OutOfCoreEngine; +import org.apache.giraph.ooc.command.IOCommand; +import org.apache.giraph.ooc.command.LoadPartitionIOCommand; +import org.apache.giraph.ooc.command.WaitIOCommand; +import org.apache.giraph.worker.EdgeInputSplitsCallable; +import org.apache.giraph.worker.VertexInputSplitsCallable; +import org.apache.giraph.worker.WorkerProgress; +import org.apache.log4j.Logger; + +import java.lang.management.ManagementFactory; +import java.lang.management.MemoryPoolMXBean; +import java.lang.management.MemoryUsage; +import java.util.List; +import java.util.Map; +import java.util.Vector; +import java.util.concurrent.atomic.AtomicLong; +import java.util.concurrent.locks.Lock; +import java.util.concurrent.locks.ReentrantLock; + +import static com.google.common.base.Preconditions.checkState; + +/** + * Implementation of {@link OutOfCoreOracle} that uses a linear regression model + * to estimate actual memory usage based on the current state of computation. + * The model takes into consideration 5 parameters: + * + * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5 + * + * y: memory usage + * x1: edges loaded + * x2: vertices loaded + * x3: vertices processed + * x4: bytes received due to messages + * x5: bytes loaded/stored from/to disk due to OOC. + * + */ +public class MemoryEstimatorOracle implements OutOfCoreOracle { + /** Memory check interval in msec */ + public static final LongConfOption CHECK_MEMORY_INTERVAL = +new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000, +"The interval where memory checker thread wakes up and " + +"monitors memory footprint (in milliseconds)"); + /** + * If mem-usage is above this threshold and no Full GC has been called, + * we call it manually + */ + public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE = +new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f, +"The threshold above which GC is called manually if Full GC has not " + +"happened in a while"); + /** Used to detect a high memory pressure situation */ + public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION = +new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f, +"Minimum percentage of memory we expect to be reclaimed after a Full " + +"GC. If less than this amount is reclaimed, it is sage to say " + +"we are in a high memory situation and the estimation mechanism " + +"has not recognized it yet!"); + /** If mem-usage is above this threshold, active threads are set to 0 */ + public static final FloatConfOption AM_HIGH_THRESHOLD = +new FloatConfOption("giraph.amHighThreshold", 0.95f, +"If mem-usage is above this threshold, all active threads " + +"(compute/input) are paused."); + /** If m
[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core
[ https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15764777#comment-15764777 ] ASF GitHub Bot commented on GIRAPH-1125: Github user heslami commented on a diff in the pull request: https://github.com/apache/giraph/pull/12#discussion_r93288153 --- Diff: giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java --- @@ -0,0 +1,851 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.giraph.ooc.policy; + +import com.sun.management.GarbageCollectionNotificationInfo; +import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression; +import org.apache.giraph.comm.NetworkMetrics; +import org.apache.giraph.conf.FloatConfOption; +import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration; +import org.apache.giraph.conf.LongConfOption; +import org.apache.giraph.edge.AbstractEdgeStore; +import org.apache.giraph.ooc.OutOfCoreEngine; +import org.apache.giraph.ooc.command.IOCommand; +import org.apache.giraph.ooc.command.LoadPartitionIOCommand; +import org.apache.giraph.ooc.command.WaitIOCommand; +import org.apache.giraph.worker.EdgeInputSplitsCallable; +import org.apache.giraph.worker.VertexInputSplitsCallable; +import org.apache.giraph.worker.WorkerProgress; +import org.apache.log4j.Logger; + +import java.lang.management.ManagementFactory; +import java.lang.management.MemoryPoolMXBean; +import java.lang.management.MemoryUsage; +import java.util.List; +import java.util.Map; +import java.util.Vector; +import java.util.concurrent.atomic.AtomicLong; +import java.util.concurrent.locks.Lock; +import java.util.concurrent.locks.ReentrantLock; + +import static com.google.common.base.Preconditions.checkState; + +/** + * Implementation of {@link OutOfCoreOracle} that uses a linear regression model + * to estimate actual memory usage based on the current state of computation. + * The model takes into consideration 5 parameters: + * + * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5 + * + * y: memory usage + * x1: edges loaded + * x2: vertices loaded + * x3: vertices processed + * x4: bytes received due to messages + * x5: bytes loaded/stored from/to disk due to OOC. + * + */ +public class MemoryEstimatorOracle implements OutOfCoreOracle { + /** Memory check interval in msec */ + public static final LongConfOption CHECK_MEMORY_INTERVAL = +new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000, +"The interval where memory checker thread wakes up and " + +"monitors memory footprint (in milliseconds)"); + /** + * If mem-usage is above this threshold and no Full GC has been called, + * we call it manually + */ + public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE = +new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f, +"The threshold above which GC is called manually if Full GC has not " + +"happened in a while"); + /** Used to detect a high memory pressure situation */ + public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION = +new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f, +"Minimum percentage of memory we expect to be reclaimed after a Full " + +"GC. If less than this amount is reclaimed, it is sage to say " + +"we are in a high memory situation and the estimation mechanism " + +"has not recognized it yet!"); + /** If mem-usage is above this threshold, active threads are set to 0 */ + public static final FloatConfOption AM_HIGH_THRESHOLD = +new FloatConfOption("giraph.amHighThreshold", 0.95f, +"If mem-usage is above this threshold, all active threads " + +"(compute/input) are paused."); + /** If m
[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core
[ https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15764788#comment-15764788 ] ASF GitHub Bot commented on GIRAPH-1125: Github user heslami commented on a diff in the pull request: https://github.com/apache/giraph/pull/12#discussion_r93289032 --- Diff: giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java --- @@ -0,0 +1,851 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.giraph.ooc.policy; + +import com.sun.management.GarbageCollectionNotificationInfo; +import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression; +import org.apache.giraph.comm.NetworkMetrics; +import org.apache.giraph.conf.FloatConfOption; +import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration; +import org.apache.giraph.conf.LongConfOption; +import org.apache.giraph.edge.AbstractEdgeStore; +import org.apache.giraph.ooc.OutOfCoreEngine; +import org.apache.giraph.ooc.command.IOCommand; +import org.apache.giraph.ooc.command.LoadPartitionIOCommand; +import org.apache.giraph.ooc.command.WaitIOCommand; +import org.apache.giraph.worker.EdgeInputSplitsCallable; +import org.apache.giraph.worker.VertexInputSplitsCallable; +import org.apache.giraph.worker.WorkerProgress; +import org.apache.log4j.Logger; + +import java.lang.management.ManagementFactory; +import java.lang.management.MemoryPoolMXBean; +import java.lang.management.MemoryUsage; +import java.util.List; +import java.util.Map; +import java.util.Vector; +import java.util.concurrent.atomic.AtomicLong; +import java.util.concurrent.locks.Lock; +import java.util.concurrent.locks.ReentrantLock; + +import static com.google.common.base.Preconditions.checkState; + +/** + * Implementation of {@link OutOfCoreOracle} that uses a linear regression model + * to estimate actual memory usage based on the current state of computation. + * The model takes into consideration 5 parameters: + * + * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5 + * + * y: memory usage + * x1: edges loaded + * x2: vertices loaded + * x3: vertices processed + * x4: bytes received due to messages + * x5: bytes loaded/stored from/to disk due to OOC. + * + */ +public class MemoryEstimatorOracle implements OutOfCoreOracle { + /** Memory check interval in msec */ + public static final LongConfOption CHECK_MEMORY_INTERVAL = +new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000, +"The interval where memory checker thread wakes up and " + +"monitors memory footprint (in milliseconds)"); + /** + * If mem-usage is above this threshold and no Full GC has been called, + * we call it manually + */ + public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE = +new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f, +"The threshold above which GC is called manually if Full GC has not " + +"happened in a while"); + /** Used to detect a high memory pressure situation */ + public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION = +new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f, +"Minimum percentage of memory we expect to be reclaimed after a Full " + +"GC. If less than this amount is reclaimed, it is sage to say " + +"we are in a high memory situation and the estimation mechanism " + +"has not recognized it yet!"); + /** If mem-usage is above this threshold, active threads are set to 0 */ + public static final FloatConfOption AM_HIGH_THRESHOLD = +new FloatConfOption("giraph.amHighThreshold", 0.95f, +"If mem-usage is above this threshold, all active threads " + +"(compute/input) are paused."); + /** If m
[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core
[ https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15771345#comment-15771345 ] ASF GitHub Bot commented on GIRAPH-1125: Github user dlogothetis commented on the issue: https://github.com/apache/giraph/pull/12 Tests done for this diff: - Snapshots tests (those that were big enough to go out-of-core) both with the new and the old mechanism. - Run 4 production jobs and verified that performance is better than the previous mechanism. > Add memory estimation mechanism to out-of-core > -- > > Key: GIRAPH-1125 > URL: https://issues.apache.org/jira/browse/GIRAPH-1125 > Project: Giraph > Issue Type: Improvement >Reporter: Hassan Eslami >Assignee: Hassan Eslami > > The new out-of-core mechanism is designed with the adaptivity goal in mind, > meaning that we wanted out-of-core mechanism to kick in only when it is > necessary. In other words, when the amount of data (graph, messages, and > mutations) all fit in memory, we want to take advantage of the entire memory. > And, when in a stage the memory is short, only enough (minimal) amount of > data goes out of core (to disk). This ensures a good performance for the > out-of-core mechanism. > To satisfy the adaptiveness goal, we need to know how much memory is used at > each point of time. The default out-of-core mechanism (ThresholdBasedOracle) > get memory information based on JVM's internal methods (Runtime's > freeMemory()). This method is inaccurate (and pessimistic), meaning that it > does not account for garbage data that has not been purged by GC. Using JVM's > default methods, OOC behaves pessimistically and move data out of core even > if it is not necessary. For instance, consider the case where there are a lot > of garbage on the heap, but GC has not happened for a while. In this case, > the default OOC pushes data on disk and immediately after a major GC it > brings back the data to memory. This causes inefficiency in the default out > of core mechanism. If out-of-core is used but the data can entirely fit in > memory, the job goes out of core even though going out of core is not > necessary. > To address this issue, we need to have a mechanism to more accurately know > how much of heap is filled with non-garbage data. Consequently, we need to > change the Oracle (OOC policy) to take advantage of a more accurate memory > usage estimation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core
[ https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15773389#comment-15773389 ] ASF GitHub Bot commented on GIRAPH-1125: Github user asfgit closed the pull request at: https://github.com/apache/giraph/pull/12 > Add memory estimation mechanism to out-of-core > -- > > Key: GIRAPH-1125 > URL: https://issues.apache.org/jira/browse/GIRAPH-1125 > Project: Giraph > Issue Type: Improvement >Reporter: Hassan Eslami >Assignee: Hassan Eslami > > The new out-of-core mechanism is designed with the adaptivity goal in mind, > meaning that we wanted out-of-core mechanism to kick in only when it is > necessary. In other words, when the amount of data (graph, messages, and > mutations) all fit in memory, we want to take advantage of the entire memory. > And, when in a stage the memory is short, only enough (minimal) amount of > data goes out of core (to disk). This ensures a good performance for the > out-of-core mechanism. > To satisfy the adaptiveness goal, we need to know how much memory is used at > each point of time. The default out-of-core mechanism (ThresholdBasedOracle) > get memory information based on JVM's internal methods (Runtime's > freeMemory()). This method is inaccurate (and pessimistic), meaning that it > does not account for garbage data that has not been purged by GC. Using JVM's > default methods, OOC behaves pessimistically and move data out of core even > if it is not necessary. For instance, consider the case where there are a lot > of garbage on the heap, but GC has not happened for a while. In this case, > the default OOC pushes data on disk and immediately after a major GC it > brings back the data to memory. This causes inefficiency in the default out > of core mechanism. If out-of-core is used but the data can entirely fit in > memory, the job goes out of core even though going out of core is not > necessary. > To address this issue, we need to have a mechanism to more accurately know > how much of heap is filled with non-garbage data. Consequently, we need to > change the Oracle (OOC policy) to take advantage of a more accurate memory > usage estimation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core
[ https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15781296#comment-15781296 ] ASF GitHub Bot commented on GIRAPH-1125: Github user dlogothetis commented on the issue: https://github.com/apache/giraph/pull/12 Somehow the file MemoryEstimatorOracle.java wasn't committed. I pulled the most recent changes, I can see the commit log (see below), but this file is missing. commit f5b685efa09b539b1f95925405723f7ac7b1dcea Author: Hassan Eslami Date: Fri Dec 23 12:03:37 2016 -0600 GIRAPH-1125 Closes #12 > Add memory estimation mechanism to out-of-core > -- > > Key: GIRAPH-1125 > URL: https://issues.apache.org/jira/browse/GIRAPH-1125 > Project: Giraph > Issue Type: Improvement >Reporter: Hassan Eslami >Assignee: Hassan Eslami > > The new out-of-core mechanism is designed with the adaptivity goal in mind, > meaning that we wanted out-of-core mechanism to kick in only when it is > necessary. In other words, when the amount of data (graph, messages, and > mutations) all fit in memory, we want to take advantage of the entire memory. > And, when in a stage the memory is short, only enough (minimal) amount of > data goes out of core (to disk). This ensures a good performance for the > out-of-core mechanism. > To satisfy the adaptiveness goal, we need to know how much memory is used at > each point of time. The default out-of-core mechanism (ThresholdBasedOracle) > get memory information based on JVM's internal methods (Runtime's > freeMemory()). This method is inaccurate (and pessimistic), meaning that it > does not account for garbage data that has not been purged by GC. Using JVM's > default methods, OOC behaves pessimistically and move data out of core even > if it is not necessary. For instance, consider the case where there are a lot > of garbage on the heap, but GC has not happened for a while. In this case, > the default OOC pushes data on disk and immediately after a major GC it > brings back the data to memory. This causes inefficiency in the default out > of core mechanism. If out-of-core is used but the data can entirely fit in > memory, the job goes out of core even though going out of core is not > necessary. > To address this issue, we need to have a mechanism to more accurately know > how much of heap is filled with non-garbage data. Consequently, we need to > change the Oracle (OOC policy) to take advantage of a more accurate memory > usage estimation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (GIRAPH-1129) SocialHash
[ https://issues.apache.org/jira/browse/GIRAPH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822101#comment-15822101 ] ASF GitHub Bot commented on GIRAPH-1129: GitHub user ikabiljo opened a pull request: https://github.com/apache/giraph/pull/14 Generate more useful functional interfaces GIRAPH-1129 You can merge this pull request into a Git repository by running: $ git pull https://github.com/ikabiljo/giraph gen_prim Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/14.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #14 commit 7ba49ecf6e4999aa7fb73d3b97dba8d86eacbc4a Author: Igor Kabiljo Date: 2017-01-13T09:06:53Z Generate more useful functional interfaces > SocialHash > -- > > Key: GIRAPH-1129 > URL: https://issues.apache.org/jira/browse/GIRAPH-1129 > Project: Giraph > Issue Type: New Feature >Reporter: Igor Kabiljo >Assignee: Igor Kabiljo > > Hypergraph graph partitioner -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (GIRAPH-1129) SocialHash
[ https://issues.apache.org/jira/browse/GIRAPH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822147#comment-15822147 ] ASF GitHub Bot commented on GIRAPH-1129: Github user asfgit closed the pull request at: https://github.com/apache/giraph/pull/13 > SocialHash > -- > > Key: GIRAPH-1129 > URL: https://issues.apache.org/jira/browse/GIRAPH-1129 > Project: Giraph > Issue Type: New Feature >Reporter: Igor Kabiljo >Assignee: Igor Kabiljo > > Hypergraph graph partitioner -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (GIRAPH-1129) SocialHash
[ https://issues.apache.org/jira/browse/GIRAPH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822204#comment-15822204 ] ASF GitHub Bot commented on GIRAPH-1129: Github user asfgit closed the pull request at: https://github.com/apache/giraph/pull/14 > SocialHash > -- > > Key: GIRAPH-1129 > URL: https://issues.apache.org/jira/browse/GIRAPH-1129 > Project: Giraph > Issue Type: New Feature >Reporter: Igor Kabiljo >Assignee: Igor Kabiljo > > Hypergraph graph partitioner -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (GIRAPH-1129) SocialHash
[ https://issues.apache.org/jira/browse/GIRAPH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822498#comment-15822498 ] ASF GitHub Bot commented on GIRAPH-1129: GitHub user ikabiljo opened a pull request: https://github.com/apache/giraph/pull/15 Allow Pieces that don't go over vertices, but over "recepients of messages" GIRAPH-1129 You can merge this pull request into a Git repository by running: $ git pull https://github.com/ikabiljo/giraph no_vtx Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/15.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15 commit ee1f5ee3aeacaac7422c61189b5b3e03f7b31db8 Author: Igor Kabiljo Date: 2017-01-13T23:03:34Z Allow Pieces that don't go over vertices, but over "recepients of messages" > SocialHash > -- > > Key: GIRAPH-1129 > URL: https://issues.apache.org/jira/browse/GIRAPH-1129 > Project: Giraph > Issue Type: New Feature >Reporter: Igor Kabiljo >Assignee: Igor Kabiljo > > Hypergraph graph partitioner -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (GIRAPH-1130) Fix RepeatUntilBlock
[ https://issues.apache.org/jira/browse/GIRAPH-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832198#comment-15832198 ] ASF GitHub Bot commented on GIRAPH-1130: GitHub user ikabiljo opened a pull request: https://github.com/apache/giraph/pull/16 Fix RepeatUntilBlock https://issues.apache.org/jira/browse/GIRAPH-1130 You can merge this pull request into a Git repository by running: $ git pull https://github.com/ikabiljo/giraph fix_repeat Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/16.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #16 commit adc653fc11644585c4d0d79001c2a7eaebd4cf62 Author: Igor Kabiljo Date: 2017-01-20T17:07:03Z Fix RepeatUntilBlock > Fix RepeatUntilBlock > > > Key: GIRAPH-1130 > URL: https://issues.apache.org/jira/browse/GIRAPH-1130 > Project: Giraph > Issue Type: Bug >Reporter: Igor Kabiljo >Assignee: Igor Kabiljo > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (GIRAPH-1129) SocialHash
[ https://issues.apache.org/jira/browse/GIRAPH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832250#comment-15832250 ] ASF GitHub Bot commented on GIRAPH-1129: Github user asfgit closed the pull request at: https://github.com/apache/giraph/pull/15 > SocialHash > -- > > Key: GIRAPH-1129 > URL: https://issues.apache.org/jira/browse/GIRAPH-1129 > Project: Giraph > Issue Type: New Feature >Reporter: Igor Kabiljo >Assignee: Igor Kabiljo > > Hypergraph graph partitioner -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (GIRAPH-1129) SocialHash
[ https://issues.apache.org/jira/browse/GIRAPH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832718#comment-15832718 ] ASF GitHub Bot commented on GIRAPH-1129: GitHub user ikabiljo opened a pull request: https://github.com/apache/giraph/pull/17 Extending generated code and adding ShortWritable - adding ShortWritable - adding all combinations for T2TFunction - making BasicSet/Map being generated - no code changes to implementations, just generating byte/short as well GIRAPH-1129 You can merge this pull request into a Git repository by running: $ git pull https://github.com/ikabiljo/giraph primitives Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/17.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #17 commit 1488a03a1ab72a2d42fe93a29c0330f04ea4304c Author: Igor Kabiljo Date: 2017-01-17T19:03:48Z Extending generated code and adding ShortWritable > SocialHash > -- > > Key: GIRAPH-1129 > URL: https://issues.apache.org/jira/browse/GIRAPH-1129 > Project: Giraph > Issue Type: New Feature >Reporter: Igor Kabiljo >Assignee: Igor Kabiljo > > Hypergraph graph partitioner -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (GIRAPH-1130) Fix RepeatUntilBlock
[ https://issues.apache.org/jira/browse/GIRAPH-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15843372#comment-15843372 ] ASF GitHub Bot commented on GIRAPH-1130: Github user asfgit closed the pull request at: https://github.com/apache/giraph/pull/16 > Fix RepeatUntilBlock > > > Key: GIRAPH-1130 > URL: https://issues.apache.org/jira/browse/GIRAPH-1130 > Project: Giraph > Issue Type: Bug >Reporter: Igor Kabiljo >Assignee: Igor Kabiljo > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (GIRAPH-1043) Implementation of Darwini graph generator
[ https://issues.apache.org/jira/browse/GIRAPH-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15850914#comment-15850914 ] ASF GitHub Bot commented on GIRAPH-1043: GitHub user edunov opened a pull request: https://github.com/apache/giraph/pull/19 GIRAPH-1043 Implementation of Darwini graph generator You can merge this pull request into a Git repository by running: $ git pull https://github.com/edunov/giraph darwini Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/19.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19 commit 08aac32cf8440825921cb1eea935ce5b76c0a784 Author: Sergey Edunov Date: 2016-03-10T21:48:47Z Darwini graph generator > Implementation of Darwini graph generator > - > > Key: GIRAPH-1043 > URL: https://issues.apache.org/jira/browse/GIRAPH-1043 > Project: Giraph > Issue Type: Task >Reporter: Sergey Edunov >Assignee: Sergey Edunov > > Implementation of graph generator that is able to capture many properties of > social graphs, such as high local clustering coefficient, non-power law > degree distributions and log normal joint degree distribution. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1132) Giraph jobs don't end if zookeeper dies before job starts
[ https://issues.apache.org/jira/browse/GIRAPH-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891109#comment-15891109 ] ASF GitHub Bot commented on GIRAPH-1132: GitHub user edunov opened a pull request: https://github.com/apache/giraph/pull/21 GIRAPH-1132 Giraph jobs don't end if zookeeper dies before job starts I'm not sure I set all the timeouts right. There is no way to test all of these either. The idea is that we shouldn't have infinite wait loops anywhere. And that's exactly what this diff does You can merge this pull request into a Git repository by running: $ git pull https://github.com/edunov/giraph timeout Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/21.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #21 commit cdbe7d4a46d80611fb5846eeeab37b94e66781a1 Author: Sergey Edunov Date: 2017-03-01T21:41:37Z GIRAPH-1132 Giraph jobs don't end if zookeeper dies before job starts > Giraph jobs don't end if zookeeper dies before job starts > - > > Key: GIRAPH-1132 > URL: https://issues.apache.org/jira/browse/GIRAPH-1132 > Project: Giraph > Issue Type: Bug >Reporter: Sergey Edunov > > There are multiple places in the Giraph code where we waitForever() on some > event (e.g. all workers to finish or zookeeper to come up). This is in > general bad, as any issue on other side may become undetected and make job > run forever. We need to introduce timeout to these waits -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1132) Giraph jobs don't end if zookeeper dies before job starts
[ https://issues.apache.org/jira/browse/GIRAPH-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891144#comment-15891144 ] ASF GitHub Bot commented on GIRAPH-1132: Github user majakabiljo commented on a diff in the pull request: https://github.com/apache/giraph/pull/21#discussion_r103799656 --- Diff: giraph-core/src/main/java/org/apache/giraph/conf/GiraphConstants.java --- @@ -1238,5 +1239,27 @@ BooleanConfOption PREFER_IP_ADDRESSES = new BooleanConfOption("giraph.preferIP", false, "Prefer IP addresses instead of host names"); + + /** + * Timeout for "waitForever", when we need to wait for zookeeper. + * Since we should never really have to wait forever. + * We should only wait some reasonable but large amount of time. + */ + LongConfOption WAIT_FOREVER_ZOOKEEPER_TIMEOUT_MSEC = --- End diff -- Nit: since we are not waiting forever anymore, I'd drop word forever from everywhere (forever and timeout have opposite meaning :-)) > Giraph jobs don't end if zookeeper dies before job starts > - > > Key: GIRAPH-1132 > URL: https://issues.apache.org/jira/browse/GIRAPH-1132 > Project: Giraph > Issue Type: Bug >Reporter: Sergey Edunov > > There are multiple places in the Giraph code where we waitForever() on some > event (e.g. all workers to finish or zookeeper to come up). This is in > general bad, as any issue on other side may become undetected and make job > run forever. We need to introduce timeout to these waits -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1132) Giraph jobs don't end if zookeeper dies before job starts
[ https://issues.apache.org/jira/browse/GIRAPH-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891157#comment-15891157 ] ASF GitHub Bot commented on GIRAPH-1132: Github user asfgit closed the pull request at: https://github.com/apache/giraph/pull/21 > Giraph jobs don't end if zookeeper dies before job starts > - > > Key: GIRAPH-1132 > URL: https://issues.apache.org/jira/browse/GIRAPH-1132 > Project: Giraph > Issue Type: Bug >Reporter: Sergey Edunov > > There are multiple places in the Giraph code where we waitForever() on some > event (e.g. all workers to finish or zookeeper to come up). This is in > general bad, as any issue on other side may become undetected and make job > run forever. We need to introduce timeout to these waits -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1133) Fix JobProgressTracker in OverrideExceptionHandler
[ https://issues.apache.org/jira/browse/GIRAPH-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15894934#comment-15894934 ] ASF GitHub Bot commented on GIRAPH-1133: GitHub user majakabiljo opened a pull request: https://github.com/apache/giraph/pull/22 GIRAPH-1133: Fix JobProgressTracker in OverrideExceptionHandler Summary: We create OverrideExceptionHandler before JobProgressTracker, so it can't report errors to command line. Test Plan: Ran a job with exception caught by OverrideExceptionHandler before and after the change You can merge this pull request into a Git repository by running: $ git pull https://github.com/majakabiljo/giraph jobProgress Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/22.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22 commit 47c11d65b5f740f374053e9b842e498e7802e919 Author: Maja Kabiljo Date: 2017-03-03T19:40:19Z GIRAPH-1133: Fix JobProgressTracker in OverrideExceptionHandler Summary: We create OverrideExceptionHandler before JobProgressTracker, so it can't report errors to command line. Test Plan: Ran a job with exception caught by OverrideExceptionHandler before and after the change > Fix JobProgressTracker in OverrideExceptionHandler > -- > > Key: GIRAPH-1133 > URL: https://issues.apache.org/jira/browse/GIRAPH-1133 > Project: Giraph > Issue Type: Bug >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > We create OverrideExceptionHandler before JobProgressTracker, so it can't > report errors to command line. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1134) Track number of input splits in command line
[ https://issues.apache.org/jira/browse/GIRAPH-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899815#comment-15899815 ] ASF GitHub Bot commented on GIRAPH-1134: GitHub user majakabiljo opened a pull request: https://github.com/apache/giraph/pull/24 GIRAPH-1134: Track number of input splits in command line Summary: The progress we track during input reports how much data have we read, but not how much data there is to read. Test Plan: Now it prints something like: Loading data: 70046603 vertices loaded, 658 vertex input splits loaded (out of 659); 7122062703 edges loaded, 4 edge input splits loaded (out of 1322) You can merge this pull request into a Git repository by running: $ git pull https://github.com/majakabiljo/giraph numSplits Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/24.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #24 commit 091dc7bcb819139f517f37c67f7dc0748965a1b8 Author: Maja Kabiljo Date: 2017-03-07T17:32:13Z GIRAPH-1134: Track number of input splits in command line Summary: The progress we track during input reports how much data have we read, but not how much data there is to read. Test Plan: Now it prints something like: Loading data: 70046603 vertices loaded, 658 vertex input splits loaded (out of 659); 7122062703 edges loaded, 4 edge input splits loaded (out of 1322) > Track number of input splits in command line > > > Key: GIRAPH-1134 > URL: https://issues.apache.org/jira/browse/GIRAPH-1134 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > The progress we track during input reports how much data have we read, but > not how much data there is to read. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1133) Fix JobProgressTracker in OverrideExceptionHandler
[ https://issues.apache.org/jira/browse/GIRAPH-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900132#comment-15900132 ] ASF GitHub Bot commented on GIRAPH-1133: Github user asfgit closed the pull request at: https://github.com/apache/giraph/pull/22 > Fix JobProgressTracker in OverrideExceptionHandler > -- > > Key: GIRAPH-1133 > URL: https://issues.apache.org/jira/browse/GIRAPH-1133 > Project: Giraph > Issue Type: Bug >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > We create OverrideExceptionHandler before JobProgressTracker, so it can't > report errors to command line. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1043) Implementation of Darwini graph generator
[ https://issues.apache.org/jira/browse/GIRAPH-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924190#comment-15924190 ] ASF GitHub Bot commented on GIRAPH-1043: Github user cheonruen commented on the issue: https://github.com/apache/giraph/pull/19 Do you have any tutorial for this? > Implementation of Darwini graph generator > - > > Key: GIRAPH-1043 > URL: https://issues.apache.org/jira/browse/GIRAPH-1043 > Project: Giraph > Issue Type: Task >Reporter: Sergey Edunov >Assignee: Sergey Edunov > > Implementation of graph generator that is able to capture many properties of > social graphs, such as high local clustering coefficient, non-power law > degree distributions and log normal joint degree distribution. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1137) Remove channel probing from Netty worker thread for credit-based flow-control
[ https://issues.apache.org/jira/browse/GIRAPH-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15928741#comment-15928741 ] ASF GitHub Bot commented on GIRAPH-1137: GitHub user heslami opened a pull request: https://github.com/apache/giraph/pull/26 [GIRAPH-1137] Remove channel probing from Netty worker thread for credit-based flow… In credit-based flow-control, sometimes, client threads (one type of Netty worker threads used in Giraph) try to send requests to other workers. This is bad practice for Netty and can cause Netty to mark the execution as deadlock-prone (an example exception shown below). Client threads should only be responsible for sending ACK/NACK messages in response to requests, and they should do so by reuseing the channel from which they received the request. In the current implementation, client threads may try to send unsent/cached requests in credit-based flow control. Sending such requests should be delegated to other threads. WARN 2017-03-08 06:06:22,104 [netty-client-worker-3] io.netty.util.concurrent.BlockingOperationException: DefaultChannelPromise@2c455378(incomplete) at io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:383) at io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:157) at io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:343) at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:259) at org.apache.giraph.utils.ProgressableUtils$ChannelFutureWaitable.waitFor(ProgressableUtils.java:461) at org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:214) at org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:180) at org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:165) at org.apache.giraph.utils.ProgressableUtils.awaitChannelFuture(ProgressableUtils.java:132) at org.apache.giraph.comm.netty.NettyClient.getNextChannel(NettyClient.java:715) at org.apache.giraph.comm.netty.NettyClient.writeRequestToChannel(NettyClient.java:799) at org.apache.giraph.comm.netty.NettyClient.doSend(NettyClient.java:789) at org.apache.giraph.comm.flow_control.CreditBasedFlowControl.trySendCachedRequests(CreditBasedFlowControl.java:515) at org.apache.giraph.comm.flow_control.CreditBasedFlowControl.messageAckReceived(CreditBasedFlowControl.java:485) at org.apache.giraph.comm.netty.NettyClient.messageReceived(NettyClient.java:840) at org.apache.giraph.comm.netty.handler.ResponseClientHandler.channelRead(ResponseClientHandler.java:87) at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338) at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324) at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153) at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338) at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324) at org.apache.giraph.comm.netty.InboundByteCounter.channelRead(InboundByteCounter.java:89) at io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338) at io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324) at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785) at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:126) at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485) at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452) at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346) at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101) at java.lang.Thread.run(Thread.java:745) You can merge this pull request into a Git repository by running: $ git pull https://github.com/heslami/giraph fix-credit-based Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/26.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #26 commit 4c8186cc8097877d5af20ef054630a629caaa026 Author: Hassan Eslami Date: 2017-03-16T19:52:12Z Remove channel probing from Netty worker thread for credit-based flow-control Closes GIRAPH-1137 > Remove channel probing from Netty worker thread for credit-based flow-control > - > >
[jira] [Commented] (GIRAPH-1134) Track number of input splits in command line
[ https://issues.apache.org/jira/browse/GIRAPH-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930356#comment-15930356 ] ASF GitHub Bot commented on GIRAPH-1134: Github user asfgit closed the pull request at: https://github.com/apache/giraph/pull/24 > Track number of input splits in command line > > > Key: GIRAPH-1134 > URL: https://issues.apache.org/jira/browse/GIRAPH-1134 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > The progress we track during input reports how much data have we read, but > not how much data there is to read. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1137) Remove channel probing from Netty worker thread for credit-based flow-control
[ https://issues.apache.org/jira/browse/GIRAPH-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15933077#comment-15933077 ] ASF GitHub Bot commented on GIRAPH-1137: Github user dlogothetis commented on the issue: https://github.com/apache/giraph/pull/26 There are a couple of checkstyle errors: giraph/giraph-core/target/munged/main/org/apache/giraph/comm/flow_control/CreditBasedFlowControl.java:48:8: Unused import - java.util.concurrent.LinkedBlockingQueue. giraph/giraph-core/target/munged/main/org/apache/giraph/comm/flow_control/CreditBasedFlowControl.java:172:3: Missing a Javadoc comment. > Remove channel probing from Netty worker thread for credit-based flow-control > - > > Key: GIRAPH-1137 > URL: https://issues.apache.org/jira/browse/GIRAPH-1137 > Project: Giraph > Issue Type: Bug >Reporter: Hassan Eslami >Assignee: Hassan Eslami > > In credit-based flow-control, sometimes, client threads (one type of Netty > worker threads used in Giraph) try to send requests to other workers. This is > bad practice for Netty and can cause Netty to mark the execution as > deadlock-prone (an example exception shown below). Client threads should only > be responsible for sending ACK/NACK messages in response to requests, and > they should do so by reuseing the channel from which they received the > request. In the current implementation, client threads may try to send > unsent/cached requests in credit-based flow control. Sending such requests > should be delegated to other threads. > WARN 2017-03-08 06:06:22,104 [netty-client-worker-3] > io.netty.util.concurrent.BlockingOperationException: > DefaultChannelPromise@2c455378(incomplete) > at > io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:383) > at > io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:157) > at io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:343) > at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:259) > at > org.apache.giraph.utils.ProgressableUtils$ChannelFutureWaitable.waitFor(ProgressableUtils.java:461) > at > org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:214) > at > org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:180) > at > org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:165) > at > org.apache.giraph.utils.ProgressableUtils.awaitChannelFuture(ProgressableUtils.java:132) > at > org.apache.giraph.comm.netty.NettyClient.getNextChannel(NettyClient.java:715) > at > org.apache.giraph.comm.netty.NettyClient.writeRequestToChannel(NettyClient.java:799) > at org.apache.giraph.comm.netty.NettyClient.doSend(NettyClient.java:789) > at > org.apache.giraph.comm.flow_control.CreditBasedFlowControl.trySendCachedRequests(CreditBasedFlowControl.java:515) > at > org.apache.giraph.comm.flow_control.CreditBasedFlowControl.messageAckReceived(CreditBasedFlowControl.java:485) > at > org.apache.giraph.comm.netty.NettyClient.messageReceived(NettyClient.java:840) > at > org.apache.giraph.comm.netty.handler.ResponseClientHandler.channelRead(ResponseClientHandler.java:87) > at > io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338) > at > io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324) > at > io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153) > at > io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338) > at > io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324) > at > org.apache.giraph.comm.netty.InboundByteCounter.channelRead(InboundByteCounter.java:89) > at > io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338) > at > io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:126) > at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1137) Remove channel probing from Netty worker thread for credit-based flow-control
[ https://issues.apache.org/jira/browse/GIRAPH-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15933911#comment-15933911 ] ASF GitHub Bot commented on GIRAPH-1137: Github user dlogothetis commented on the issue: https://github.com/apache/giraph/pull/26 Tests done: - Snapshot tests. - Performance on large PageRank benchmark remains the same. - Performance on internal prod job remains the same. > Remove channel probing from Netty worker thread for credit-based flow-control > - > > Key: GIRAPH-1137 > URL: https://issues.apache.org/jira/browse/GIRAPH-1137 > Project: Giraph > Issue Type: Bug >Reporter: Hassan Eslami >Assignee: Hassan Eslami > > In credit-based flow-control, sometimes, client threads (one type of Netty > worker threads used in Giraph) try to send requests to other workers. This is > bad practice for Netty and can cause Netty to mark the execution as > deadlock-prone (an example exception shown below). Client threads should only > be responsible for sending ACK/NACK messages in response to requests, and > they should do so by reuseing the channel from which they received the > request. In the current implementation, client threads may try to send > unsent/cached requests in credit-based flow control. Sending such requests > should be delegated to other threads. > WARN 2017-03-08 06:06:22,104 [netty-client-worker-3] > io.netty.util.concurrent.BlockingOperationException: > DefaultChannelPromise@2c455378(incomplete) > at > io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:383) > at > io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:157) > at io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:343) > at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:259) > at > org.apache.giraph.utils.ProgressableUtils$ChannelFutureWaitable.waitFor(ProgressableUtils.java:461) > at > org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:214) > at > org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:180) > at > org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:165) > at > org.apache.giraph.utils.ProgressableUtils.awaitChannelFuture(ProgressableUtils.java:132) > at > org.apache.giraph.comm.netty.NettyClient.getNextChannel(NettyClient.java:715) > at > org.apache.giraph.comm.netty.NettyClient.writeRequestToChannel(NettyClient.java:799) > at org.apache.giraph.comm.netty.NettyClient.doSend(NettyClient.java:789) > at > org.apache.giraph.comm.flow_control.CreditBasedFlowControl.trySendCachedRequests(CreditBasedFlowControl.java:515) > at > org.apache.giraph.comm.flow_control.CreditBasedFlowControl.messageAckReceived(CreditBasedFlowControl.java:485) > at > org.apache.giraph.comm.netty.NettyClient.messageReceived(NettyClient.java:840) > at > org.apache.giraph.comm.netty.handler.ResponseClientHandler.channelRead(ResponseClientHandler.java:87) > at > io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338) > at > io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324) > at > io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153) > at > io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338) > at > io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324) > at > org.apache.giraph.comm.netty.InboundByteCounter.channelRead(InboundByteCounter.java:89) > at > io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338) > at > io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:126) > at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1137) Remove channel probing from Netty worker thread for credit-based flow-control
[ https://issues.apache.org/jira/browse/GIRAPH-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15934801#comment-15934801 ] ASF GitHub Bot commented on GIRAPH-1137: Github user dlogothetis commented on the issue: https://github.com/apache/giraph/pull/26 Error was reproduced by decreasing the netty client threads. > Remove channel probing from Netty worker thread for credit-based flow-control > - > > Key: GIRAPH-1137 > URL: https://issues.apache.org/jira/browse/GIRAPH-1137 > Project: Giraph > Issue Type: Bug >Reporter: Hassan Eslami >Assignee: Hassan Eslami > > In credit-based flow-control, sometimes, client threads (one type of Netty > worker threads used in Giraph) try to send requests to other workers. This is > bad practice for Netty and can cause Netty to mark the execution as > deadlock-prone (an example exception shown below). Client threads should only > be responsible for sending ACK/NACK messages in response to requests, and > they should do so by reuseing the channel from which they received the > request. In the current implementation, client threads may try to send > unsent/cached requests in credit-based flow control. Sending such requests > should be delegated to other threads. > WARN 2017-03-08 06:06:22,104 [netty-client-worker-3] > io.netty.util.concurrent.BlockingOperationException: > DefaultChannelPromise@2c455378(incomplete) > at > io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:383) > at > io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:157) > at io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:343) > at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:259) > at > org.apache.giraph.utils.ProgressableUtils$ChannelFutureWaitable.waitFor(ProgressableUtils.java:461) > at > org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:214) > at > org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:180) > at > org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:165) > at > org.apache.giraph.utils.ProgressableUtils.awaitChannelFuture(ProgressableUtils.java:132) > at > org.apache.giraph.comm.netty.NettyClient.getNextChannel(NettyClient.java:715) > at > org.apache.giraph.comm.netty.NettyClient.writeRequestToChannel(NettyClient.java:799) > at org.apache.giraph.comm.netty.NettyClient.doSend(NettyClient.java:789) > at > org.apache.giraph.comm.flow_control.CreditBasedFlowControl.trySendCachedRequests(CreditBasedFlowControl.java:515) > at > org.apache.giraph.comm.flow_control.CreditBasedFlowControl.messageAckReceived(CreditBasedFlowControl.java:485) > at > org.apache.giraph.comm.netty.NettyClient.messageReceived(NettyClient.java:840) > at > org.apache.giraph.comm.netty.handler.ResponseClientHandler.channelRead(ResponseClientHandler.java:87) > at > io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338) > at > io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324) > at > io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153) > at > io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338) > at > io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324) > at > org.apache.giraph.comm.netty.InboundByteCounter.channelRead(InboundByteCounter.java:89) > at > io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338) > at > io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:126) > at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1137) Remove channel probing from Netty worker thread for credit-based flow-control
[ https://issues.apache.org/jira/browse/GIRAPH-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15934816#comment-15934816 ] ASF GitHub Bot commented on GIRAPH-1137: Github user majakabiljo commented on a diff in the pull request: https://github.com/apache/giraph/pull/26#discussion_r107200544 --- Diff: giraph-core/src/main/java/org/apache/giraph/comm/flow_control/CreditBasedFlowControl.java --- @@ -215,10 +231,38 @@ public void run() { } } }); -thread.setUncaughtExceptionHandler(exceptionHandler); -thread.setName("resume-sender"); -thread.setDaemon(true); -thread.start(); +resumeHandlerThread.setUncaughtExceptionHandler(exceptionHandler); +resumeHandlerThread.setName("resume-sender"); +resumeHandlerThread.setDaemon(true); +resumeHandlerThread.start(); + +// Thread to handle/send cached requests +Thread cachedRequestHandlerThread = new Thread(new Runnable() { + @Override + public void run() { +while (true) { + Pair pair = null; + try { +pair = toBeSent.take(); + } catch (InterruptedException e) { +throw new IllegalStateException("run: failed while waiting to " + +"take an element from the request queue!", e); + } + int taskId = pair.getLeft(); + WritableRequest request = pair.getRight(); + nettyClient.doSend(taskId, request); + if (aggregateUnsentRequests.decrementAndGet() == 0) { +synchronized (aggregateUnsentRequests) { + aggregateUnsentRequests.notifyAll(); +} + } +} + } +}); + cachedRequestHandlerThread.setUncaughtExceptionHandler(exceptionHandler); +cachedRequestHandlerThread.setName("cached-req-sender"); +cachedRequestHandlerThread.setDaemon(true); +cachedRequestHandlerThread.start(); --- End diff -- You can create a utility method like ThreadUtils.startThread with exception handler. > Remove channel probing from Netty worker thread for credit-based flow-control > - > > Key: GIRAPH-1137 > URL: https://issues.apache.org/jira/browse/GIRAPH-1137 > Project: Giraph > Issue Type: Bug >Reporter: Hassan Eslami >Assignee: Hassan Eslami > > In credit-based flow-control, sometimes, client threads (one type of Netty > worker threads used in Giraph) try to send requests to other workers. This is > bad practice for Netty and can cause Netty to mark the execution as > deadlock-prone (an example exception shown below). Client threads should only > be responsible for sending ACK/NACK messages in response to requests, and > they should do so by reuseing the channel from which they received the > request. In the current implementation, client threads may try to send > unsent/cached requests in credit-based flow control. Sending such requests > should be delegated to other threads. > WARN 2017-03-08 06:06:22,104 [netty-client-worker-3] > io.netty.util.concurrent.BlockingOperationException: > DefaultChannelPromise@2c455378(incomplete) > at > io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:383) > at > io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:157) > at io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:343) > at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:259) > at > org.apache.giraph.utils.ProgressableUtils$ChannelFutureWaitable.waitFor(ProgressableUtils.java:461) > at > org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:214) > at > org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:180) > at > org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:165) > at > org.apache.giraph.utils.ProgressableUtils.awaitChannelFuture(ProgressableUtils.java:132) > at > org.apache.giraph.comm.netty.NettyClient.getNextChannel(NettyClient.java:715) > at > org.apache.giraph.comm.netty.NettyClient.writeRequestToChannel(NettyClient.java:799) > at org.apache.giraph.comm.netty.NettyClient.doSend(NettyClient.java:789) > at > org.apache.giraph.comm.flow_control.CreditBasedFlowControl.trySendCachedRequests(CreditBasedFlowControl.java:515) > at > org.apache.giraph.comm.flow_control.CreditBasedFlowControl.messageAckReceived(CreditBasedFlowControl.java:485) > at > org.apache.giraph.comm.netty.NettyClient.messageReceived(NettyClient.java:840) > at > org.apache.giraph.comm.netty.handler.ResponseClientHandler.channelRead(ResponseClientHandler.java:87) > at > io.ne
[jira] [Commented] (GIRAPH-1138) Don't wrap exceptions from executor service
[ https://issues.apache.org/jira/browse/GIRAPH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15936985#comment-15936985 ] ASF GitHub Bot commented on GIRAPH-1138: GitHub user majakabiljo opened a pull request: https://github.com/apache/giraph/pull/27 [GIRAPH-1138] Don't wrap exceptions from executor service Summary: In ProgressableUtils.getResultsWithNCallables we wrap exceptions from underlying threads, making logs hard to read. We should re-throw original exception when possible. Test Plan: Ran a job which fails in one of input threads before and after change, verified exception is clear now You can merge this pull request into a Git repository by running: $ git pull https://github.com/majakabiljo/giraph exceptions Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/27.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #27 commit a460f582d9b8aa875cb32ca4f8dae1c320e9ac23 Author: Maja Kabiljo Date: 2017-03-22T19:39:19Z [GIRAPH-1138] Don't wrap exceptions from executor service Summary: In ProgressableUtils.getResultsWithNCallables we wrap exceptions from underlying threads, making logs hard to read. We should re-throw original exception when possible. Test Plan: Ran a job which fails in one of input threads before and after change, verified exception is clear now > Don't wrap exceptions from executor service > --- > > Key: GIRAPH-1138 > URL: https://issues.apache.org/jira/browse/GIRAPH-1138 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > In ProgressableUtils.getResultsWithNCallables we wrap exceptions from > underlying threads, making logs hard to read. We should re-throw original > exception when possible. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1137) Remove channel probing from Netty worker thread for credit-based flow-control
[ https://issues.apache.org/jira/browse/GIRAPH-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15943777#comment-15943777 ] ASF GitHub Bot commented on GIRAPH-1137: Github user asfgit closed the pull request at: https://github.com/apache/giraph/pull/26 > Remove channel probing from Netty worker thread for credit-based flow-control > - > > Key: GIRAPH-1137 > URL: https://issues.apache.org/jira/browse/GIRAPH-1137 > Project: Giraph > Issue Type: Bug >Reporter: Hassan Eslami >Assignee: Hassan Eslami > > In credit-based flow-control, sometimes, client threads (one type of Netty > worker threads used in Giraph) try to send requests to other workers. This is > bad practice for Netty and can cause Netty to mark the execution as > deadlock-prone (an example exception shown below). Client threads should only > be responsible for sending ACK/NACK messages in response to requests, and > they should do so by reuseing the channel from which they received the > request. In the current implementation, client threads may try to send > unsent/cached requests in credit-based flow control. Sending such requests > should be delegated to other threads. > WARN 2017-03-08 06:06:22,104 [netty-client-worker-3] > io.netty.util.concurrent.BlockingOperationException: > DefaultChannelPromise@2c455378(incomplete) > at > io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:383) > at > io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:157) > at io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:343) > at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:259) > at > org.apache.giraph.utils.ProgressableUtils$ChannelFutureWaitable.waitFor(ProgressableUtils.java:461) > at > org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:214) > at > org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:180) > at > org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:165) > at > org.apache.giraph.utils.ProgressableUtils.awaitChannelFuture(ProgressableUtils.java:132) > at > org.apache.giraph.comm.netty.NettyClient.getNextChannel(NettyClient.java:715) > at > org.apache.giraph.comm.netty.NettyClient.writeRequestToChannel(NettyClient.java:799) > at org.apache.giraph.comm.netty.NettyClient.doSend(NettyClient.java:789) > at > org.apache.giraph.comm.flow_control.CreditBasedFlowControl.trySendCachedRequests(CreditBasedFlowControl.java:515) > at > org.apache.giraph.comm.flow_control.CreditBasedFlowControl.messageAckReceived(CreditBasedFlowControl.java:485) > at > org.apache.giraph.comm.netty.NettyClient.messageReceived(NettyClient.java:840) > at > org.apache.giraph.comm.netty.handler.ResponseClientHandler.channelRead(ResponseClientHandler.java:87) > at > io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338) > at > io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324) > at > io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153) > at > io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338) > at > io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324) > at > org.apache.giraph.comm.netty.InboundByteCounter.channelRead(InboundByteCounter.java:89) > at > io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338) > at > io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:126) > at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work
[ https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15943980#comment-15943980 ] ASF GitHub Bot commented on GIRAPH-1139: GitHub user neggert opened a pull request: https://github.com/apache/giraph/pull/30 [GIRAPH-1139] Fix resuming from checkpoint A couple of fixes that get resuming from checkpoint working. * Set checkpointStatus to NONE in master when restarting from checkpoint. Workers already do this, so the job hangs when restarting from checkpoint while the master waits for workers to create checkpoints they're never going to create. * Set unique task id for each worker attempt Previously, a worker would reuse the task id from the prior attempt. This gets propagated to the Netty client id, which makes the master think it has already processed any requests that come from that client, causing it to discard them. This obviously causes problems. And also a fix for GIRAPH-1136. We will now checkpoint on superstep 0 if checkpointing is enabled. Let me know if you'd rather I sent a separate PR for this. Testing: Ran custom Label Propagation implementation with checkpointing on a ~5b node graph. Manually killed workers (by logging in to worker node and running `kill -9 `. Ensured that Giraph successfully resumed from most recent checkpoint. You can merge this pull request into a Git repository by running: $ git pull https://github.com/neggert/giraph trunk_resume_fixes Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/30.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #30 commit 6462d48e46d84cd6aa5ecd5817b0d057ce3a6c1f Author: NicEggert Date: 2017-03-23T20:23:03Z Set checkpointStatus to NONE in master when restarting from checkpoint. Workers already do this, so the job hangs when restarting from checkpoint while the master waits for workers to create checkpoints they're never going to create. commit 3ed8c18a3bc97c910e364bf7d48d50be25df704c Author: NicEggert Date: 2017-03-23T20:26:02Z Checkpoint on superstep 0 if checkpointing is enabled commit 74bba4573dbb77242d81352f84969b114db1cb71 Author: NicEggert Date: 2017-03-23T20:26:47Z Set unique task id for each worker attempt Previously, a worker would reuse the task id from the prior attempt. This gets propagated to the Netty client id, which makes the master think it has already processed any requests that come from that client, causing it to discard them. This obviously causes problems. > Resuming from checkpoint doesn't work > - > > Key: GIRAPH-1139 > URL: https://issues.apache.org/jira/browse/GIRAPH-1139 > Project: Giraph > Issue Type: Bug > Components: bsp >Affects Versions: 1.2.0 >Reporter: Nic Eggert > > I ran into a couple of issues when trying to get Giraph to resume from > checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker). > * If we just wrote a checkpoint, the master expects the workers to checkpoint > again, while the workers (correctly) clear the checkpointing flag. > * When workers restart, they take their task id from the partition number, > which stays the same across multiple attempts. This gets transferred to the > Netty clientId, and the server starts ignoring messages from restarted > workers because it thinks it processed them already. > I believe I've fixed these issues. I'll send a GitHub PR shortly. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1137) Remove channel probing from Netty worker thread for credit-based flow-control
[ https://issues.apache.org/jira/browse/GIRAPH-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15947619#comment-15947619 ] ASF GitHub Bot commented on GIRAPH-1137: Github user dlogothetis commented on the issue: https://github.com/apache/giraph/pull/26 @heslami, findbugs reports this: [INFO] [INFO] Synchronization performed on java.util.concurrent.atomic.AtomicInteger in org.apache.giraph.comm.flow_control.CreditBasedFlowControl$2.run() ["org.apache.giraph.comm.flow_control.CreditBasedFlowControl$2"] At CreditBasedFlowControl.java:[lines 237-256] > Remove channel probing from Netty worker thread for credit-based flow-control > - > > Key: GIRAPH-1137 > URL: https://issues.apache.org/jira/browse/GIRAPH-1137 > Project: Giraph > Issue Type: Bug >Reporter: Hassan Eslami >Assignee: Hassan Eslami > > In credit-based flow-control, sometimes, client threads (one type of Netty > worker threads used in Giraph) try to send requests to other workers. This is > bad practice for Netty and can cause Netty to mark the execution as > deadlock-prone (an example exception shown below). Client threads should only > be responsible for sending ACK/NACK messages in response to requests, and > they should do so by reuseing the channel from which they received the > request. In the current implementation, client threads may try to send > unsent/cached requests in credit-based flow control. Sending such requests > should be delegated to other threads. > WARN 2017-03-08 06:06:22,104 [netty-client-worker-3] > io.netty.util.concurrent.BlockingOperationException: > DefaultChannelPromise@2c455378(incomplete) > at > io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:383) > at > io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:157) > at io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:343) > at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:259) > at > org.apache.giraph.utils.ProgressableUtils$ChannelFutureWaitable.waitFor(ProgressableUtils.java:461) > at > org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:214) > at > org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:180) > at > org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:165) > at > org.apache.giraph.utils.ProgressableUtils.awaitChannelFuture(ProgressableUtils.java:132) > at > org.apache.giraph.comm.netty.NettyClient.getNextChannel(NettyClient.java:715) > at > org.apache.giraph.comm.netty.NettyClient.writeRequestToChannel(NettyClient.java:799) > at org.apache.giraph.comm.netty.NettyClient.doSend(NettyClient.java:789) > at > org.apache.giraph.comm.flow_control.CreditBasedFlowControl.trySendCachedRequests(CreditBasedFlowControl.java:515) > at > org.apache.giraph.comm.flow_control.CreditBasedFlowControl.messageAckReceived(CreditBasedFlowControl.java:485) > at > org.apache.giraph.comm.netty.NettyClient.messageReceived(NettyClient.java:840) > at > org.apache.giraph.comm.netty.handler.ResponseClientHandler.channelRead(ResponseClientHandler.java:87) > at > io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338) > at > io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324) > at > io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153) > at > io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338) > at > io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324) > at > org.apache.giraph.comm.netty.InboundByteCounter.channelRead(InboundByteCounter.java:89) > at > io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338) > at > io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324) > at > io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785) > at > io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:126) > at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485) > at > io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452) > at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346) > at > io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1140) Cleanup temp files in hdfs after job is done
[ https://issues.apache.org/jira/browse/GIRAPH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949473#comment-15949473 ] ASF GitHub Bot commented on GIRAPH-1140: GitHub user majakabiljo opened a pull request: https://github.com/apache/giraph/pull/32 GIRAPH-1140: Cleanup temp files in hdfs after job is done Summary: Currently we are not cleaning up temp files we create in hdfs, fix it. Test Plan: Ran a few jobs (successful, failed, killed), verified files are removed in all cases. You can merge this pull request into a Git repository by running: $ git pull https://github.com/majakabiljo/giraph zkCleanup Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/32.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #32 commit d42386e8430d0d66a49b7163e9579b9dd1b88747 Author: Maja Kabiljo Date: 2017-03-30T17:32:35Z GIRAPH-1140: Cleanup temp files in hdfs after job is done Summary: Currently we are not cleaning up temp files we create in hdfs, fix it. Test Plan: Ran a few jobs (successful, failed, killed), verified files are removed in all cases. > Cleanup temp files in hdfs after job is done > > > Key: GIRAPH-1140 > URL: https://issues.apache.org/jira/browse/GIRAPH-1140 > Project: Giraph > Issue Type: Bug >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo > > Currently we are not cleaning up temp files we create in hdfs, fix it. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1138) Don't wrap exceptions from executor service
[ https://issues.apache.org/jira/browse/GIRAPH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949737#comment-15949737 ] ASF GitHub Bot commented on GIRAPH-1138: Github user asfgit closed the pull request at: https://github.com/apache/giraph/pull/27 > Don't wrap exceptions from executor service > --- > > Key: GIRAPH-1138 > URL: https://issues.apache.org/jira/browse/GIRAPH-1138 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > In ProgressableUtils.getResultsWithNCallables we wrap exceptions from > underlying threads, making logs hard to read. We should re-throw original > exception when possible. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1140) Cleanup temp files in hdfs after job is done
[ https://issues.apache.org/jira/browse/GIRAPH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951611#comment-15951611 ] ASF GitHub Bot commented on GIRAPH-1140: Github user majakabiljo closed the pull request at: https://github.com/apache/giraph/pull/32 > Cleanup temp files in hdfs after job is done > > > Key: GIRAPH-1140 > URL: https://issues.apache.org/jira/browse/GIRAPH-1140 > Project: Giraph > Issue Type: Bug >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo > > Currently we are not cleaning up temp files we create in hdfs, fix it. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1141) Kill the job if no progress is being made
[ https://issues.apache.org/jira/browse/GIRAPH-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951612#comment-15951612 ] ASF GitHub Bot commented on GIRAPH-1141: GitHub user majakabiljo opened a pull request: https://github.com/apache/giraph/pull/33 GIRAPH-1141: Kill the job if no progress is being made Summary: Sometimes jobs can get stuck for various reasons, it's better to have an option to kill them then to keep them running holding resources. Test Plan: Ran a large job with shorter limit and verified it gets killed. Also ran normal successful job. mvn verify You can merge this pull request into a Git repository by running: $ git pull https://github.com/majakabiljo/giraph progress Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/33.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #33 commit 7b571bbbefade67b3b77273e6b291318399d810e Author: Maja Kabiljo Date: 2017-03-31T20:40:48Z GIRAPH-1141: Kill the job if no progress is being made Summary: Sometimes jobs can get stuck for various reasons, it's better to have an option to kill them then to keep them running holding resources. Test Plan: Ran a large job with shorter limit and verified it gets killed. Also ran normal successful job. mvn verify > Kill the job if no progress is being made > - > > Key: GIRAPH-1141 > URL: https://issues.apache.org/jira/browse/GIRAPH-1141 > Project: Giraph > Issue Type: New Feature >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > Sometimes jobs can get stuck for various reasons, it's better to have an > option to kill them then to keep them running holding resources. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1141) Kill the job if no progress is being made
[ https://issues.apache.org/jira/browse/GIRAPH-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951648#comment-15951648 ] ASF GitHub Bot commented on GIRAPH-1141: Github user dlogothetis commented on a diff in the pull request: https://github.com/apache/giraph/pull/33#discussion_r109250592 --- Diff: giraph-core/src/main/java/org/apache/giraph/job/CombinedWorkerProgress.java --- @@ -204,4 +215,19 @@ public String toString() { } return sb.toString(); } + + /** + * Check if this instance made progress from another instance + * + * @param lastProgress Instance to compare with + * @return True iff progress was made + */ + public boolean madeProgressFrom(CombinedWorkerProgress lastProgress) { --- End diff -- Why not use the underlying raw numbers instead of the string? For instance, small changes in memory may not really mean progress. > Kill the job if no progress is being made > - > > Key: GIRAPH-1141 > URL: https://issues.apache.org/jira/browse/GIRAPH-1141 > Project: Giraph > Issue Type: New Feature >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > Sometimes jobs can get stuck for various reasons, it's better to have an > option to kill them then to keep them running holding resources. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1141) Kill the job if no progress is being made
[ https://issues.apache.org/jira/browse/GIRAPH-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951657#comment-15951657 ] ASF GitHub Bot commented on GIRAPH-1141: Github user majakabiljo commented on a diff in the pull request: https://github.com/apache/giraph/pull/33#discussion_r109251665 --- Diff: giraph-core/src/main/java/org/apache/giraph/job/CombinedWorkerProgress.java --- @@ -204,4 +215,19 @@ public String toString() { } return sb.toString(); } + + /** + * Check if this instance made progress from another instance + * + * @param lastProgress Instance to compare with + * @return True iff progress was made + */ + public boolean madeProgressFrom(CombinedWorkerProgress lastProgress) { --- End diff -- That's why I separated getProgressString from toString, to only contain actual progress. For different supersteps we are looking at different numbers so this seemed the easiest to compare instead of having all the if-s. > Kill the job if no progress is being made > - > > Key: GIRAPH-1141 > URL: https://issues.apache.org/jira/browse/GIRAPH-1141 > Project: Giraph > Issue Type: New Feature >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > Sometimes jobs can get stuck for various reasons, it's better to have an > option to kill them then to keep them running holding resources. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1141) Kill the job if no progress is being made
[ https://issues.apache.org/jira/browse/GIRAPH-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953731#comment-15953731 ] ASF GitHub Bot commented on GIRAPH-1141: Github user dlogothetis commented on the issue: https://github.com/apache/giraph/pull/33 i missed this detail. looks ok to me. > Kill the job if no progress is being made > - > > Key: GIRAPH-1141 > URL: https://issues.apache.org/jira/browse/GIRAPH-1141 > Project: Giraph > Issue Type: New Feature >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > Sometimes jobs can get stuck for various reasons, it's better to have an > option to kill them then to keep them running holding resources. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1141) Kill the job if no progress is being made
[ https://issues.apache.org/jira/browse/GIRAPH-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955369#comment-15955369 ] ASF GitHub Bot commented on GIRAPH-1141: Github user asfgit closed the pull request at: https://github.com/apache/giraph/pull/33 > Kill the job if no progress is being made > - > > Key: GIRAPH-1141 > URL: https://issues.apache.org/jira/browse/GIRAPH-1141 > Project: Giraph > Issue Type: New Feature >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > Sometimes jobs can get stuck for various reasons, it's better to have an > option to kill them then to keep them running holding resources. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work
[ https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963189#comment-15963189 ] ASF GitHub Bot commented on GIRAPH-1139: Github user neggert commented on the issue: https://github.com/apache/giraph/pull/30 Could someone please take a look at this? @majakabiljo @edunov > Resuming from checkpoint doesn't work > - > > Key: GIRAPH-1139 > URL: https://issues.apache.org/jira/browse/GIRAPH-1139 > Project: Giraph > Issue Type: Bug > Components: bsp >Affects Versions: 1.2.0 >Reporter: Nic Eggert > > I ran into a couple of issues when trying to get Giraph to resume from > checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker). > * If we just wrote a checkpoint, the master expects the workers to checkpoint > again, while the workers (correctly) clear the checkpointing flag. > * When workers restart, they take their task id from the partition number, > which stays the same across multiple attempts. This gets transferred to the > Netty clientId, and the server starts ignoring messages from restarted > workers because it thinks it processed them already. > I believe I've fixed these issues. I'll send a GitHub PR shortly. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work
[ https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969747#comment-15969747 ] ASF GitHub Bot commented on GIRAPH-1139: Github user edunov commented on the issue: https://github.com/apache/giraph/pull/30 Hi Nic, thank you for fixing this and sorry for the delay. I'll take a look at this diff other the weekend > Resuming from checkpoint doesn't work > - > > Key: GIRAPH-1139 > URL: https://issues.apache.org/jira/browse/GIRAPH-1139 > Project: Giraph > Issue Type: Bug > Components: bsp >Affects Versions: 1.2.0 >Reporter: Nic Eggert > > I ran into a couple of issues when trying to get Giraph to resume from > checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker). > * If we just wrote a checkpoint, the master expects the workers to checkpoint > again, while the workers (correctly) clear the checkpointing flag. > * When workers restart, they take their task id from the partition number, > which stays the same across multiple attempts. This gets transferred to the > Netty clientId, and the server starts ignoring messages from restarted > workers because it thinks it processed them already. > I believe I've fixed these issues. I'll send a GitHub PR shortly. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work
[ https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971306#comment-15971306 ] ASF GitHub Bot commented on GIRAPH-1139: Github user edunov commented on a diff in the pull request: https://github.com/apache/giraph/pull/30#discussion_r111767933 --- Diff: giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java --- @@ -1734,7 +1735,7 @@ private CheckpointStatus getCheckpointStatus(long superstep) { if (checkpointFrequency == 0) { return CheckpointStatus.NONE; } -long firstCheckpoint = INPUT_SUPERSTEP + 1 + checkpointFrequency; +long firstCheckpoint = INPUT_SUPERSTEP + 1; --- End diff -- What is the reason for changing this? Do you want it to always do checkpoint after the first superstep? > Resuming from checkpoint doesn't work > - > > Key: GIRAPH-1139 > URL: https://issues.apache.org/jira/browse/GIRAPH-1139 > Project: Giraph > Issue Type: Bug > Components: bsp >Affects Versions: 1.2.0 >Reporter: Nic Eggert > > I ran into a couple of issues when trying to get Giraph to resume from > checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker). > * If we just wrote a checkpoint, the master expects the workers to checkpoint > again, while the workers (correctly) clear the checkpointing flag. > * When workers restart, they take their task id from the partition number, > which stays the same across multiple attempts. This gets transferred to the > Netty clientId, and the server starts ignoring messages from restarted > workers because it thinks it processed them already. > I believe I've fixed these issues. I'll send a GitHub PR shortly. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work
[ https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971315#comment-15971315 ] ASF GitHub Bot commented on GIRAPH-1139: Github user neggert commented on the issue: https://github.com/apache/giraph/pull/30 I'm finding that roughly half of my ~2 hour job time is spent just loading the graph. Resuming from checkpoint lets me skip that step if something fails. This is also why I want to make sure that Giraph checkpoints before starting superstep 0. (It used to work this way, and it's documented in the Giraph book.) > Resuming from checkpoint doesn't work > - > > Key: GIRAPH-1139 > URL: https://issues.apache.org/jira/browse/GIRAPH-1139 > Project: Giraph > Issue Type: Bug > Components: bsp >Affects Versions: 1.2.0 >Reporter: Nic Eggert > > I ran into a couple of issues when trying to get Giraph to resume from > checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker). > * If we just wrote a checkpoint, the master expects the workers to checkpoint > again, while the workers (correctly) clear the checkpointing flag. > * When workers restart, they take their task id from the partition number, > which stays the same across multiple attempts. This gets transferred to the > Netty clientId, and the server starts ignoring messages from restarted > workers because it thinks it processed them already. > I believe I've fixed these issues. I'll send a GitHub PR shortly. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work
[ https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971326#comment-15971326 ] ASF GitHub Bot commented on GIRAPH-1139: Github user neggert commented on a diff in the pull request: https://github.com/apache/giraph/pull/30#discussion_r111771953 --- Diff: giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java --- @@ -1734,7 +1735,7 @@ private CheckpointStatus getCheckpointStatus(long superstep) { if (checkpointFrequency == 0) { return CheckpointStatus.NONE; } -long firstCheckpoint = INPUT_SUPERSTEP + 1 + checkpointFrequency; +long firstCheckpoint = INPUT_SUPERSTEP + 1; --- End diff -- This will actually checkpoint before running superstep 0. You're basically checkpointing the input loading work. > Resuming from checkpoint doesn't work > - > > Key: GIRAPH-1139 > URL: https://issues.apache.org/jira/browse/GIRAPH-1139 > Project: Giraph > Issue Type: Bug > Components: bsp >Affects Versions: 1.2.0 >Reporter: Nic Eggert > > I ran into a couple of issues when trying to get Giraph to resume from > checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker). > * If we just wrote a checkpoint, the master expects the workers to checkpoint > again, while the workers (correctly) clear the checkpointing flag. > * When workers restart, they take their task id from the partition number, > which stays the same across multiple attempts. This gets transferred to the > Netty clientId, and the server starts ignoring messages from restarted > workers because it thinks it processed them already. > I believe I've fixed these issues. I'll send a GitHub PR shortly. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1143) org.apache.hadoop.mapreduce.JobID.forName job id parsing fails on non legacy Hadoop clusters
[ https://issues.apache.org/jira/browse/GIRAPH-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971460#comment-15971460 ] ASF GitHub Bot commented on GIRAPH-1143: GitHub user aching opened a pull request: https://github.com/apache/giraph/pull/35 GIRAPH-1143 Handle non-legacy Hadoop job errors nicely. You can merge this pull request into a Git repository by running: $ git pull https://github.com/aching/giraph giraph-1143 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/35.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #35 commit cad037526c7b41ff46c4de824dac205c349a20ba Author: Avery Ching Date: 2017-04-17T18:30:17Z GIRAPH-1143 > org.apache.hadoop.mapreduce.JobID.forName job id parsing fails on non legacy > Hadoop clusters > - > > Key: GIRAPH-1143 > URL: https://issues.apache.org/jira/browse/GIRAPH-1143 > Project: Giraph > Issue Type: Bug >Reporter: Avery Ching >Assignee: Avery Ching >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1143) org.apache.hadoop.mapreduce.JobID.forName job id parsing fails on non legacy Hadoop clusters
[ https://issues.apache.org/jira/browse/GIRAPH-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971506#comment-15971506 ] ASF GitHub Bot commented on GIRAPH-1143: Github user aching commented on the issue: https://github.com/apache/giraph/pull/35 Merged. > org.apache.hadoop.mapreduce.JobID.forName job id parsing fails on non legacy > Hadoop clusters > - > > Key: GIRAPH-1143 > URL: https://issues.apache.org/jira/browse/GIRAPH-1143 > Project: Giraph > Issue Type: Bug >Reporter: Avery Ching >Assignee: Avery Ching >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1143) org.apache.hadoop.mapreduce.JobID.forName job id parsing fails on non legacy Hadoop clusters
[ https://issues.apache.org/jira/browse/GIRAPH-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971505#comment-15971505 ] ASF GitHub Bot commented on GIRAPH-1143: Github user aching closed the pull request at: https://github.com/apache/giraph/pull/35 > org.apache.hadoop.mapreduce.JobID.forName job id parsing fails on non legacy > Hadoop clusters > - > > Key: GIRAPH-1143 > URL: https://issues.apache.org/jira/browse/GIRAPH-1143 > Project: Giraph > Issue Type: Bug >Reporter: Avery Ching >Assignee: Avery Ching >Priority: Trivial > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work
[ https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15979370#comment-15979370 ] ASF GitHub Bot commented on GIRAPH-1139: Github user majakabiljo commented on the issue: https://github.com/apache/giraph/pull/30 Can we get rid of getHostnamePartitionId() to avoid incorrectly using it in the future? I see various other places where taskPartition is used for identifier, do any of them need to be updated too? > Resuming from checkpoint doesn't work > - > > Key: GIRAPH-1139 > URL: https://issues.apache.org/jira/browse/GIRAPH-1139 > Project: Giraph > Issue Type: Bug > Components: bsp >Affects Versions: 1.2.0 >Reporter: Nic Eggert > > I ran into a couple of issues when trying to get Giraph to resume from > checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker). > * If we just wrote a checkpoint, the master expects the workers to checkpoint > again, while the workers (correctly) clear the checkpointing flag. > * When workers restart, they take their task id from the partition number, > which stays the same across multiple attempts. This gets transferred to the > Netty clientId, and the server starts ignoring messages from restarted > workers because it thinks it processed them already. > I believe I've fixed these issues. I'll send a GitHub PR shortly. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work
[ https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15980471#comment-15980471 ] ASF GitHub Bot commented on GIRAPH-1139: Github user neggert commented on the issue: https://github.com/apache/giraph/pull/30 There actually is [one instance][1] where I think it's okay to use `getHostnamePartitionId`. This happens in `BspServiceMaster.becomeMaster` when creating a master bid in ZK, before any `TaskInfo` instance is created. This does make me realize, though, that I need to make the same change to how the task id is set in `BspServiceMaster`. What about just changing how `taskPartition` is set in `BspService`, like so? this.taskPartition = (int)getApplicationAttempt() * conf.getMaxWorkers() + getTaskPartition(); I think this is actually the minimal code change to fix the issue. I don't see anywhere in the code that actually cares about the task partition as anything other than a unique identifier. [1]: https://github.com/apache/giraph/blob/trunk/giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java#L805 > Resuming from checkpoint doesn't work > - > > Key: GIRAPH-1139 > URL: https://issues.apache.org/jira/browse/GIRAPH-1139 > Project: Giraph > Issue Type: Bug > Components: bsp >Affects Versions: 1.2.0 >Reporter: Nic Eggert > > I ran into a couple of issues when trying to get Giraph to resume from > checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker). > * If we just wrote a checkpoint, the master expects the workers to checkpoint > again, while the workers (correctly) clear the checkpointing flag. > * When workers restart, they take their task id from the partition number, > which stays the same across multiple attempts. This gets transferred to the > Netty clientId, and the server starts ignoring messages from restarted > workers because it thinks it processed them already. > I believe I've fixed these issues. I'll send a GitHub PR shortly. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work
[ https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981327#comment-15981327 ] ASF GitHub Bot commented on GIRAPH-1139: Github user neggert commented on the issue: https://github.com/apache/giraph/pull/30 This is ready for another look. I've replaced partition id with task id. The only place that actually needs a partition id is logging missing workers in `BspServiceMaster`. In that case, it's easy enough to recover the partition id by taking the task id modulo the number of workers. > Resuming from checkpoint doesn't work > - > > Key: GIRAPH-1139 > URL: https://issues.apache.org/jira/browse/GIRAPH-1139 > Project: Giraph > Issue Type: Bug > Components: bsp >Affects Versions: 1.2.0 >Reporter: Nic Eggert > > I ran into a couple of issues when trying to get Giraph to resume from > checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker). > * If we just wrote a checkpoint, the master expects the workers to checkpoint > again, while the workers (correctly) clear the checkpointing flag. > * When workers restart, they take their task id from the partition number, > which stays the same across multiple attempts. This gets transferred to the > Netty clientId, and the server starts ignoring messages from restarted > workers because it thinks it processed them already. > I believe I've fixed these issues. I'll send a GitHub PR shortly. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1146) Keep track of number of supersteps when possible
[ https://issues.apache.org/jira/browse/GIRAPH-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997364#comment-15997364 ] ASF GitHub Bot commented on GIRAPH-1146: GitHub user majakabiljo opened a pull request: https://github.com/apache/giraph/pull/36 GIRAPH-1146: Keep track of number of supersteps when possible Summary: In many cases we know how many supersteps are there going to be. We can keep track of it and log it with progress. Test Plan: Ran a job, example log line: Data from 3 workers - Compute superstep 5 (out of 6): 171824 out of 1304814 vertices computed; 19 out of 252 partitions computed You can merge this pull request into a Git repository by running: $ git pull https://github.com/majakabiljo/giraph numSupersteps Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/36.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #36 commit 4a4d69b674d327f3cd092036a7e3fec4ffb4c2ce Author: Maja Kabiljo Date: 2017-05-04T20:07:22Z GIRAPH-1146: Keep track of number of supersteps when possible Summary: In many cases we know how many supersteps are there going to be. We can keep track of it and log it with progress. Test Plan: Ran a job, example log line: Data from 3 workers - Compute superstep 5 (out of 6): 171824 out of 1304814 vertices computed; 19 out of 252 partitions computed > Keep track of number of supersteps when possible > > > Key: GIRAPH-1146 > URL: https://issues.apache.org/jira/browse/GIRAPH-1146 > Project: Giraph > Issue Type: New Feature >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > In many cases we know how many supersteps are there going to be. We can keep > track of it and log it with progress. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1138) Don't wrap exceptions from executor service
[ https://issues.apache.org/jira/browse/GIRAPH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997373#comment-15997373 ] ASF GitHub Bot commented on GIRAPH-1138: GitHub user majakabiljo opened a pull request: https://github.com/apache/giraph/pull/37 [GIRAPH-1138] Don't wrap exceptions from executor service Summary: In ProgressableUtils.getResultsWithNCallables we wrap exceptions from underlying threads, making logs hard to read. We should re-throw original exception when possible. (accidentally closed #27) Test Plan: Ran a job which fails in one of input threads before and after change, verified exception is clear now You can merge this pull request into a Git repository by running: $ git pull https://github.com/majakabiljo/giraph exceptions2 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/37.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #37 commit c859833de5a28e580b38dc46a8b371cd82b6b6d0 Author: Maja Kabiljo Date: 2017-05-04T20:15:44Z [GIRAPH-1138] Don't wrap exceptions from executor service Summary: In ProgressableUtils.getResultsWithNCallables we wrap exceptions from underlying threads, making logs hard to read. We should re-throw original exception when possible. (accidentally closed #27) Test Plan: Ran a job which fails in one of input threads before and after change, verified exception is clear now > Don't wrap exceptions from executor service > --- > > Key: GIRAPH-1138 > URL: https://issues.apache.org/jira/browse/GIRAPH-1138 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > In ProgressableUtils.getResultsWithNCallables we wrap exceptions from > underlying threads, making logs hard to read. We should re-throw original > exception when possible. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1146) Keep track of number of supersteps when possible
[ https://issues.apache.org/jira/browse/GIRAPH-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998634#comment-15998634 ] ASF GitHub Bot commented on GIRAPH-1146: Github user ikabiljo commented on a diff in the pull request: https://github.com/apache/giraph/pull/36#discussion_r115049370 --- Diff: giraph-block-app/src/main/java/org/apache/giraph/block_app/framework/block/PieceCount.java --- @@ -0,0 +1,86 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.giraph.block_app.framework.block; + +import com.google.common.base.Objects; + +/** + * Number of pieces + */ +public class PieceCount { + private boolean known; + private int count; + + public PieceCount(int count) { +known = true; +this.count = count; + } + + private PieceCount() { +known = false; + } + + public static PieceCount createUnknownCount() { +return new PieceCount(); + } + + + public PieceCount add(PieceCount other) { +if (!this.known || !other.known) { + known = false; +} else { + count += other.count; +} +return this; + } + + public PieceCount multiply(int value) { +count *= value; +return this; + } + + public int getCount() { +return known ? count : Integer.MAX_VALUE; --- End diff -- this might easily lead to overflow if anything is done with this number. You should either fatal (better), or return 1M or something here. > Keep track of number of supersteps when possible > > > Key: GIRAPH-1146 > URL: https://issues.apache.org/jira/browse/GIRAPH-1146 > Project: Giraph > Issue Type: New Feature >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > In many cases we know how many supersteps are there going to be. We can keep > track of it and log it with progress. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1146) Keep track of number of supersteps when possible
[ https://issues.apache.org/jira/browse/GIRAPH-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998637#comment-15998637 ] ASF GitHub Bot commented on GIRAPH-1146: Github user ikabiljo commented on a diff in the pull request: https://github.com/apache/giraph/pull/36#discussion_r115049666 --- Diff: giraph-block-app/src/main/java/org/apache/giraph/block_app/framework/piece/delegate/DelegatePiece.java --- @@ -249,6 +250,15 @@ public void forAllPossiblePieces(Consumer consumer) { } } + @Override + public PieceCount getPieceCount() { +PieceCount ret = new PieceCount(0); --- End diff -- this should be 1, this executes all pieces simultaneously. > Keep track of number of supersteps when possible > > > Key: GIRAPH-1146 > URL: https://issues.apache.org/jira/browse/GIRAPH-1146 > Project: Giraph > Issue Type: New Feature >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > In many cases we know how many supersteps are there going to be. We can keep > track of it and log it with progress. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1146) Keep track of number of supersteps when possible
[ https://issues.apache.org/jira/browse/GIRAPH-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998635#comment-15998635 ] ASF GitHub Bot commented on GIRAPH-1146: Github user ikabiljo commented on a diff in the pull request: https://github.com/apache/giraph/pull/36#discussion_r115049981 --- Diff: giraph-block-app/src/main/java/org/apache/giraph/block_app/framework/BlockUtils.java --- @@ -147,6 +148,11 @@ public static void initAndCheckConfig(GiraphConfiguration conf) { checkBlockTypes( executionBlock, blockFactory.createExecutionStage(immConf), immConf); +PieceCount pieceCount = executionBlock.getPieceCount(); +if (pieceCount.isKnown()) { + GiraphConstants.SUPERSTEP_COUNT.set(conf, pieceCount.getCount()); --- End diff -- shouldn't it be pieceCount.getCount() + 1 ? > Keep track of number of supersteps when possible > > > Key: GIRAPH-1146 > URL: https://issues.apache.org/jira/browse/GIRAPH-1146 > Project: Giraph > Issue Type: New Feature >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > In many cases we know how many supersteps are there going to be. We can keep > track of it and log it with progress. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1146) Keep track of number of supersteps when possible
[ https://issues.apache.org/jira/browse/GIRAPH-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998636#comment-15998636 ] ASF GitHub Bot commented on GIRAPH-1146: Github user ikabiljo commented on a diff in the pull request: https://github.com/apache/giraph/pull/36#discussion_r115049146 --- Diff: giraph-block-app/src/main/java/org/apache/giraph/block_app/framework/block/EmptyBlock.java --- @@ -36,4 +36,9 @@ @Override public void forAllPossiblePieces(Consumer consumer) { } + + @Override + public PieceCount getPieceCount() { +return new PieceCount(1); --- End diff -- should be 0 > Keep track of number of supersteps when possible > > > Key: GIRAPH-1146 > URL: https://issues.apache.org/jira/browse/GIRAPH-1146 > Project: Giraph > Issue Type: New Feature >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > In many cases we know how many supersteps are there going to be. We can keep > track of it and log it with progress. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1146) Keep track of number of supersteps when possible
[ https://issues.apache.org/jira/browse/GIRAPH-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998670#comment-15998670 ] ASF GitHub Bot commented on GIRAPH-1146: Github user majakabiljo commented on a diff in the pull request: https://github.com/apache/giraph/pull/36#discussion_r115053845 --- Diff: giraph-block-app/src/main/java/org/apache/giraph/block_app/framework/block/PieceCount.java --- @@ -0,0 +1,86 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.giraph.block_app.framework.block; + +import com.google.common.base.Objects; + +/** + * Number of pieces + */ +public class PieceCount { + private boolean known; + private int count; + + public PieceCount(int count) { +known = true; +this.count = count; + } + + private PieceCount() { +known = false; + } + + public static PieceCount createUnknownCount() { +return new PieceCount(); + } + + + public PieceCount add(PieceCount other) { +if (!this.known || !other.known) { + known = false; +} else { + count += other.count; +} +return this; + } + + public PieceCount multiply(int value) { +count *= value; +return this; + } + + public int getCount() { +return known ? count : Integer.MAX_VALUE; --- End diff -- Good point, I'll throw instead > Keep track of number of supersteps when possible > > > Key: GIRAPH-1146 > URL: https://issues.apache.org/jira/browse/GIRAPH-1146 > Project: Giraph > Issue Type: New Feature >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > In many cases we know how many supersteps are there going to be. We can keep > track of it and log it with progress. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1146) Keep track of number of supersteps when possible
[ https://issues.apache.org/jira/browse/GIRAPH-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998668#comment-15998668 ] ASF GitHub Bot commented on GIRAPH-1146: Github user majakabiljo commented on a diff in the pull request: https://github.com/apache/giraph/pull/36#discussion_r115053732 --- Diff: giraph-block-app/src/main/java/org/apache/giraph/block_app/framework/BlockUtils.java --- @@ -147,6 +148,11 @@ public static void initAndCheckConfig(GiraphConfiguration conf) { checkBlockTypes( executionBlock, blockFactory.createExecutionStage(immConf), immConf); +PieceCount pieceCount = executionBlock.getPieceCount(); +if (pieceCount.isKnown()) { + GiraphConstants.SUPERSTEP_COUNT.set(conf, pieceCount.getCount()); --- End diff -- There will be X+1 supersteps, but they are going to be supersteps 0..X. Actually I can make +1 here and then -1 in the logging part to keep it clear. > Keep track of number of supersteps when possible > > > Key: GIRAPH-1146 > URL: https://issues.apache.org/jira/browse/GIRAPH-1146 > Project: Giraph > Issue Type: New Feature >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > In many cases we know how many supersteps are there going to be. We can keep > track of it and log it with progress. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1146) Keep track of number of supersteps when possible
[ https://issues.apache.org/jira/browse/GIRAPH-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16003732#comment-16003732 ] ASF GitHub Bot commented on GIRAPH-1146: Github user ikabiljo commented on a diff in the pull request: https://github.com/apache/giraph/pull/36#discussion_r115624215 --- Diff: giraph-block-app/src/main/java/org/apache/giraph/block_app/framework/piece/delegate/DelegatePiece.java --- @@ -249,6 +250,11 @@ public void forAllPossiblePieces(Consumer consumer) { } } + @Override --- End diff -- you don't need to extend, default is 1 for any piece :) > Keep track of number of supersteps when possible > > > Key: GIRAPH-1146 > URL: https://issues.apache.org/jira/browse/GIRAPH-1146 > Project: Giraph > Issue Type: New Feature >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > In many cases we know how many supersteps are there going to be. We can keep > track of it and log it with progress. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1146) Keep track of number of supersteps when possible
[ https://issues.apache.org/jira/browse/GIRAPH-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004770#comment-16004770 ] ASF GitHub Bot commented on GIRAPH-1146: Github user asfgit closed the pull request at: https://github.com/apache/giraph/pull/36 > Keep track of number of supersteps when possible > > > Key: GIRAPH-1146 > URL: https://issues.apache.org/jira/browse/GIRAPH-1146 > Project: Giraph > Issue Type: New Feature >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > In many cases we know how many supersteps are there going to be. We can keep > track of it and log it with progress. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1138) Don't wrap exceptions from executor service
[ https://issues.apache.org/jira/browse/GIRAPH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004955#comment-16004955 ] ASF GitHub Bot commented on GIRAPH-1138: Github user dlogothetis commented on a diff in the pull request: https://github.com/apache/giraph/pull/37#discussion_r115788093 --- Diff: giraph-core/src/main/java/org/apache/giraph/utils/ProgressableUtils.java --- @@ -270,8 +270,16 @@ public static void awaitSemaphorePermits(final Semaphore semaphore, // Try to get result from the future result = entry.getValue().get( MSEC_TO_WAIT_ON_EACH_FUTURE, TimeUnit.MILLISECONDS); -} catch (InterruptedException | ExecutionException e) { - throw new IllegalStateException("Exception occurred", e); +} catch (InterruptedException e) { + throw new IllegalStateException("Interrupted", e); +} catch (ExecutionException e) { + // Execution exception wraps the actual cause + if (e.getCause() instanceof RuntimeException) { --- End diff -- Is it ever possible that e.getCause() is null? > Don't wrap exceptions from executor service > --- > > Key: GIRAPH-1138 > URL: https://issues.apache.org/jira/browse/GIRAPH-1138 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > In ProgressableUtils.getResultsWithNCallables we wrap exceptions from > underlying threads, making logs hard to read. We should re-throw original > exception when possible. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1138) Don't wrap exceptions from executor service
[ https://issues.apache.org/jira/browse/GIRAPH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004968#comment-16004968 ] ASF GitHub Bot commented on GIRAPH-1138: Github user dlogothetis commented on a diff in the pull request: https://github.com/apache/giraph/pull/37#discussion_r115789355 --- Diff: giraph-core/src/main/java/org/apache/giraph/utils/ProgressableUtils.java --- @@ -270,8 +270,16 @@ public static void awaitSemaphorePermits(final Semaphore semaphore, // Try to get result from the future result = entry.getValue().get( MSEC_TO_WAIT_ON_EACH_FUTURE, TimeUnit.MILLISECONDS); -} catch (InterruptedException | ExecutionException e) { - throw new IllegalStateException("Exception occurred", e); +} catch (InterruptedException e) { + throw new IllegalStateException("Interrupted", e); +} catch (ExecutionException e) { + // Execution exception wraps the actual cause + if (e.getCause() instanceof RuntimeException) { --- End diff -- Apparently a null check is not needed, http://stackoverflow.com/questions/2950319/is-null-check-needed-before-calling-instanceof > Don't wrap exceptions from executor service > --- > > Key: GIRAPH-1138 > URL: https://issues.apache.org/jira/browse/GIRAPH-1138 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > In ProgressableUtils.getResultsWithNCallables we wrap exceptions from > underlying threads, making logs hard to read. We should re-throw original > exception when possible. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1138) Don't wrap exceptions from executor service
[ https://issues.apache.org/jira/browse/GIRAPH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004970#comment-16004970 ] ASF GitHub Bot commented on GIRAPH-1138: Github user dlogothetis commented on the issue: https://github.com/apache/giraph/pull/37 Looks good to me. > Don't wrap exceptions from executor service > --- > > Key: GIRAPH-1138 > URL: https://issues.apache.org/jira/browse/GIRAPH-1138 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > In ProgressableUtils.getResultsWithNCallables we wrap exceptions from > underlying threads, making logs hard to read. We should re-throw original > exception when possible. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1138) Don't wrap exceptions from executor service
[ https://issues.apache.org/jira/browse/GIRAPH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004999#comment-16004999 ] ASF GitHub Bot commented on GIRAPH-1138: Github user asfgit closed the pull request at: https://github.com/apache/giraph/pull/37 > Don't wrap exceptions from executor service > --- > > Key: GIRAPH-1138 > URL: https://issues.apache.org/jira/browse/GIRAPH-1138 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > In ProgressableUtils.getResultsWithNCallables we wrap exceptions from > underlying threads, making logs hard to read. We should re-throw original > exception when possible. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1147) Store timestamps when various fractions of input were done
[ https://issues.apache.org/jira/browse/GIRAPH-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015991#comment-16015991 ] ASF GitHub Bot commented on GIRAPH-1147: GitHub user majakabiljo opened a pull request: https://github.com/apache/giraph/pull/38 [GIRAPH-1147] Store timestamps when various fractions of input were done Summary: In order to evaluate how read stragglers affect job performance, add a way to expose timestamps when various fractions of input were done reading through counters. Test Plan: Ran a big job and verified counters are set correctly. You can merge this pull request into a Git repository by running: $ git pull https://github.com/majakabiljo/giraph inputCounters Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/38.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #38 commit 56305ecfeb223313fc63e331cb815bfdc2430731 Author: Maja Kabiljo Date: 2017-05-18T15:30:05Z [GIRAPH-1147] Store timestamps when various fractions of input were done Summary: In order to evaluate how read stragglers affect job performance, add a way to expose timestamps when various fractions of input were done reading through counters. Test Plan: Ran a big job and verified counters are set correctly. > Store timestamps when various fractions of input were done > -- > > Key: GIRAPH-1147 > URL: https://issues.apache.org/jira/browse/GIRAPH-1147 > Project: Giraph > Issue Type: New Feature >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > In order to evaluate how read stragglers affect job performance, add a way to > expose timestamps when various fractions of input were done reading through > counters. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1147) Store timestamps when various fractions of input were done
[ https://issues.apache.org/jira/browse/GIRAPH-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016531#comment-16016531 ] ASF GitHub Bot commented on GIRAPH-1147: Github user dlogothetis commented on the issue: https://github.com/apache/giraph/pull/38 As opposed to timestamps why not set the counters to the time passed between the different fractions? That's going to be easier to parse quickly. > Store timestamps when various fractions of input were done > -- > > Key: GIRAPH-1147 > URL: https://issues.apache.org/jira/browse/GIRAPH-1147 > Project: Giraph > Issue Type: New Feature >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > In order to evaluate how read stragglers affect job performance, add a way to > expose timestamps when various fractions of input were done reading through > counters. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1147) Store timestamps when various fractions of input were done
[ https://issues.apache.org/jira/browse/GIRAPH-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016535#comment-16016535 ] ASF GitHub Bot commented on GIRAPH-1147: Github user dlogothetis commented on a diff in the pull request: https://github.com/apache/giraph/pull/38#discussion_r117364342 --- Diff: giraph-core/src/main/java/org/apache/giraph/master/input/MasterInputSplitsHandler.java --- @@ -56,16 +69,39 @@ /** Latches to say when one input splits type is ready to be accessed */ private Map latchesMap = new EnumMap<>(InputType.class); + /** Context for accessing counters */ + private final Mapper.Context context; + /** How many splits per type are there total */ + private final Map numSplitsPerType = + new EnumMap<>(InputType.class); + /** How many splits per type have been read so far */ + private final Map numSplitsReadPerType = + new EnumMap<>(InputType.class); + /** + * Store in counters timestamps when we finished reading + * these fractions of input + */ + private final double[] doneFractionsToStoreInCoutners; --- End diff -- Typo in field name. > Store timestamps when various fractions of input were done > -- > > Key: GIRAPH-1147 > URL: https://issues.apache.org/jira/browse/GIRAPH-1147 > Project: Giraph > Issue Type: New Feature >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > In order to evaluate how read stragglers affect job performance, add a way to > expose timestamps when various fractions of input were done reading through > counters. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1147) Store timestamps when various fractions of input were done
[ https://issues.apache.org/jira/browse/GIRAPH-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016604#comment-16016604 ] ASF GitHub Bot commented on GIRAPH-1147: Github user majakabiljo commented on the issue: https://github.com/apache/giraph/pull/38 Updated with comments and tested again > Store timestamps when various fractions of input were done > -- > > Key: GIRAPH-1147 > URL: https://issues.apache.org/jira/browse/GIRAPH-1147 > Project: Giraph > Issue Type: New Feature >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > In order to evaluate how read stragglers affect job performance, add a way to > expose timestamps when various fractions of input were done reading through > counters. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1147) Store timestamps when various fractions of input were done
[ https://issues.apache.org/jira/browse/GIRAPH-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016675#comment-16016675 ] ASF GitHub Bot commented on GIRAPH-1147: Github user dlogothetis commented on the issue: https://github.com/apache/giraph/pull/38 Looks ok to me. Btw, did you squash the commits? You don't need to, they are squashed automatically when the pull request is merged. And we don't loose the diff between updates. > Store timestamps when various fractions of input were done > -- > > Key: GIRAPH-1147 > URL: https://issues.apache.org/jira/browse/GIRAPH-1147 > Project: Giraph > Issue Type: New Feature >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > In order to evaluate how read stragglers affect job performance, add a way to > expose timestamps when various fractions of input were done reading through > counters. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1147) Store timestamps when various fractions of input were done
[ https://issues.apache.org/jira/browse/GIRAPH-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016742#comment-16016742 ] ASF GitHub Bot commented on GIRAPH-1147: Github user majakabiljo commented on the issue: https://github.com/apache/giraph/pull/38 Ah didn't know, will do in the future > Store timestamps when various fractions of input were done > -- > > Key: GIRAPH-1147 > URL: https://issues.apache.org/jira/browse/GIRAPH-1147 > Project: Giraph > Issue Type: New Feature >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > In order to evaluate how read stragglers affect job performance, add a way to > expose timestamps when various fractions of input were done reading through > counters. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1147) Store timestamps when various fractions of input were done
[ https://issues.apache.org/jira/browse/GIRAPH-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016758#comment-16016758 ] ASF GitHub Bot commented on GIRAPH-1147: Github user asfgit closed the pull request at: https://github.com/apache/giraph/pull/38 > Store timestamps when various fractions of input were done > -- > > Key: GIRAPH-1147 > URL: https://issues.apache.org/jira/browse/GIRAPH-1147 > Project: Giraph > Issue Type: New Feature >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo >Priority: Minor > > In order to evaluate how read stragglers affect job performance, add a way to > expose timestamps when various fractions of input were done reading through > counters. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components
[ https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029759#comment-16029759 ] ASF GitHub Bot commented on GIRAPH-1148: GitHub user majakabiljo opened a pull request: https://github.com/apache/giraph/pull/39 [GIRAPH-1148] Connected components - make calculate sizes work with l… …arge number of components Summary: Currently if we have a graph with large number of connected components, calculating connected components sizes fails because reducer becomes too large. Use array of handles instead. Test Plan: Successfully ran the job which was failing without this change You can merge this pull request into a Git repository by running: $ git pull https://github.com/majakabiljo/giraph cc Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/39.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #39 commit 0b9ef9415a67737122e6fb6caeca5307df9632a1 Author: Maja Kabiljo Date: 2017-05-30T17:10:17Z [GIRAPH-1148] Connected components - make calculate sizes work with large number of components Summary: Currently if we have a graph with large number of connected components, calculating connected components sizes fails because reducer becomes too large. Use array of handles instead. Test Plan: Successfully ran the job which was failing without this change > Connected components - make calculate sizes work with large number of > components > > > Key: GIRAPH-1148 > URL: https://issues.apache.org/jira/browse/GIRAPH-1148 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo > > Currently if we have a graph with large number of connected components, > calculating connected components sizes fails because reducer becomes too > large. Use array of handles instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work
[ https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16031358#comment-16031358 ] ASF GitHub Bot commented on GIRAPH-1139: Github user neggert commented on the issue: https://github.com/apache/giraph/pull/30 ping @edunov @majakabiljo > Resuming from checkpoint doesn't work > - > > Key: GIRAPH-1139 > URL: https://issues.apache.org/jira/browse/GIRAPH-1139 > Project: Giraph > Issue Type: Bug > Components: bsp >Affects Versions: 1.2.0 >Reporter: Nic Eggert > > I ran into a couple of issues when trying to get Giraph to resume from > checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker). > * If we just wrote a checkpoint, the master expects the workers to checkpoint > again, while the workers (correctly) clear the checkpointing flag. > * When workers restart, they take their task id from the partition number, > which stays the same across multiple attempts. This gets transferred to the > Netty clientId, and the server starts ignoring messages from restarted > workers because it thinks it processed them already. > I believe I've fixed these issues. I'll send a GitHub PR shortly. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components
[ https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033356#comment-16033356 ] ASF GitHub Bot commented on GIRAPH-1148: Github user dlogothetis commented on a diff in the pull request: https://github.com/apache/giraph/pull/39#discussion_r119680240 --- Diff: giraph-block-app/src/main/java/org/apache/giraph/block_app/library/Pieces.java --- @@ -320,6 +325,91 @@ public String toString() { } /** + * Like reduceAndBroadcast, but uses array of handles for reducers and + * broadcasts, to make it feasible and performant when values are large. + * Each supplied value to reduce will be reduced in the handle defined by + * handleHashSupplier%numHandles + * + * @param Single value type, objects passed on workers + * @param Reduced value type + * @param Vertex id type + * @param Vertex value type + * @param Edge value type + */ + public static + + Piece reduceAndBroadcastWithArrayOfHandles( + final String name, + final int numHandles, + final ReduceOperation reduceOp, + final SupplierFromVertex handleHashSupplier, + final SupplierFromVertex valueSupplier, + final ConsumerWithVertex reducedValueConsumer) { +return new Piece() { + protected ArrayOfHandles.ArrayOfReducers reducers; + protected BroadcastArrayHandle broadcasts; + + private int getHandleIndex(Vertex vertex) { +return (int) Math.abs(handleHashSupplier.get(vertex) % numHandles); + } + + @Override + public void registerReducers( + final CreateReducersApi reduceApi, Object executionStage) { +reducers = new ArrayOfHandles.ArrayOfReducers<>( +numHandles, +new Supplier>() { + @Override + public ReducerHandle get() { +return reduceApi.createLocalReducer(reduceOp); --- End diff -- This means that the same ReduceOperation instance is going to be shared across the different handles. Not entirely sure how the (de)serialization will work here. > Connected components - make calculate sizes work with large number of > components > > > Key: GIRAPH-1148 > URL: https://issues.apache.org/jira/browse/GIRAPH-1148 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo > > Currently if we have a graph with large number of connected components, > calculating connected components sizes fails because reducer becomes too > large. Use array of handles instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components
[ https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033358#comment-16033358 ] ASF GitHub Bot commented on GIRAPH-1148: Github user dlogothetis commented on a diff in the pull request: https://github.com/apache/giraph/pull/39#discussion_r119681042 --- Diff: giraph-block-app/src/main/java/org/apache/giraph/block_app/library/Pieces.java --- @@ -320,6 +325,91 @@ public String toString() { } /** + * Like reduceAndBroadcast, but uses array of handles for reducers and + * broadcasts, to make it feasible and performant when values are large. + * Each supplied value to reduce will be reduced in the handle defined by + * handleHashSupplier%numHandles + * + * @param Single value type, objects passed on workers + * @param Reduced value type + * @param Vertex id type + * @param Vertex value type + * @param Edge value type + */ + public static + + Piece reduceAndBroadcastWithArrayOfHandles( + final String name, + final int numHandles, + final ReduceOperation reduceOp, + final SupplierFromVertex handleHashSupplier, + final SupplierFromVertex valueSupplier, + final ConsumerWithVertex reducedValueConsumer) { +return new Piece() { + protected ArrayOfHandles.ArrayOfReducers reducers; + protected BroadcastArrayHandle broadcasts; + + private int getHandleIndex(Vertex vertex) { +return (int) Math.abs(handleHashSupplier.get(vertex) % numHandles); + } + + @Override + public void registerReducers( + final CreateReducersApi reduceApi, Object executionStage) { +reducers = new ArrayOfHandles.ArrayOfReducers<>( +numHandles, +new Supplier>() { + @Override + public ReducerHandle get() { +return reduceApi.createLocalReducer(reduceOp); --- End diff -- Actually, because you're using an ArrayOfHandles, this is going to be serialized-deserialized as a whole. This should be ok. Though, this also assumes that reduce operations are stateless. In general, they are but there's nothing that prevents from writing a stateful one. Maybe add a comment about this. > Connected components - make calculate sizes work with large number of > components > > > Key: GIRAPH-1148 > URL: https://issues.apache.org/jira/browse/GIRAPH-1148 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo > > Currently if we have a graph with large number of connected components, > calculating connected components sizes fails because reducer becomes too > large. Use array of handles instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work
[ https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033370#comment-16033370 ] ASF GitHub Bot commented on GIRAPH-1139: Github user majakabiljo commented on a diff in the pull request: https://github.com/apache/giraph/pull/30#discussion_r119682732 --- Diff: giraph-core/src/main/java/org/apache/giraph/bsp/BspService.java --- @@ -288,6 +289,9 @@ public BspService( throw new RuntimeException(e); } +this.taskId = (int)getApplicationAttempt() * conf.getMaxWorkers() + conf.getTaskPartition(); --- End diff -- Please fix checkstyle (make sure 'mvn verify' passes) > Resuming from checkpoint doesn't work > - > > Key: GIRAPH-1139 > URL: https://issues.apache.org/jira/browse/GIRAPH-1139 > Project: Giraph > Issue Type: Bug > Components: bsp >Affects Versions: 1.2.0 >Reporter: Nic Eggert > > I ran into a couple of issues when trying to get Giraph to resume from > checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker). > * If we just wrote a checkpoint, the master expects the workers to checkpoint > again, while the workers (correctly) clear the checkpointing flag. > * When workers restart, they take their task id from the partition number, > which stays the same across multiple attempts. This gets transferred to the > Netty clientId, and the server starts ignoring messages from restarted > workers because it thinks it processed them already. > I believe I've fixed these issues. I'll send a GitHub PR shortly. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work
[ https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033371#comment-16033371 ] ASF GitHub Bot commented on GIRAPH-1139: Github user majakabiljo commented on a diff in the pull request: https://github.com/apache/giraph/pull/30#discussion_r119682567 --- Diff: giraph-core/src/main/java/org/apache/giraph/bsp/BspService.java --- @@ -272,7 +273,7 @@ public BspService( } if (LOG.isInfoEnabled()) { LOG.info("BspService: Connecting to ZooKeeper with job " + jobId + - ", " + getTaskPartition() + " on " + serverPortList); + ", " + getTaskId() + " on " + serverPortList); --- End diff -- We need to set taskId before this > Resuming from checkpoint doesn't work > - > > Key: GIRAPH-1139 > URL: https://issues.apache.org/jira/browse/GIRAPH-1139 > Project: Giraph > Issue Type: Bug > Components: bsp >Affects Versions: 1.2.0 >Reporter: Nic Eggert > > I ran into a couple of issues when trying to get Giraph to resume from > checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker). > * If we just wrote a checkpoint, the master expects the workers to checkpoint > again, while the workers (correctly) clear the checkpointing flag. > * When workers restart, they take their task id from the partition number, > which stays the same across multiple attempts. This gets transferred to the > Netty clientId, and the server starts ignoring messages from restarted > workers because it thinks it processed them already. > I believe I've fixed these issues. I'll send a GitHub PR shortly. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components
[ https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033404#comment-16033404 ] ASF GitHub Bot commented on GIRAPH-1148: Github user majakabiljo commented on a diff in the pull request: https://github.com/apache/giraph/pull/39#discussion_r119686989 --- Diff: giraph-block-app/src/main/java/org/apache/giraph/block_app/library/Pieces.java --- @@ -320,6 +325,91 @@ public String toString() { } /** + * Like reduceAndBroadcast, but uses array of handles for reducers and + * broadcasts, to make it feasible and performant when values are large. + * Each supplied value to reduce will be reduced in the handle defined by + * handleHashSupplier%numHandles + * + * @param Single value type, objects passed on workers + * @param Reduced value type + * @param Vertex id type + * @param Vertex value type + * @param Edge value type + */ + public static + + Piece reduceAndBroadcastWithArrayOfHandles( + final String name, + final int numHandles, + final ReduceOperation reduceOp, + final SupplierFromVertex handleHashSupplier, + final SupplierFromVertex valueSupplier, + final ConsumerWithVertex reducedValueConsumer) { +return new Piece() { + protected ArrayOfHandles.ArrayOfReducers reducers; + protected BroadcastArrayHandle broadcasts; + + private int getHandleIndex(Vertex vertex) { +return (int) Math.abs(handleHashSupplier.get(vertex) % numHandles); + } + + @Override + public void registerReducers( + final CreateReducersApi reduceApi, Object executionStage) { +reducers = new ArrayOfHandles.ArrayOfReducers<>( +numHandles, +new Supplier>() { + @Override + public ReducerHandle get() { +return reduceApi.createLocalReducer(reduceOp); --- End diff -- Good catch, it didn't occur to me. I'll fix it not to reuse the same ReduceOperation object. > Connected components - make calculate sizes work with large number of > components > > > Key: GIRAPH-1148 > URL: https://issues.apache.org/jira/browse/GIRAPH-1148 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo > > Currently if we have a graph with large number of connected components, > calculating connected components sizes fails because reducer becomes too > large. Use array of handles instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components
[ https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033539#comment-16033539 ] ASF GitHub Bot commented on GIRAPH-1148: Github user dlogothetis commented on a diff in the pull request: https://github.com/apache/giraph/pull/39#discussion_r119708329 --- Diff: giraph-block-app-8/src/main/java/org/apache/giraph/block_app/library/prepare_graph/UndirectedConnectedComponents.java --- @@ -352,10 +352,15 @@ Block calculateConnectedComponentSizes( Pair componentToReducePair = Pair.of( new LongWritable(), new LongWritable(1)); LongWritable reusableLong = new LongWritable(); -return Pieces.reduceAndBroadcast( -"CalcConnectedComponentSizes", +// This reduce operation is stateless so we can use a single instance +BasicMapReduce reduceOperation = new BasicMapReduce<>( -LongTypeOps.INSTANCE, LongTypeOps.INSTANCE, SumReduce.LONG), +LongTypeOps.INSTANCE, LongTypeOps.INSTANCE, SumReduce.LONG); +return Pieces.reduceAndBroadcastWithArrayOfHandles( +"CalcConnectedComponentSizes", +3137, /* Just using some large prime number */ --- End diff -- Should this be configurable? > Connected components - make calculate sizes work with large number of > components > > > Key: GIRAPH-1148 > URL: https://issues.apache.org/jira/browse/GIRAPH-1148 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo > > Currently if we have a graph with large number of connected components, > calculating connected components sizes fails because reducer becomes too > large. Use array of handles instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components
[ https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033817#comment-16033817 ] ASF GitHub Bot commented on GIRAPH-1148: Github user majakabiljo commented on a diff in the pull request: https://github.com/apache/giraph/pull/39#discussion_r119744463 --- Diff: giraph-block-app-8/src/main/java/org/apache/giraph/block_app/library/prepare_graph/UndirectedConnectedComponents.java --- @@ -352,10 +352,15 @@ Block calculateConnectedComponentSizes( Pair componentToReducePair = Pair.of( new LongWritable(), new LongWritable(1)); LongWritable reusableLong = new LongWritable(); -return Pieces.reduceAndBroadcast( -"CalcConnectedComponentSizes", +// This reduce operation is stateless so we can use a single instance +BasicMapReduce reduceOperation = new BasicMapReduce<>( -LongTypeOps.INSTANCE, LongTypeOps.INSTANCE, SumReduce.LONG), +LongTypeOps.INSTANCE, LongTypeOps.INSTANCE, SumReduce.LONG); +return Pieces.reduceAndBroadcastWithArrayOfHandles( +"CalcConnectedComponentSizes", +3137, /* Just using some large prime number */ --- End diff -- I can't come up with a reason why someone would want to change it. This can start having problems only at trillion components which wouldn't work for many other reasons, for tiny ones this few reducers won't add any overhead, and for larger ones which were currently working this is still improvement since reducers are processed on many machines now. > Connected components - make calculate sizes work with large number of > components > > > Key: GIRAPH-1148 > URL: https://issues.apache.org/jira/browse/GIRAPH-1148 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo > > Currently if we have a graph with large number of connected components, > calculating connected components sizes fails because reducer becomes too > large. Use array of handles instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components
[ https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033823#comment-16033823 ] ASF GitHub Bot commented on GIRAPH-1148: Github user dlogothetis commented on a diff in the pull request: https://github.com/apache/giraph/pull/39#discussion_r119745185 --- Diff: giraph-block-app-8/src/main/java/org/apache/giraph/block_app/library/prepare_graph/UndirectedConnectedComponents.java --- @@ -352,10 +352,15 @@ Block calculateConnectedComponentSizes( Pair componentToReducePair = Pair.of( new LongWritable(), new LongWritable(1)); LongWritable reusableLong = new LongWritable(); -return Pieces.reduceAndBroadcast( -"CalcConnectedComponentSizes", +// This reduce operation is stateless so we can use a single instance +BasicMapReduce reduceOperation = new BasicMapReduce<>( -LongTypeOps.INSTANCE, LongTypeOps.INSTANCE, SumReduce.LONG), +LongTypeOps.INSTANCE, LongTypeOps.INSTANCE, SumReduce.LONG); +return Pieces.reduceAndBroadcastWithArrayOfHandles( +"CalcConnectedComponentSizes", +3137, /* Just using some large prime number */ --- End diff -- Sounds good. Looks good then. > Connected components - make calculate sizes work with large number of > components > > > Key: GIRAPH-1148 > URL: https://issues.apache.org/jira/browse/GIRAPH-1148 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo > > Currently if we have a graph with large number of connected components, > calculating connected components sizes fails because reducer becomes too > large. Use array of handles instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components
[ https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035065#comment-16035065 ] ASF GitHub Bot commented on GIRAPH-1148: Github user asfgit closed the pull request at: https://github.com/apache/giraph/pull/39 > Connected components - make calculate sizes work with large number of > components > > > Key: GIRAPH-1148 > URL: https://issues.apache.org/jira/browse/GIRAPH-1148 > Project: Giraph > Issue Type: Improvement >Reporter: Maja Kabiljo >Assignee: Maja Kabiljo > > Currently if we have a graph with large number of connected components, > calculating connected components sizes fails because reducer becomes too > large. Use array of handles instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work
[ https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049379#comment-16049379 ] ASF GitHub Bot commented on GIRAPH-1139: Github user neggert commented on a diff in the pull request: https://github.com/apache/giraph/pull/30#discussion_r122006839 --- Diff: giraph-core/src/main/java/org/apache/giraph/bsp/BspService.java --- @@ -272,7 +273,7 @@ public BspService( } if (LOG.isInfoEnabled()) { LOG.info("BspService: Connecting to ZooKeeper with job " + jobId + - ", " + getTaskPartition() + " on " + serverPortList); + ", " + getTaskId() + " on " + serverPortList); --- End diff -- We can't actually set `taskId` before creating the ZK client, since we need ZK to get the get the application attempt. I changed this to log the partition instead, which we can get from `conf`. > Resuming from checkpoint doesn't work > - > > Key: GIRAPH-1139 > URL: https://issues.apache.org/jira/browse/GIRAPH-1139 > Project: Giraph > Issue Type: Bug > Components: bsp >Affects Versions: 1.2.0 >Reporter: Nic Eggert > > I ran into a couple of issues when trying to get Giraph to resume from > checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker). > * If we just wrote a checkpoint, the master expects the workers to checkpoint > again, while the workers (correctly) clear the checkpointing flag. > * When workers restart, they take their task id from the partition number, > which stays the same across multiple attempts. This gets transferred to the > Netty clientId, and the server starts ignoring messages from restarted > workers because it thinks it processed them already. > I believe I've fixed these issues. I'll send a GitHub PR shortly. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work
[ https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049382#comment-16049382 ] ASF GitHub Bot commented on GIRAPH-1139: Github user neggert commented on the issue: https://github.com/apache/giraph/pull/30 Fixed @majakabiljo's comments. @edunov > Resuming from checkpoint doesn't work > - > > Key: GIRAPH-1139 > URL: https://issues.apache.org/jira/browse/GIRAPH-1139 > Project: Giraph > Issue Type: Bug > Components: bsp >Affects Versions: 1.2.0 >Reporter: Nic Eggert > > I ran into a couple of issues when trying to get Giraph to resume from > checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker). > * If we just wrote a checkpoint, the master expects the workers to checkpoint > again, while the workers (correctly) clear the checkpointing flag. > * When workers restart, they take their task id from the partition number, > which stays the same across multiple attempts. This gets transferred to the > Netty clientId, and the server starts ignoring messages from restarted > workers because it thinks it processed them already. > I believe I've fixed these issues. I'll send a GitHub PR shortly. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (GIRAPH-1151) Avoid message value factory initialization in Apache Giraph
[ https://issues.apache.org/jira/browse/GIRAPH-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130051#comment-16130051 ] ASF GitHub Bot commented on GIRAPH-1151: GitHub user yukselakinci opened a pull request: https://github.com/apache/giraph/pull/42 Avoid message value factory initialization in Apache Giraph Jira ticket id: https://issues.apache.org/jira/browse/GIRAPH-1151 You can merge this pull request into a Git repository by running: $ git pull https://github.com/yukselakinci/giraph avoidfactoryinit Alternatively you can review and apply these changes as the patch at: https://github.com/apache/giraph/pull/42.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #42 commit 1216cde8326753c5a6fafa315cab2d59c7c7d7d0 Author: Yuksel Akinci Date: 2017-08-16T23:31:51Z Avoid message value factory initialization in Apache Giraph Jira ticket id: https://issues.apache.org/jira/browse/GIRAPH-1151 > Avoid message value factory initialization in Apache Giraph > --- > > Key: GIRAPH-1151 > URL: https://issues.apache.org/jira/browse/GIRAPH-1151 > Project: Giraph > Issue Type: Improvement >Reporter: Yuksel Akinci > Original Estimate: 168h > Remaining Estimate: 168h > > Messages in Giraph are instantiated using a "message value factory" class. > Currently the message value factory gets instantiated every time a message is > sent, which is unnecessary and can cause high overhead if the message value > factory constructor contains expensive operation. Factory objects are be > saved to avoid repeated expensive object creations. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (GIRAPH-1151) Avoid message value factory initialization in Apache Giraph
[ https://issues.apache.org/jira/browse/GIRAPH-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16137400#comment-16137400 ] ASF GitHub Bot commented on GIRAPH-1151: Github user yukselakinci closed the pull request at: https://github.com/apache/giraph/pull/42 > Avoid message value factory initialization in Apache Giraph > --- > > Key: GIRAPH-1151 > URL: https://issues.apache.org/jira/browse/GIRAPH-1151 > Project: Giraph > Issue Type: Improvement >Reporter: Yuksel Akinci > Original Estimate: 168h > Remaining Estimate: 168h > > Messages in Giraph are instantiated using a "message value factory" class. > Currently the message value factory gets instantiated every time a message is > sent, which is unnecessary and can cause high overhead if the message value > factory constructor contains expensive operation. Factory objects are be > saved to avoid repeated expensive object creations. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work
[ https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138529#comment-16138529 ] ASF GitHub Bot commented on GIRAPH-1139: Github user neggert commented on the issue: https://github.com/apache/giraph/pull/30 Ping @edunov > Resuming from checkpoint doesn't work > - > > Key: GIRAPH-1139 > URL: https://issues.apache.org/jira/browse/GIRAPH-1139 > Project: Giraph > Issue Type: Bug > Components: bsp >Affects Versions: 1.2.0 >Reporter: Nic Eggert > > I ran into a couple of issues when trying to get Giraph to resume from > checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker). > * If we just wrote a checkpoint, the master expects the workers to checkpoint > again, while the workers (correctly) clear the checkpointing flag. > * When workers restart, they take their task id from the partition number, > which stays the same across multiple attempts. This gets transferred to the > Netty clientId, and the server starts ignoring messages from restarted > workers because it thinks it processed them already. > I believe I've fixed these issues. I'll send a GitHub PR shortly. -- This message was sent by Atlassian JIRA (v6.4.14#64029)