from:"\"ASF GitHub Bot \\\(JIRA\\\)\""

[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core

2016-12-05 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15723824#comment-15723824
 ] 

ASF GitHub Bot commented on GIRAPH-1125:


GitHub user heslami opened a pull request:

https://github.com/apache/giraph/pull/12

[GIRAPH-1125] Memory estimation mechanism for more efficient OOC execution

**Thanks to Dionysios Logothetis for helping a lot with this diff.**

The new out-of-core mechanism is designed with the adaptivity goal in mind, 
meaning that we wanted the out-of-core mechanism to kick in only when it is 
necessary. In other words, when the amount of data (graph, messages, and 
mutations) all fit in memory, we want to take advantage of the entire memory. 
And, when in a stage the memory is short, only enough (minimal) amount of data 
goes out of core (to disk). This ensures a good performance for the out-of-core 
mechanism.

To satisfy the adaptiveness goal, we need to know how much memory is used 
at each point of time. The default out-of-core mechanism (ThresholdBasedOracle) 
get memory information based on JVM's internal methods (Runtime's 
freeMemory()). This method is inaccurate (and pessimistic), meaning that it 
does not account for garbage data that has not been purged by GC. Using JVM's 
default methods, OOC behaves pessimistically and move data out of core even if 
it is not necessary. For instance,
consider the case where there are a lot of garbage on the heap, but GC has 
not happened for a while. In this case, the default OOC pushes data on disk and 
immediately after a major GC it brings back the data to memory. This causes 
inefficiency in the default out of core mechanism. If out-of-core is used, but 
the data can entirely fit in memory, the job goes out of core even though going 
out of core is not necessary.

To address this issue, we need to have a mechanism to more accurately know 
how much of heap is filled with non-garbage data. Consequently, we need to 
change the Oracle (OOC policy) to take advantage of a more accurate memory 
usage estimation.

In this diff, we introduce a mechanism to estimate the amount of memory 
used by non-garbage data on the heap at each point of time. This estimation is 
based on the fact that Giraph is a data-parallel system in its essence, meaning 
that several types of threads exist, each type doing the same computation on 
various data. More specifically, we have compute/input threads, communication 
(receiving) threads, and OOC-IO threads. In a normal uniform execution, each 
type of threads behave
similarly and contribute similarly to each other on the memory footprint 
(meaning that different compute threads contribute similarly to each other on 
the memory footprint). In the proposed approach, we use a measure of progress 
for each type of thread and use linear regression to estimate the amount of 
memory.

The measure of progress for compute threads is the total number of vertices 
they have collectively processed in a superstep at each point, the measure of 
progress for communication threads is the total number of bytes received by a 
worker up to each point, and the measure of progress for IO threads is the 
amount of data read/written to disk up to each point during a superstep. These 
measures are restarted at the beginning of each superstep. We use these 
measures at the point where full GC
happens (when we have the accurate estimation of non-garbage data on the 
heap) and devise the linear model of used memory. We then use the linear model 
to estimate the amount of memory at each time based on the above progress 
measures.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/heslami/giraph ooc-memory-estimation

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/12.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #12


commit 4d2b7c96da94aa26cddf75655dd6cb7d93bf9b1d
Author: Hassan Eslami 
Date:   2016-12-05T23:37:29Z

[GIRAPH-1125] Memory estimation mechanism for more efficient OOC execution

The new out-of-core mechanism is designed with the adaptivity goal in mind, 
meaning that we wanted out-of-core mechanism to kick in only when it is 
necessary. In other words, when the amount of data (graph, messages, and 
mutations) all fit in memory, we want to take advantage of the entire memory. 
And, when in a stage the memory is short, only enough (minimal) amount of data 
goes out of core (to disk). This ensures a good performance for the out-of-core 
mechanism.

To satisfy the adaptiveness goal, we need to know how much memory is used 
at each point of time. The default out-of-core mechanism (ThresholdBasedOracle) 
ge

[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core

2016-12-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15749458#comment-15749458
 ] 

ASF GitHub Bot commented on GIRAPH-1125:


Github user edunov commented on a diff in the pull request:

https://github.com/apache/giraph/pull/12#discussion_r92487131
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/utils/ThreadLocalProgressCounter.java
 ---
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.giraph.utils;
+
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Makes a list of {@link ProgressCounter} accessible through
+ * a {@link ThreadLocal}.
+ */
+public class ThreadLocalProgressCounter extends 
ThreadLocal {
+  /**
+   * List of counters.
+   */
+  private final List counters = new ArrayList<>();
+
+  /**
+   * Initializes a new counter, adds it to the list of counters
+   * and returns it.
+   * @return Progress counter.
+   */
+  @Override
+  protected ProgressCounter initialValue() {
+ProgressCounter threadCounter = new ProgressCounter();
+synchronized (counters) {
+  counters.add(threadCounter);
+}
+return threadCounter;
+  }
+
+  /**
+   * Sums the progress of all counters.
+   * @return Sum of all counters
+   */
+  public long getProgress() {
+long progress = 0;
+synchronized (counters) {
+  for (ProgressCounter entry : counters) {
+progress += entry.getValue();
+  }
+}
+return progress;
+  }
+
+  /**
+   * Removes all counters.
+   */
+  public void reset() {
--- End diff --

What is the purpose of this function and how do you use it? 


> Add memory estimation mechanism to out-of-core
> --
>
> Key: GIRAPH-1125
> URL: https://issues.apache.org/jira/browse/GIRAPH-1125
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Hassan Eslami
>Assignee: Hassan Eslami
>
> The new out-of-core mechanism is designed with the adaptivity goal in mind, 
> meaning that we wanted out-of-core mechanism to kick in only when it is 
> necessary. In other words, when the amount of data (graph, messages, and 
> mutations) all fit in memory, we want to take advantage of the entire memory. 
> And, when in a stage the memory is short, only enough (minimal) amount of 
> data goes out of core (to disk). This ensures a good performance for the 
> out-of-core mechanism.
> To satisfy the adaptiveness goal, we need to know how much memory is used at 
> each point of time. The default out-of-core mechanism (ThresholdBasedOracle) 
> get memory information based on JVM's internal methods (Runtime's 
> freeMemory()). This method is inaccurate (and pessimistic), meaning that it 
> does not account for garbage data that has not been purged by GC. Using JVM's 
> default methods, OOC behaves pessimistically and move data out of core even 
> if it is not necessary. For instance, consider the case where there are a lot 
> of garbage on the heap, but GC has not happened for a while. In this case, 
> the default OOC pushes data on disk and immediately after a major GC it 
> brings back the data to memory. This causes inefficiency in the default out 
> of core mechanism. If out-of-core is used but the data can entirely fit in 
> memory, the job goes out of core even though going out of core is not 
> necessary.
> To address this issue, we need to have a mechanism to more accurately know 
> how much of heap is filled with non-garbage data. Consequently, we need to 
> change the Oracle (OOC policy) to take advantage of a more accurate memory 
> usage estimation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core

2016-12-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15755780#comment-15755780
 ] 

ASF GitHub Bot commented on GIRAPH-1125:


Github user edunov commented on a diff in the pull request:

https://github.com/apache/giraph/pull/12#discussion_r92903138
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java
 ---
@@ -0,0 +1,851 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.giraph.ooc.policy;
+
+import com.sun.management.GarbageCollectionNotificationInfo;
+import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression;
+import org.apache.giraph.comm.NetworkMetrics;
+import org.apache.giraph.conf.FloatConfOption;
+import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration;
+import org.apache.giraph.conf.LongConfOption;
+import org.apache.giraph.edge.AbstractEdgeStore;
+import org.apache.giraph.ooc.OutOfCoreEngine;
+import org.apache.giraph.ooc.command.IOCommand;
+import org.apache.giraph.ooc.command.LoadPartitionIOCommand;
+import org.apache.giraph.ooc.command.WaitIOCommand;
+import org.apache.giraph.worker.EdgeInputSplitsCallable;
+import org.apache.giraph.worker.VertexInputSplitsCallable;
+import org.apache.giraph.worker.WorkerProgress;
+import org.apache.log4j.Logger;
+
+import java.lang.management.ManagementFactory;
+import java.lang.management.MemoryPoolMXBean;
+import java.lang.management.MemoryUsage;
+import java.util.List;
+import java.util.Map;
+import java.util.Vector;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.locks.Lock;
+import java.util.concurrent.locks.ReentrantLock;
+
+import static com.google.common.base.Preconditions.checkState;
+
+/**
+ * Implementation of {@link OutOfCoreOracle} that uses a linear regression 
model
+ * to estimate actual memory usage based on the current state of 
computation.
+ * The model takes into consideration 5 parameters:
+ *
+ * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5
+ *
+ * y: memory usage
+ * x1: edges loaded
+ * x2: vertices loaded
+ * x3: vertices processed
+ * x4: bytes received due to messages
+ * x5: bytes loaded/stored from/to disk due to OOC.
+ *
+ */
+public class MemoryEstimatorOracle implements OutOfCoreOracle {
+  /** Memory check interval in msec */
+  public static final LongConfOption CHECK_MEMORY_INTERVAL =
+new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000,
+"The interval where memory checker thread wakes up and " +
+"monitors memory footprint (in milliseconds)");
+  /**
+   * If mem-usage is above this threshold and no Full GC has been called,
+   * we call it manually
+   */
+  public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE =
+new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f,
+"The threshold above which GC is called manually if Full GC has 
not " +
+"happened in a while");
+  /** Used to detect a high memory pressure situation */
+  public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION =
+new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f,
+"Minimum percentage of memory we expect to be reclaimed after a 
Full " +
+"GC. If less than this amount is reclaimed, it is sage to say 
" +
+"we are in a high memory situation and the estimation 
mechanism " +
+"has not recognized it yet!");
+  /** If mem-usage is above this threshold, active threads are set to 0 */
+  public static final FloatConfOption AM_HIGH_THRESHOLD =
+new FloatConfOption("giraph.amHighThreshold", 0.95f,
+"If mem-usage is above this threshold, all active threads " +
+"(compute/input) are paused.");
+  /** If me

[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core

2016-12-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15755794#comment-15755794
 ] 

ASF GitHub Bot commented on GIRAPH-1125:


Github user edunov commented on a diff in the pull request:

https://github.com/apache/giraph/pull/12#discussion_r92903902
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java
 ---
@@ -0,0 +1,851 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.giraph.ooc.policy;
+
+import com.sun.management.GarbageCollectionNotificationInfo;
+import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression;
+import org.apache.giraph.comm.NetworkMetrics;
+import org.apache.giraph.conf.FloatConfOption;
+import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration;
+import org.apache.giraph.conf.LongConfOption;
+import org.apache.giraph.edge.AbstractEdgeStore;
+import org.apache.giraph.ooc.OutOfCoreEngine;
+import org.apache.giraph.ooc.command.IOCommand;
+import org.apache.giraph.ooc.command.LoadPartitionIOCommand;
+import org.apache.giraph.ooc.command.WaitIOCommand;
+import org.apache.giraph.worker.EdgeInputSplitsCallable;
+import org.apache.giraph.worker.VertexInputSplitsCallable;
+import org.apache.giraph.worker.WorkerProgress;
+import org.apache.log4j.Logger;
+
+import java.lang.management.ManagementFactory;
+import java.lang.management.MemoryPoolMXBean;
+import java.lang.management.MemoryUsage;
+import java.util.List;
+import java.util.Map;
+import java.util.Vector;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.locks.Lock;
+import java.util.concurrent.locks.ReentrantLock;
+
+import static com.google.common.base.Preconditions.checkState;
+
+/**
+ * Implementation of {@link OutOfCoreOracle} that uses a linear regression 
model
+ * to estimate actual memory usage based on the current state of 
computation.
+ * The model takes into consideration 5 parameters:
+ *
+ * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5
+ *
+ * y: memory usage
+ * x1: edges loaded
+ * x2: vertices loaded
+ * x3: vertices processed
+ * x4: bytes received due to messages
+ * x5: bytes loaded/stored from/to disk due to OOC.
+ *
+ */
+public class MemoryEstimatorOracle implements OutOfCoreOracle {
+  /** Memory check interval in msec */
+  public static final LongConfOption CHECK_MEMORY_INTERVAL =
+new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000,
+"The interval where memory checker thread wakes up and " +
+"monitors memory footprint (in milliseconds)");
+  /**
+   * If mem-usage is above this threshold and no Full GC has been called,
+   * we call it manually
+   */
+  public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE =
+new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f,
+"The threshold above which GC is called manually if Full GC has 
not " +
+"happened in a while");
+  /** Used to detect a high memory pressure situation */
+  public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION =
+new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f,
+"Minimum percentage of memory we expect to be reclaimed after a 
Full " +
+"GC. If less than this amount is reclaimed, it is sage to say 
" +
+"we are in a high memory situation and the estimation 
mechanism " +
+"has not recognized it yet!");
+  /** If mem-usage is above this threshold, active threads are set to 0 */
+  public static final FloatConfOption AM_HIGH_THRESHOLD =
+new FloatConfOption("giraph.amHighThreshold", 0.95f,
+"If mem-usage is above this threshold, all active threads " +
+"(compute/input) are paused.");
+  /** If me

[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core

2016-12-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15755824#comment-15755824
 ] 

ASF GitHub Bot commented on GIRAPH-1125:


Github user edunov commented on a diff in the pull request:

https://github.com/apache/giraph/pull/12#discussion_r92904852
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java
 ---
@@ -0,0 +1,851 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.giraph.ooc.policy;
+
+import com.sun.management.GarbageCollectionNotificationInfo;
+import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression;
+import org.apache.giraph.comm.NetworkMetrics;
+import org.apache.giraph.conf.FloatConfOption;
+import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration;
+import org.apache.giraph.conf.LongConfOption;
+import org.apache.giraph.edge.AbstractEdgeStore;
+import org.apache.giraph.ooc.OutOfCoreEngine;
+import org.apache.giraph.ooc.command.IOCommand;
+import org.apache.giraph.ooc.command.LoadPartitionIOCommand;
+import org.apache.giraph.ooc.command.WaitIOCommand;
+import org.apache.giraph.worker.EdgeInputSplitsCallable;
+import org.apache.giraph.worker.VertexInputSplitsCallable;
+import org.apache.giraph.worker.WorkerProgress;
+import org.apache.log4j.Logger;
+
+import java.lang.management.ManagementFactory;
+import java.lang.management.MemoryPoolMXBean;
+import java.lang.management.MemoryUsage;
+import java.util.List;
+import java.util.Map;
+import java.util.Vector;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.locks.Lock;
+import java.util.concurrent.locks.ReentrantLock;
+
+import static com.google.common.base.Preconditions.checkState;
+
+/**
+ * Implementation of {@link OutOfCoreOracle} that uses a linear regression 
model
+ * to estimate actual memory usage based on the current state of 
computation.
+ * The model takes into consideration 5 parameters:
+ *
+ * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5
+ *
+ * y: memory usage
+ * x1: edges loaded
+ * x2: vertices loaded
+ * x3: vertices processed
+ * x4: bytes received due to messages
+ * x5: bytes loaded/stored from/to disk due to OOC.
+ *
+ */
+public class MemoryEstimatorOracle implements OutOfCoreOracle {
+  /** Memory check interval in msec */
+  public static final LongConfOption CHECK_MEMORY_INTERVAL =
+new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000,
+"The interval where memory checker thread wakes up and " +
+"monitors memory footprint (in milliseconds)");
+  /**
+   * If mem-usage is above this threshold and no Full GC has been called,
+   * we call it manually
+   */
+  public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE =
+new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f,
+"The threshold above which GC is called manually if Full GC has 
not " +
+"happened in a while");
+  /** Used to detect a high memory pressure situation */
+  public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION =
+new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f,
+"Minimum percentage of memory we expect to be reclaimed after a 
Full " +
+"GC. If less than this amount is reclaimed, it is sage to say 
" +
+"we are in a high memory situation and the estimation 
mechanism " +
+"has not recognized it yet!");
+  /** If mem-usage is above this threshold, active threads are set to 0 */
+  public static final FloatConfOption AM_HIGH_THRESHOLD =
+new FloatConfOption("giraph.amHighThreshold", 0.95f,
+"If mem-usage is above this threshold, all active threads " +
+"(compute/input) are paused.");
+  /** If me

[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core

2016-12-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15755842#comment-15755842
 ] 

ASF GitHub Bot commented on GIRAPH-1125:


Github user edunov commented on a diff in the pull request:

https://github.com/apache/giraph/pull/12#discussion_r92905726
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java
 ---
@@ -0,0 +1,851 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.giraph.ooc.policy;
+
+import com.sun.management.GarbageCollectionNotificationInfo;
+import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression;
+import org.apache.giraph.comm.NetworkMetrics;
+import org.apache.giraph.conf.FloatConfOption;
+import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration;
+import org.apache.giraph.conf.LongConfOption;
+import org.apache.giraph.edge.AbstractEdgeStore;
+import org.apache.giraph.ooc.OutOfCoreEngine;
+import org.apache.giraph.ooc.command.IOCommand;
+import org.apache.giraph.ooc.command.LoadPartitionIOCommand;
+import org.apache.giraph.ooc.command.WaitIOCommand;
+import org.apache.giraph.worker.EdgeInputSplitsCallable;
+import org.apache.giraph.worker.VertexInputSplitsCallable;
+import org.apache.giraph.worker.WorkerProgress;
+import org.apache.log4j.Logger;
+
+import java.lang.management.ManagementFactory;
+import java.lang.management.MemoryPoolMXBean;
+import java.lang.management.MemoryUsage;
+import java.util.List;
+import java.util.Map;
+import java.util.Vector;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.locks.Lock;
+import java.util.concurrent.locks.ReentrantLock;
+
+import static com.google.common.base.Preconditions.checkState;
+
+/**
+ * Implementation of {@link OutOfCoreOracle} that uses a linear regression 
model
+ * to estimate actual memory usage based on the current state of 
computation.
+ * The model takes into consideration 5 parameters:
+ *
+ * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5
+ *
+ * y: memory usage
+ * x1: edges loaded
+ * x2: vertices loaded
+ * x3: vertices processed
+ * x4: bytes received due to messages
+ * x5: bytes loaded/stored from/to disk due to OOC.
+ *
+ */
+public class MemoryEstimatorOracle implements OutOfCoreOracle {
+  /** Memory check interval in msec */
+  public static final LongConfOption CHECK_MEMORY_INTERVAL =
+new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000,
+"The interval where memory checker thread wakes up and " +
+"monitors memory footprint (in milliseconds)");
+  /**
+   * If mem-usage is above this threshold and no Full GC has been called,
+   * we call it manually
+   */
+  public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE =
+new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f,
+"The threshold above which GC is called manually if Full GC has 
not " +
+"happened in a while");
+  /** Used to detect a high memory pressure situation */
+  public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION =
+new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f,
+"Minimum percentage of memory we expect to be reclaimed after a 
Full " +
+"GC. If less than this amount is reclaimed, it is sage to say 
" +
+"we are in a high memory situation and the estimation 
mechanism " +
+"has not recognized it yet!");
+  /** If mem-usage is above this threshold, active threads are set to 0 */
+  public static final FloatConfOption AM_HIGH_THRESHOLD =
+new FloatConfOption("giraph.amHighThreshold", 0.95f,
+"If mem-usage is above this threshold, all active threads " +
+"(compute/input) are paused.");
+  /** If me

[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core

2016-12-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15755874#comment-15755874
 ] 

ASF GitHub Bot commented on GIRAPH-1125:


Github user edunov commented on a diff in the pull request:

https://github.com/apache/giraph/pull/12#discussion_r92906967
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java
 ---
@@ -0,0 +1,851 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.giraph.ooc.policy;
+
+import com.sun.management.GarbageCollectionNotificationInfo;
+import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression;
+import org.apache.giraph.comm.NetworkMetrics;
+import org.apache.giraph.conf.FloatConfOption;
+import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration;
+import org.apache.giraph.conf.LongConfOption;
+import org.apache.giraph.edge.AbstractEdgeStore;
+import org.apache.giraph.ooc.OutOfCoreEngine;
+import org.apache.giraph.ooc.command.IOCommand;
+import org.apache.giraph.ooc.command.LoadPartitionIOCommand;
+import org.apache.giraph.ooc.command.WaitIOCommand;
+import org.apache.giraph.worker.EdgeInputSplitsCallable;
+import org.apache.giraph.worker.VertexInputSplitsCallable;
+import org.apache.giraph.worker.WorkerProgress;
+import org.apache.log4j.Logger;
+
+import java.lang.management.ManagementFactory;
+import java.lang.management.MemoryPoolMXBean;
+import java.lang.management.MemoryUsage;
+import java.util.List;
+import java.util.Map;
+import java.util.Vector;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.locks.Lock;
+import java.util.concurrent.locks.ReentrantLock;
+
+import static com.google.common.base.Preconditions.checkState;
+
+/**
+ * Implementation of {@link OutOfCoreOracle} that uses a linear regression 
model
+ * to estimate actual memory usage based on the current state of 
computation.
+ * The model takes into consideration 5 parameters:
+ *
+ * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5
+ *
+ * y: memory usage
+ * x1: edges loaded
+ * x2: vertices loaded
+ * x3: vertices processed
+ * x4: bytes received due to messages
+ * x5: bytes loaded/stored from/to disk due to OOC.
+ *
+ */
+public class MemoryEstimatorOracle implements OutOfCoreOracle {
+  /** Memory check interval in msec */
+  public static final LongConfOption CHECK_MEMORY_INTERVAL =
+new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000,
+"The interval where memory checker thread wakes up and " +
+"monitors memory footprint (in milliseconds)");
+  /**
+   * If mem-usage is above this threshold and no Full GC has been called,
+   * we call it manually
+   */
+  public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE =
+new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f,
+"The threshold above which GC is called manually if Full GC has 
not " +
+"happened in a while");
+  /** Used to detect a high memory pressure situation */
+  public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION =
+new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f,
+"Minimum percentage of memory we expect to be reclaimed after a 
Full " +
+"GC. If less than this amount is reclaimed, it is sage to say 
" +
+"we are in a high memory situation and the estimation 
mechanism " +
+"has not recognized it yet!");
+  /** If mem-usage is above this threshold, active threads are set to 0 */
+  public static final FloatConfOption AM_HIGH_THRESHOLD =
+new FloatConfOption("giraph.amHighThreshold", 0.95f,
+"If mem-usage is above this threshold, all active threads " +
+"(compute/input) are paused.");
+  /** If me

[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core

2016-12-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15755878#comment-15755878
 ] 

ASF GitHub Bot commented on GIRAPH-1125:


Github user edunov commented on a diff in the pull request:

https://github.com/apache/giraph/pull/12#discussion_r92907194
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java
 ---
@@ -0,0 +1,851 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.giraph.ooc.policy;
+
+import com.sun.management.GarbageCollectionNotificationInfo;
+import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression;
+import org.apache.giraph.comm.NetworkMetrics;
+import org.apache.giraph.conf.FloatConfOption;
+import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration;
+import org.apache.giraph.conf.LongConfOption;
+import org.apache.giraph.edge.AbstractEdgeStore;
+import org.apache.giraph.ooc.OutOfCoreEngine;
+import org.apache.giraph.ooc.command.IOCommand;
+import org.apache.giraph.ooc.command.LoadPartitionIOCommand;
+import org.apache.giraph.ooc.command.WaitIOCommand;
+import org.apache.giraph.worker.EdgeInputSplitsCallable;
+import org.apache.giraph.worker.VertexInputSplitsCallable;
+import org.apache.giraph.worker.WorkerProgress;
+import org.apache.log4j.Logger;
+
+import java.lang.management.ManagementFactory;
+import java.lang.management.MemoryPoolMXBean;
+import java.lang.management.MemoryUsage;
+import java.util.List;
+import java.util.Map;
+import java.util.Vector;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.locks.Lock;
+import java.util.concurrent.locks.ReentrantLock;
+
+import static com.google.common.base.Preconditions.checkState;
+
+/**
+ * Implementation of {@link OutOfCoreOracle} that uses a linear regression 
model
+ * to estimate actual memory usage based on the current state of 
computation.
+ * The model takes into consideration 5 parameters:
+ *
+ * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5
+ *
+ * y: memory usage
+ * x1: edges loaded
+ * x2: vertices loaded
+ * x3: vertices processed
+ * x4: bytes received due to messages
+ * x5: bytes loaded/stored from/to disk due to OOC.
+ *
+ */
+public class MemoryEstimatorOracle implements OutOfCoreOracle {
+  /** Memory check interval in msec */
+  public static final LongConfOption CHECK_MEMORY_INTERVAL =
+new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000,
+"The interval where memory checker thread wakes up and " +
+"monitors memory footprint (in milliseconds)");
+  /**
+   * If mem-usage is above this threshold and no Full GC has been called,
+   * we call it manually
+   */
+  public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE =
+new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f,
+"The threshold above which GC is called manually if Full GC has 
not " +
+"happened in a while");
+  /** Used to detect a high memory pressure situation */
+  public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION =
+new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f,
+"Minimum percentage of memory we expect to be reclaimed after a 
Full " +
+"GC. If less than this amount is reclaimed, it is sage to say 
" +
+"we are in a high memory situation and the estimation 
mechanism " +
+"has not recognized it yet!");
+  /** If mem-usage is above this threshold, active threads are set to 0 */
+  public static final FloatConfOption AM_HIGH_THRESHOLD =
+new FloatConfOption("giraph.amHighThreshold", 0.95f,
+"If mem-usage is above this threshold, all active threads " +
+"(compute/input) are paused.");
+  /** If me

[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core

2016-12-19 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762437#comment-15762437
 ] 

ASF GitHub Bot commented on GIRAPH-1125:


Github user heslami commented on a diff in the pull request:

https://github.com/apache/giraph/pull/12#discussion_r93129286
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java
 ---
@@ -0,0 +1,851 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.giraph.ooc.policy;
+
+import com.sun.management.GarbageCollectionNotificationInfo;
+import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression;
+import org.apache.giraph.comm.NetworkMetrics;
+import org.apache.giraph.conf.FloatConfOption;
+import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration;
+import org.apache.giraph.conf.LongConfOption;
+import org.apache.giraph.edge.AbstractEdgeStore;
+import org.apache.giraph.ooc.OutOfCoreEngine;
+import org.apache.giraph.ooc.command.IOCommand;
+import org.apache.giraph.ooc.command.LoadPartitionIOCommand;
+import org.apache.giraph.ooc.command.WaitIOCommand;
+import org.apache.giraph.worker.EdgeInputSplitsCallable;
+import org.apache.giraph.worker.VertexInputSplitsCallable;
+import org.apache.giraph.worker.WorkerProgress;
+import org.apache.log4j.Logger;
+
+import java.lang.management.ManagementFactory;
+import java.lang.management.MemoryPoolMXBean;
+import java.lang.management.MemoryUsage;
+import java.util.List;
+import java.util.Map;
+import java.util.Vector;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.locks.Lock;
+import java.util.concurrent.locks.ReentrantLock;
+
+import static com.google.common.base.Preconditions.checkState;
+
+/**
+ * Implementation of {@link OutOfCoreOracle} that uses a linear regression 
model
+ * to estimate actual memory usage based on the current state of 
computation.
+ * The model takes into consideration 5 parameters:
+ *
+ * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5
+ *
+ * y: memory usage
+ * x1: edges loaded
+ * x2: vertices loaded
+ * x3: vertices processed
+ * x4: bytes received due to messages
+ * x5: bytes loaded/stored from/to disk due to OOC.
+ *
+ */
+public class MemoryEstimatorOracle implements OutOfCoreOracle {
+  /** Memory check interval in msec */
+  public static final LongConfOption CHECK_MEMORY_INTERVAL =
+new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000,
+"The interval where memory checker thread wakes up and " +
+"monitors memory footprint (in milliseconds)");
+  /**
+   * If mem-usage is above this threshold and no Full GC has been called,
+   * we call it manually
+   */
+  public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE =
+new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f,
+"The threshold above which GC is called manually if Full GC has 
not " +
+"happened in a while");
+  /** Used to detect a high memory pressure situation */
+  public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION =
+new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f,
+"Minimum percentage of memory we expect to be reclaimed after a 
Full " +
+"GC. If less than this amount is reclaimed, it is sage to say 
" +
+"we are in a high memory situation and the estimation 
mechanism " +
+"has not recognized it yet!");
+  /** If mem-usage is above this threshold, active threads are set to 0 */
+  public static final FloatConfOption AM_HIGH_THRESHOLD =
+new FloatConfOption("giraph.amHighThreshold", 0.95f,
+"If mem-usage is above this threshold, all active threads " +
+"(compute/input) are paused.");
+  /** If m

[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core

2016-12-19 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762464#comment-15762464
 ] 

ASF GitHub Bot commented on GIRAPH-1125:


Github user heslami commented on a diff in the pull request:

https://github.com/apache/giraph/pull/12#discussion_r93131260
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java
 ---
@@ -0,0 +1,851 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.giraph.ooc.policy;
+
+import com.sun.management.GarbageCollectionNotificationInfo;
+import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression;
+import org.apache.giraph.comm.NetworkMetrics;
+import org.apache.giraph.conf.FloatConfOption;
+import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration;
+import org.apache.giraph.conf.LongConfOption;
+import org.apache.giraph.edge.AbstractEdgeStore;
+import org.apache.giraph.ooc.OutOfCoreEngine;
+import org.apache.giraph.ooc.command.IOCommand;
+import org.apache.giraph.ooc.command.LoadPartitionIOCommand;
+import org.apache.giraph.ooc.command.WaitIOCommand;
+import org.apache.giraph.worker.EdgeInputSplitsCallable;
+import org.apache.giraph.worker.VertexInputSplitsCallable;
+import org.apache.giraph.worker.WorkerProgress;
+import org.apache.log4j.Logger;
+
+import java.lang.management.ManagementFactory;
+import java.lang.management.MemoryPoolMXBean;
+import java.lang.management.MemoryUsage;
+import java.util.List;
+import java.util.Map;
+import java.util.Vector;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.locks.Lock;
+import java.util.concurrent.locks.ReentrantLock;
+
+import static com.google.common.base.Preconditions.checkState;
+
+/**
+ * Implementation of {@link OutOfCoreOracle} that uses a linear regression 
model
+ * to estimate actual memory usage based on the current state of 
computation.
+ * The model takes into consideration 5 parameters:
+ *
+ * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5
+ *
+ * y: memory usage
+ * x1: edges loaded
+ * x2: vertices loaded
+ * x3: vertices processed
+ * x4: bytes received due to messages
+ * x5: bytes loaded/stored from/to disk due to OOC.
+ *
+ */
+public class MemoryEstimatorOracle implements OutOfCoreOracle {
+  /** Memory check interval in msec */
+  public static final LongConfOption CHECK_MEMORY_INTERVAL =
+new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000,
+"The interval where memory checker thread wakes up and " +
+"monitors memory footprint (in milliseconds)");
+  /**
+   * If mem-usage is above this threshold and no Full GC has been called,
+   * we call it manually
+   */
+  public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE =
+new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f,
+"The threshold above which GC is called manually if Full GC has 
not " +
+"happened in a while");
+  /** Used to detect a high memory pressure situation */
+  public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION =
+new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f,
+"Minimum percentage of memory we expect to be reclaimed after a 
Full " +
+"GC. If less than this amount is reclaimed, it is sage to say 
" +
+"we are in a high memory situation and the estimation 
mechanism " +
+"has not recognized it yet!");
+  /** If mem-usage is above this threshold, active threads are set to 0 */
+  public static final FloatConfOption AM_HIGH_THRESHOLD =
+new FloatConfOption("giraph.amHighThreshold", 0.95f,
+"If mem-usage is above this threshold, all active threads " +
+"(compute/input) are paused.");
+  /** If m

[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core

2016-12-19 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762554#comment-15762554
 ] 

ASF GitHub Bot commented on GIRAPH-1125:


Github user heslami commented on a diff in the pull request:

https://github.com/apache/giraph/pull/12#discussion_r93137918
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java
 ---
@@ -0,0 +1,851 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.giraph.ooc.policy;
+
+import com.sun.management.GarbageCollectionNotificationInfo;
+import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression;
+import org.apache.giraph.comm.NetworkMetrics;
+import org.apache.giraph.conf.FloatConfOption;
+import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration;
+import org.apache.giraph.conf.LongConfOption;
+import org.apache.giraph.edge.AbstractEdgeStore;
+import org.apache.giraph.ooc.OutOfCoreEngine;
+import org.apache.giraph.ooc.command.IOCommand;
+import org.apache.giraph.ooc.command.LoadPartitionIOCommand;
+import org.apache.giraph.ooc.command.WaitIOCommand;
+import org.apache.giraph.worker.EdgeInputSplitsCallable;
+import org.apache.giraph.worker.VertexInputSplitsCallable;
+import org.apache.giraph.worker.WorkerProgress;
+import org.apache.log4j.Logger;
+
+import java.lang.management.ManagementFactory;
+import java.lang.management.MemoryPoolMXBean;
+import java.lang.management.MemoryUsage;
+import java.util.List;
+import java.util.Map;
+import java.util.Vector;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.locks.Lock;
+import java.util.concurrent.locks.ReentrantLock;
+
+import static com.google.common.base.Preconditions.checkState;
+
+/**
+ * Implementation of {@link OutOfCoreOracle} that uses a linear regression 
model
+ * to estimate actual memory usage based on the current state of 
computation.
+ * The model takes into consideration 5 parameters:
+ *
+ * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5
+ *
+ * y: memory usage
+ * x1: edges loaded
+ * x2: vertices loaded
+ * x3: vertices processed
+ * x4: bytes received due to messages
+ * x5: bytes loaded/stored from/to disk due to OOC.
+ *
+ */
+public class MemoryEstimatorOracle implements OutOfCoreOracle {
+  /** Memory check interval in msec */
+  public static final LongConfOption CHECK_MEMORY_INTERVAL =
+new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000,
+"The interval where memory checker thread wakes up and " +
+"monitors memory footprint (in milliseconds)");
+  /**
+   * If mem-usage is above this threshold and no Full GC has been called,
+   * we call it manually
+   */
+  public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE =
+new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f,
+"The threshold above which GC is called manually if Full GC has 
not " +
+"happened in a while");
+  /** Used to detect a high memory pressure situation */
+  public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION =
+new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f,
+"Minimum percentage of memory we expect to be reclaimed after a 
Full " +
+"GC. If less than this amount is reclaimed, it is sage to say 
" +
+"we are in a high memory situation and the estimation 
mechanism " +
+"has not recognized it yet!");
+  /** If mem-usage is above this threshold, active threads are set to 0 */
+  public static final FloatConfOption AM_HIGH_THRESHOLD =
+new FloatConfOption("giraph.amHighThreshold", 0.95f,
+"If mem-usage is above this threshold, all active threads " +
+"(compute/input) are paused.");
+  /** If m

[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core

2016-12-19 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762637#comment-15762637
 ] 

ASF GitHub Bot commented on GIRAPH-1125:


Github user heslami commented on a diff in the pull request:

https://github.com/apache/giraph/pull/12#discussion_r93143346
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java
 ---
@@ -0,0 +1,851 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.giraph.ooc.policy;
+
+import com.sun.management.GarbageCollectionNotificationInfo;
+import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression;
+import org.apache.giraph.comm.NetworkMetrics;
+import org.apache.giraph.conf.FloatConfOption;
+import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration;
+import org.apache.giraph.conf.LongConfOption;
+import org.apache.giraph.edge.AbstractEdgeStore;
+import org.apache.giraph.ooc.OutOfCoreEngine;
+import org.apache.giraph.ooc.command.IOCommand;
+import org.apache.giraph.ooc.command.LoadPartitionIOCommand;
+import org.apache.giraph.ooc.command.WaitIOCommand;
+import org.apache.giraph.worker.EdgeInputSplitsCallable;
+import org.apache.giraph.worker.VertexInputSplitsCallable;
+import org.apache.giraph.worker.WorkerProgress;
+import org.apache.log4j.Logger;
+
+import java.lang.management.ManagementFactory;
+import java.lang.management.MemoryPoolMXBean;
+import java.lang.management.MemoryUsage;
+import java.util.List;
+import java.util.Map;
+import java.util.Vector;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.locks.Lock;
+import java.util.concurrent.locks.ReentrantLock;
+
+import static com.google.common.base.Preconditions.checkState;
+
+/**
+ * Implementation of {@link OutOfCoreOracle} that uses a linear regression 
model
+ * to estimate actual memory usage based on the current state of 
computation.
+ * The model takes into consideration 5 parameters:
+ *
+ * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5
+ *
+ * y: memory usage
+ * x1: edges loaded
+ * x2: vertices loaded
+ * x3: vertices processed
+ * x4: bytes received due to messages
+ * x5: bytes loaded/stored from/to disk due to OOC.
+ *
+ */
+public class MemoryEstimatorOracle implements OutOfCoreOracle {
+  /** Memory check interval in msec */
+  public static final LongConfOption CHECK_MEMORY_INTERVAL =
+new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000,
+"The interval where memory checker thread wakes up and " +
+"monitors memory footprint (in milliseconds)");
+  /**
+   * If mem-usage is above this threshold and no Full GC has been called,
+   * we call it manually
+   */
+  public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE =
+new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f,
+"The threshold above which GC is called manually if Full GC has 
not " +
+"happened in a while");
+  /** Used to detect a high memory pressure situation */
+  public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION =
+new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f,
+"Minimum percentage of memory we expect to be reclaimed after a 
Full " +
+"GC. If less than this amount is reclaimed, it is sage to say 
" +
+"we are in a high memory situation and the estimation 
mechanism " +
+"has not recognized it yet!");
+  /** If mem-usage is above this threshold, active threads are set to 0 */
+  public static final FloatConfOption AM_HIGH_THRESHOLD =
+new FloatConfOption("giraph.amHighThreshold", 0.95f,
+"If mem-usage is above this threshold, all active threads " +
+"(compute/input) are paused.");
+  /** If m

[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core

2016-12-19 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15762795#comment-15762795
 ] 

ASF GitHub Bot commented on GIRAPH-1125:


Github user heslami commented on a diff in the pull request:

https://github.com/apache/giraph/pull/12#discussion_r93152281
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java
 ---
@@ -0,0 +1,851 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.giraph.ooc.policy;
+
+import com.sun.management.GarbageCollectionNotificationInfo;
+import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression;
+import org.apache.giraph.comm.NetworkMetrics;
+import org.apache.giraph.conf.FloatConfOption;
+import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration;
+import org.apache.giraph.conf.LongConfOption;
+import org.apache.giraph.edge.AbstractEdgeStore;
+import org.apache.giraph.ooc.OutOfCoreEngine;
+import org.apache.giraph.ooc.command.IOCommand;
+import org.apache.giraph.ooc.command.LoadPartitionIOCommand;
+import org.apache.giraph.ooc.command.WaitIOCommand;
+import org.apache.giraph.worker.EdgeInputSplitsCallable;
+import org.apache.giraph.worker.VertexInputSplitsCallable;
+import org.apache.giraph.worker.WorkerProgress;
+import org.apache.log4j.Logger;
+
+import java.lang.management.ManagementFactory;
+import java.lang.management.MemoryPoolMXBean;
+import java.lang.management.MemoryUsage;
+import java.util.List;
+import java.util.Map;
+import java.util.Vector;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.locks.Lock;
+import java.util.concurrent.locks.ReentrantLock;
+
+import static com.google.common.base.Preconditions.checkState;
+
+/**
+ * Implementation of {@link OutOfCoreOracle} that uses a linear regression 
model
+ * to estimate actual memory usage based on the current state of 
computation.
+ * The model takes into consideration 5 parameters:
+ *
+ * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5
+ *
+ * y: memory usage
+ * x1: edges loaded
+ * x2: vertices loaded
+ * x3: vertices processed
+ * x4: bytes received due to messages
+ * x5: bytes loaded/stored from/to disk due to OOC.
+ *
+ */
+public class MemoryEstimatorOracle implements OutOfCoreOracle {
+  /** Memory check interval in msec */
+  public static final LongConfOption CHECK_MEMORY_INTERVAL =
+new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000,
+"The interval where memory checker thread wakes up and " +
+"monitors memory footprint (in milliseconds)");
+  /**
+   * If mem-usage is above this threshold and no Full GC has been called,
+   * we call it manually
+   */
+  public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE =
+new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f,
+"The threshold above which GC is called manually if Full GC has 
not " +
+"happened in a while");
+  /** Used to detect a high memory pressure situation */
+  public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION =
+new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f,
+"Minimum percentage of memory we expect to be reclaimed after a 
Full " +
+"GC. If less than this amount is reclaimed, it is sage to say 
" +
+"we are in a high memory situation and the estimation 
mechanism " +
+"has not recognized it yet!");
+  /** If mem-usage is above this threshold, active threads are set to 0 */
+  public static final FloatConfOption AM_HIGH_THRESHOLD =
+new FloatConfOption("giraph.amHighThreshold", 0.95f,
+"If mem-usage is above this threshold, all active threads " +
+"(compute/input) are paused.");
+  /** If m

[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core

2016-12-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15764777#comment-15764777
 ] 

ASF GitHub Bot commented on GIRAPH-1125:


Github user heslami commented on a diff in the pull request:

https://github.com/apache/giraph/pull/12#discussion_r93288153
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java
 ---
@@ -0,0 +1,851 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.giraph.ooc.policy;
+
+import com.sun.management.GarbageCollectionNotificationInfo;
+import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression;
+import org.apache.giraph.comm.NetworkMetrics;
+import org.apache.giraph.conf.FloatConfOption;
+import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration;
+import org.apache.giraph.conf.LongConfOption;
+import org.apache.giraph.edge.AbstractEdgeStore;
+import org.apache.giraph.ooc.OutOfCoreEngine;
+import org.apache.giraph.ooc.command.IOCommand;
+import org.apache.giraph.ooc.command.LoadPartitionIOCommand;
+import org.apache.giraph.ooc.command.WaitIOCommand;
+import org.apache.giraph.worker.EdgeInputSplitsCallable;
+import org.apache.giraph.worker.VertexInputSplitsCallable;
+import org.apache.giraph.worker.WorkerProgress;
+import org.apache.log4j.Logger;
+
+import java.lang.management.ManagementFactory;
+import java.lang.management.MemoryPoolMXBean;
+import java.lang.management.MemoryUsage;
+import java.util.List;
+import java.util.Map;
+import java.util.Vector;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.locks.Lock;
+import java.util.concurrent.locks.ReentrantLock;
+
+import static com.google.common.base.Preconditions.checkState;
+
+/**
+ * Implementation of {@link OutOfCoreOracle} that uses a linear regression 
model
+ * to estimate actual memory usage based on the current state of 
computation.
+ * The model takes into consideration 5 parameters:
+ *
+ * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5
+ *
+ * y: memory usage
+ * x1: edges loaded
+ * x2: vertices loaded
+ * x3: vertices processed
+ * x4: bytes received due to messages
+ * x5: bytes loaded/stored from/to disk due to OOC.
+ *
+ */
+public class MemoryEstimatorOracle implements OutOfCoreOracle {
+  /** Memory check interval in msec */
+  public static final LongConfOption CHECK_MEMORY_INTERVAL =
+new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000,
+"The interval where memory checker thread wakes up and " +
+"monitors memory footprint (in milliseconds)");
+  /**
+   * If mem-usage is above this threshold and no Full GC has been called,
+   * we call it manually
+   */
+  public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE =
+new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f,
+"The threshold above which GC is called manually if Full GC has 
not " +
+"happened in a while");
+  /** Used to detect a high memory pressure situation */
+  public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION =
+new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f,
+"Minimum percentage of memory we expect to be reclaimed after a 
Full " +
+"GC. If less than this amount is reclaimed, it is sage to say 
" +
+"we are in a high memory situation and the estimation 
mechanism " +
+"has not recognized it yet!");
+  /** If mem-usage is above this threshold, active threads are set to 0 */
+  public static final FloatConfOption AM_HIGH_THRESHOLD =
+new FloatConfOption("giraph.amHighThreshold", 0.95f,
+"If mem-usage is above this threshold, all active threads " +
+"(compute/input) are paused.");
+  /** If m

[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core

2016-12-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15764788#comment-15764788
 ] 

ASF GitHub Bot commented on GIRAPH-1125:


Github user heslami commented on a diff in the pull request:

https://github.com/apache/giraph/pull/12#discussion_r93289032
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/ooc/policy/MemoryEstimatorOracle.java
 ---
@@ -0,0 +1,851 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.giraph.ooc.policy;
+
+import com.sun.management.GarbageCollectionNotificationInfo;
+import org.apache.commons.math.stat.regression.OLSMultipleLinearRegression;
+import org.apache.giraph.comm.NetworkMetrics;
+import org.apache.giraph.conf.FloatConfOption;
+import org.apache.giraph.conf.ImmutableClassesGiraphConfiguration;
+import org.apache.giraph.conf.LongConfOption;
+import org.apache.giraph.edge.AbstractEdgeStore;
+import org.apache.giraph.ooc.OutOfCoreEngine;
+import org.apache.giraph.ooc.command.IOCommand;
+import org.apache.giraph.ooc.command.LoadPartitionIOCommand;
+import org.apache.giraph.ooc.command.WaitIOCommand;
+import org.apache.giraph.worker.EdgeInputSplitsCallable;
+import org.apache.giraph.worker.VertexInputSplitsCallable;
+import org.apache.giraph.worker.WorkerProgress;
+import org.apache.log4j.Logger;
+
+import java.lang.management.ManagementFactory;
+import java.lang.management.MemoryPoolMXBean;
+import java.lang.management.MemoryUsage;
+import java.util.List;
+import java.util.Map;
+import java.util.Vector;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.locks.Lock;
+import java.util.concurrent.locks.ReentrantLock;
+
+import static com.google.common.base.Preconditions.checkState;
+
+/**
+ * Implementation of {@link OutOfCoreOracle} that uses a linear regression 
model
+ * to estimate actual memory usage based on the current state of 
computation.
+ * The model takes into consideration 5 parameters:
+ *
+ * y = c0 + c1*x1 + c2*x2 + c3*x3 + c4*x4 + c5*x5
+ *
+ * y: memory usage
+ * x1: edges loaded
+ * x2: vertices loaded
+ * x3: vertices processed
+ * x4: bytes received due to messages
+ * x5: bytes loaded/stored from/to disk due to OOC.
+ *
+ */
+public class MemoryEstimatorOracle implements OutOfCoreOracle {
+  /** Memory check interval in msec */
+  public static final LongConfOption CHECK_MEMORY_INTERVAL =
+new LongConfOption("giraph.garbageEstimator.checkMemoryInterval", 1000,
+"The interval where memory checker thread wakes up and " +
+"monitors memory footprint (in milliseconds)");
+  /**
+   * If mem-usage is above this threshold and no Full GC has been called,
+   * we call it manually
+   */
+  public static final FloatConfOption MANUAL_GC_MEMORY_PRESSURE =
+new FloatConfOption("giraph.garbageEstimator.manualGCPressure", 0.95f,
+"The threshold above which GC is called manually if Full GC has 
not " +
+"happened in a while");
+  /** Used to detect a high memory pressure situation */
+  public static final FloatConfOption GC_MINIMUM_RECLAIM_FRACTION =
+new FloatConfOption("giraph.garbageEstimator.gcReclaimFraction", 0.05f,
+"Minimum percentage of memory we expect to be reclaimed after a 
Full " +
+"GC. If less than this amount is reclaimed, it is sage to say 
" +
+"we are in a high memory situation and the estimation 
mechanism " +
+"has not recognized it yet!");
+  /** If mem-usage is above this threshold, active threads are set to 0 */
+  public static final FloatConfOption AM_HIGH_THRESHOLD =
+new FloatConfOption("giraph.amHighThreshold", 0.95f,
+"If mem-usage is above this threshold, all active threads " +
+"(compute/input) are paused.");
+  /** If m

[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core

2016-12-22 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15771345#comment-15771345
 ] 

ASF GitHub Bot commented on GIRAPH-1125:


Github user dlogothetis commented on the issue:

https://github.com/apache/giraph/pull/12
  
Tests done for this diff:
-  Snapshots tests (those that were big enough to go out-of-core) both with 
the new and the old mechanism.
- Run 4 production jobs and verified that performance is better than the 
previous mechanism.


> Add memory estimation mechanism to out-of-core
> --
>
> Key: GIRAPH-1125
> URL: https://issues.apache.org/jira/browse/GIRAPH-1125
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Hassan Eslami
>Assignee: Hassan Eslami
>
> The new out-of-core mechanism is designed with the adaptivity goal in mind, 
> meaning that we wanted out-of-core mechanism to kick in only when it is 
> necessary. In other words, when the amount of data (graph, messages, and 
> mutations) all fit in memory, we want to take advantage of the entire memory. 
> And, when in a stage the memory is short, only enough (minimal) amount of 
> data goes out of core (to disk). This ensures a good performance for the 
> out-of-core mechanism.
> To satisfy the adaptiveness goal, we need to know how much memory is used at 
> each point of time. The default out-of-core mechanism (ThresholdBasedOracle) 
> get memory information based on JVM's internal methods (Runtime's 
> freeMemory()). This method is inaccurate (and pessimistic), meaning that it 
> does not account for garbage data that has not been purged by GC. Using JVM's 
> default methods, OOC behaves pessimistically and move data out of core even 
> if it is not necessary. For instance, consider the case where there are a lot 
> of garbage on the heap, but GC has not happened for a while. In this case, 
> the default OOC pushes data on disk and immediately after a major GC it 
> brings back the data to memory. This causes inefficiency in the default out 
> of core mechanism. If out-of-core is used but the data can entirely fit in 
> memory, the job goes out of core even though going out of core is not 
> necessary.
> To address this issue, we need to have a mechanism to more accurately know 
> how much of heap is filled with non-garbage data. Consequently, we need to 
> change the Oracle (OOC policy) to take advantage of a more accurate memory 
> usage estimation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core

2016-12-23 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15773389#comment-15773389
 ] 

ASF GitHub Bot commented on GIRAPH-1125:


Github user asfgit closed the pull request at:

https://github.com/apache/giraph/pull/12


> Add memory estimation mechanism to out-of-core
> --
>
> Key: GIRAPH-1125
> URL: https://issues.apache.org/jira/browse/GIRAPH-1125
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Hassan Eslami
>Assignee: Hassan Eslami
>
> The new out-of-core mechanism is designed with the adaptivity goal in mind, 
> meaning that we wanted out-of-core mechanism to kick in only when it is 
> necessary. In other words, when the amount of data (graph, messages, and 
> mutations) all fit in memory, we want to take advantage of the entire memory. 
> And, when in a stage the memory is short, only enough (minimal) amount of 
> data goes out of core (to disk). This ensures a good performance for the 
> out-of-core mechanism.
> To satisfy the adaptiveness goal, we need to know how much memory is used at 
> each point of time. The default out-of-core mechanism (ThresholdBasedOracle) 
> get memory information based on JVM's internal methods (Runtime's 
> freeMemory()). This method is inaccurate (and pessimistic), meaning that it 
> does not account for garbage data that has not been purged by GC. Using JVM's 
> default methods, OOC behaves pessimistically and move data out of core even 
> if it is not necessary. For instance, consider the case where there are a lot 
> of garbage on the heap, but GC has not happened for a while. In this case, 
> the default OOC pushes data on disk and immediately after a major GC it 
> brings back the data to memory. This causes inefficiency in the default out 
> of core mechanism. If out-of-core is used but the data can entirely fit in 
> memory, the job goes out of core even though going out of core is not 
> necessary.
> To address this issue, we need to have a mechanism to more accurately know 
> how much of heap is filled with non-garbage data. Consequently, we need to 
> change the Oracle (OOC policy) to take advantage of a more accurate memory 
> usage estimation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (GIRAPH-1125) Add memory estimation mechanism to out-of-core

2016-12-27 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15781296#comment-15781296
 ] 

ASF GitHub Bot commented on GIRAPH-1125:


Github user dlogothetis commented on the issue:

https://github.com/apache/giraph/pull/12
  
Somehow the file MemoryEstimatorOracle.java wasn't committed. I pulled the 
most recent changes, I can see the commit log (see below), but this file is 
missing.

commit f5b685efa09b539b1f95925405723f7ac7b1dcea
Author: Hassan Eslami 
Date:   Fri Dec 23 12:03:37 2016 -0600

GIRAPH-1125

Closes #12





> Add memory estimation mechanism to out-of-core
> --
>
> Key: GIRAPH-1125
> URL: https://issues.apache.org/jira/browse/GIRAPH-1125
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Hassan Eslami
>Assignee: Hassan Eslami
>
> The new out-of-core mechanism is designed with the adaptivity goal in mind, 
> meaning that we wanted out-of-core mechanism to kick in only when it is 
> necessary. In other words, when the amount of data (graph, messages, and 
> mutations) all fit in memory, we want to take advantage of the entire memory. 
> And, when in a stage the memory is short, only enough (minimal) amount of 
> data goes out of core (to disk). This ensures a good performance for the 
> out-of-core mechanism.
> To satisfy the adaptiveness goal, we need to know how much memory is used at 
> each point of time. The default out-of-core mechanism (ThresholdBasedOracle) 
> get memory information based on JVM's internal methods (Runtime's 
> freeMemory()). This method is inaccurate (and pessimistic), meaning that it 
> does not account for garbage data that has not been purged by GC. Using JVM's 
> default methods, OOC behaves pessimistically and move data out of core even 
> if it is not necessary. For instance, consider the case where there are a lot 
> of garbage on the heap, but GC has not happened for a while. In this case, 
> the default OOC pushes data on disk and immediately after a major GC it 
> brings back the data to memory. This causes inefficiency in the default out 
> of core mechanism. If out-of-core is used but the data can entirely fit in 
> memory, the job goes out of core even though going out of core is not 
> necessary.
> To address this issue, we need to have a mechanism to more accurately know 
> how much of heap is filled with non-garbage data. Consequently, we need to 
> change the Oracle (OOC policy) to take advantage of a more accurate memory 
> usage estimation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (GIRAPH-1129) SocialHash

2017-01-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822101#comment-15822101
 ] 

ASF GitHub Bot commented on GIRAPH-1129:


GitHub user ikabiljo opened a pull request:

https://github.com/apache/giraph/pull/14

Generate more useful functional interfaces

GIRAPH-1129

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ikabiljo/giraph gen_prim

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/14.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #14


commit 7ba49ecf6e4999aa7fb73d3b97dba8d86eacbc4a
Author: Igor Kabiljo 
Date:   2017-01-13T09:06:53Z

Generate more useful functional interfaces




> SocialHash
> --
>
> Key: GIRAPH-1129
> URL: https://issues.apache.org/jira/browse/GIRAPH-1129
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Igor Kabiljo
>Assignee: Igor Kabiljo
>
> Hypergraph graph partitioner



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (GIRAPH-1129) SocialHash

2017-01-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822147#comment-15822147
 ] 

ASF GitHub Bot commented on GIRAPH-1129:


Github user asfgit closed the pull request at:

https://github.com/apache/giraph/pull/13


> SocialHash
> --
>
> Key: GIRAPH-1129
> URL: https://issues.apache.org/jira/browse/GIRAPH-1129
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Igor Kabiljo
>Assignee: Igor Kabiljo
>
> Hypergraph graph partitioner



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (GIRAPH-1129) SocialHash

2017-01-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822204#comment-15822204
 ] 

ASF GitHub Bot commented on GIRAPH-1129:


Github user asfgit closed the pull request at:

https://github.com/apache/giraph/pull/14


> SocialHash
> --
>
> Key: GIRAPH-1129
> URL: https://issues.apache.org/jira/browse/GIRAPH-1129
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Igor Kabiljo
>Assignee: Igor Kabiljo
>
> Hypergraph graph partitioner



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (GIRAPH-1129) SocialHash

2017-01-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15822498#comment-15822498
 ] 

ASF GitHub Bot commented on GIRAPH-1129:


GitHub user ikabiljo opened a pull request:

https://github.com/apache/giraph/pull/15

Allow Pieces that don't go over vertices, but over "recepients of messages"

GIRAPH-1129

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ikabiljo/giraph no_vtx

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/15.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15


commit ee1f5ee3aeacaac7422c61189b5b3e03f7b31db8
Author: Igor Kabiljo 
Date:   2017-01-13T23:03:34Z

Allow Pieces that don't go over vertices, but over "recepients of messages"




> SocialHash
> --
>
> Key: GIRAPH-1129
> URL: https://issues.apache.org/jira/browse/GIRAPH-1129
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Igor Kabiljo
>Assignee: Igor Kabiljo
>
> Hypergraph graph partitioner



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (GIRAPH-1130) Fix RepeatUntilBlock

2017-01-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832198#comment-15832198
 ] 

ASF GitHub Bot commented on GIRAPH-1130:


GitHub user ikabiljo opened a pull request:

https://github.com/apache/giraph/pull/16

Fix RepeatUntilBlock

https://issues.apache.org/jira/browse/GIRAPH-1130

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ikabiljo/giraph fix_repeat

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/16.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #16


commit adc653fc11644585c4d0d79001c2a7eaebd4cf62
Author: Igor Kabiljo 
Date:   2017-01-20T17:07:03Z

Fix RepeatUntilBlock




> Fix RepeatUntilBlock
> 
>
> Key: GIRAPH-1130
> URL: https://issues.apache.org/jira/browse/GIRAPH-1130
> Project: Giraph
>  Issue Type: Bug
>Reporter: Igor Kabiljo
>Assignee: Igor Kabiljo
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (GIRAPH-1129) SocialHash

2017-01-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832250#comment-15832250
 ] 

ASF GitHub Bot commented on GIRAPH-1129:


Github user asfgit closed the pull request at:

https://github.com/apache/giraph/pull/15


> SocialHash
> --
>
> Key: GIRAPH-1129
> URL: https://issues.apache.org/jira/browse/GIRAPH-1129
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Igor Kabiljo
>Assignee: Igor Kabiljo
>
> Hypergraph graph partitioner



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (GIRAPH-1129) SocialHash

2017-01-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15832718#comment-15832718
 ] 

ASF GitHub Bot commented on GIRAPH-1129:


GitHub user ikabiljo opened a pull request:

https://github.com/apache/giraph/pull/17

Extending generated code and adding ShortWritable

- adding ShortWritable
- adding all combinations for T2TFunction
- making BasicSet/Map being generated - no code changes to implementations, 
just generating byte/short as well

GIRAPH-1129

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ikabiljo/giraph primitives

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/17.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17


commit 1488a03a1ab72a2d42fe93a29c0330f04ea4304c
Author: Igor Kabiljo 
Date:   2017-01-17T19:03:48Z

Extending generated code and adding ShortWritable




> SocialHash
> --
>
> Key: GIRAPH-1129
> URL: https://issues.apache.org/jira/browse/GIRAPH-1129
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Igor Kabiljo
>Assignee: Igor Kabiljo
>
> Hypergraph graph partitioner



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (GIRAPH-1130) Fix RepeatUntilBlock

2017-01-27 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15843372#comment-15843372
 ] 

ASF GitHub Bot commented on GIRAPH-1130:


Github user asfgit closed the pull request at:

https://github.com/apache/giraph/pull/16


> Fix RepeatUntilBlock
> 
>
> Key: GIRAPH-1130
> URL: https://issues.apache.org/jira/browse/GIRAPH-1130
> Project: Giraph
>  Issue Type: Bug
>Reporter: Igor Kabiljo
>Assignee: Igor Kabiljo
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (GIRAPH-1043) Implementation of Darwini graph generator

2017-02-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15850914#comment-15850914
 ] 

ASF GitHub Bot commented on GIRAPH-1043:


GitHub user edunov opened a pull request:

https://github.com/apache/giraph/pull/19

GIRAPH-1043 Implementation of Darwini graph generator



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/edunov/giraph darwini

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/19.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19


commit 08aac32cf8440825921cb1eea935ce5b76c0a784
Author: Sergey Edunov 
Date:   2016-03-10T21:48:47Z

Darwini graph generator




> Implementation of Darwini graph generator
> -
>
> Key: GIRAPH-1043
> URL: https://issues.apache.org/jira/browse/GIRAPH-1043
> Project: Giraph
>  Issue Type: Task
>Reporter: Sergey Edunov
>Assignee: Sergey Edunov
>
> Implementation of graph generator that is able to capture many properties of 
> social graphs, such as high local clustering coefficient, non-power law 
> degree distributions and log normal joint degree distribution. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1132) Giraph jobs don't end if zookeeper dies before job starts

2017-03-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891109#comment-15891109
 ] 

ASF GitHub Bot commented on GIRAPH-1132:


GitHub user edunov opened a pull request:

https://github.com/apache/giraph/pull/21

GIRAPH-1132 Giraph jobs don't end if zookeeper dies before job starts

I'm not sure I set all the timeouts right. There is no way to test all of 
these either. 
The idea is that we shouldn't have infinite wait loops anywhere. And that's 
exactly what this diff does

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/edunov/giraph timeout

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/21.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21


commit cdbe7d4a46d80611fb5846eeeab37b94e66781a1
Author: Sergey Edunov 
Date:   2017-03-01T21:41:37Z

GIRAPH-1132 Giraph jobs don't end if zookeeper dies before job starts




> Giraph jobs don't end if zookeeper dies before job starts
> -
>
> Key: GIRAPH-1132
> URL: https://issues.apache.org/jira/browse/GIRAPH-1132
> Project: Giraph
>  Issue Type: Bug
>Reporter: Sergey Edunov
>
> There are multiple places in the Giraph code where we waitForever() on some 
> event (e.g. all workers to finish or zookeeper to come up). This is in 
> general bad, as any issue on other side may become undetected and make job 
> run forever. We need to introduce timeout to these waits



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1132) Giraph jobs don't end if zookeeper dies before job starts

2017-03-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891144#comment-15891144
 ] 

ASF GitHub Bot commented on GIRAPH-1132:


Github user majakabiljo commented on a diff in the pull request:

https://github.com/apache/giraph/pull/21#discussion_r103799656
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/conf/GiraphConstants.java ---
@@ -1238,5 +1239,27 @@
   BooleanConfOption PREFER_IP_ADDRESSES =
   new BooleanConfOption("giraph.preferIP", false,
   "Prefer IP addresses instead of host names");
+
+  /**
+   * Timeout for "waitForever", when we need to wait for zookeeper.
+   * Since we should never really have to wait forever.
+   * We should only wait some reasonable but large amount of time.
+   */
+  LongConfOption WAIT_FOREVER_ZOOKEEPER_TIMEOUT_MSEC =
--- End diff --

Nit: since we are not waiting forever anymore, I'd drop word forever from 
everywhere (forever and timeout have opposite meaning :-))


> Giraph jobs don't end if zookeeper dies before job starts
> -
>
> Key: GIRAPH-1132
> URL: https://issues.apache.org/jira/browse/GIRAPH-1132
> Project: Giraph
>  Issue Type: Bug
>Reporter: Sergey Edunov
>
> There are multiple places in the Giraph code where we waitForever() on some 
> event (e.g. all workers to finish or zookeeper to come up). This is in 
> general bad, as any issue on other side may become undetected and make job 
> run forever. We need to introduce timeout to these waits



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1132) Giraph jobs don't end if zookeeper dies before job starts

2017-03-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15891157#comment-15891157
 ] 

ASF GitHub Bot commented on GIRAPH-1132:


Github user asfgit closed the pull request at:

https://github.com/apache/giraph/pull/21


> Giraph jobs don't end if zookeeper dies before job starts
> -
>
> Key: GIRAPH-1132
> URL: https://issues.apache.org/jira/browse/GIRAPH-1132
> Project: Giraph
>  Issue Type: Bug
>Reporter: Sergey Edunov
>
> There are multiple places in the Giraph code where we waitForever() on some 
> event (e.g. all workers to finish or zookeeper to come up). This is in 
> general bad, as any issue on other side may become undetected and make job 
> run forever. We need to introduce timeout to these waits



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1133) Fix JobProgressTracker in OverrideExceptionHandler

2017-03-03 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15894934#comment-15894934
 ] 

ASF GitHub Bot commented on GIRAPH-1133:


GitHub user majakabiljo opened a pull request:

https://github.com/apache/giraph/pull/22

GIRAPH-1133: Fix JobProgressTracker in OverrideExceptionHandler

Summary: We create OverrideExceptionHandler before JobProgressTracker, so 
it can't report errors to command line.

Test Plan: Ran a job with exception caught by OverrideExceptionHandler 
before and after the change

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majakabiljo/giraph jobProgress

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/22.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22


commit 47c11d65b5f740f374053e9b842e498e7802e919
Author: Maja Kabiljo 
Date:   2017-03-03T19:40:19Z

GIRAPH-1133: Fix JobProgressTracker in OverrideExceptionHandler

Summary: We create OverrideExceptionHandler before JobProgressTracker, so 
it can't report errors to command line.

Test Plan: Ran a job with exception caught by OverrideExceptionHandler 
before and after the change




> Fix JobProgressTracker in OverrideExceptionHandler
> --
>
> Key: GIRAPH-1133
> URL: https://issues.apache.org/jira/browse/GIRAPH-1133
> Project: Giraph
>  Issue Type: Bug
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> We create OverrideExceptionHandler before JobProgressTracker, so it can't 
> report errors to command line.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1134) Track number of input splits in command line

2017-03-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899815#comment-15899815
 ] 

ASF GitHub Bot commented on GIRAPH-1134:


GitHub user majakabiljo opened a pull request:

https://github.com/apache/giraph/pull/24

GIRAPH-1134: Track number of input splits in command line

Summary: The progress we track during input reports how much data have we 
read, but not how much data there is to read.

Test Plan: Now it prints something like:
 Loading data: 70046603 vertices loaded, 658 vertex input splits loaded 
(out of 659); 7122062703 edges loaded, 4 edge input splits loaded (out of 1322)

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majakabiljo/giraph numSplits

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/24.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #24


commit 091dc7bcb819139f517f37c67f7dc0748965a1b8
Author: Maja Kabiljo 
Date:   2017-03-07T17:32:13Z

GIRAPH-1134: Track number of input splits in command line

Summary: The progress we track during input reports how much data have we 
read, but not how much data there is to read.

Test Plan: Now it prints something like:
 Loading data: 70046603 vertices loaded, 658 vertex input splits loaded 
(out of 659); 7122062703 edges loaded, 4 edge input splits loaded (out of 1322)




> Track number of input splits in command line
> 
>
> Key: GIRAPH-1134
> URL: https://issues.apache.org/jira/browse/GIRAPH-1134
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> The progress we track during input reports how much data have we read, but 
> not how much data there is to read.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1133) Fix JobProgressTracker in OverrideExceptionHandler

2017-03-07 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900132#comment-15900132
 ] 

ASF GitHub Bot commented on GIRAPH-1133:


Github user asfgit closed the pull request at:

https://github.com/apache/giraph/pull/22


> Fix JobProgressTracker in OverrideExceptionHandler
> --
>
> Key: GIRAPH-1133
> URL: https://issues.apache.org/jira/browse/GIRAPH-1133
> Project: Giraph
>  Issue Type: Bug
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> We create OverrideExceptionHandler before JobProgressTracker, so it can't 
> report errors to command line.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1043) Implementation of Darwini graph generator

2017-03-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15924190#comment-15924190
 ] 

ASF GitHub Bot commented on GIRAPH-1043:


Github user cheonruen commented on the issue:

https://github.com/apache/giraph/pull/19
  
Do you have any tutorial for this?


> Implementation of Darwini graph generator
> -
>
> Key: GIRAPH-1043
> URL: https://issues.apache.org/jira/browse/GIRAPH-1043
> Project: Giraph
>  Issue Type: Task
>Reporter: Sergey Edunov
>Assignee: Sergey Edunov
>
> Implementation of graph generator that is able to capture many properties of 
> social graphs, such as high local clustering coefficient, non-power law 
> degree distributions and log normal joint degree distribution. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1137) Remove channel probing from Netty worker thread for credit-based flow-control

2017-03-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15928741#comment-15928741
 ] 

ASF GitHub Bot commented on GIRAPH-1137:


GitHub user heslami opened a pull request:

https://github.com/apache/giraph/pull/26

[GIRAPH-1137] Remove channel probing from Netty worker thread for 
credit-based flow…

In credit-based flow-control, sometimes, client threads (one type of Netty 
worker threads used in Giraph) try to send requests to other workers. This is 
bad practice for Netty and can cause Netty to mark the execution as 
deadlock-prone (an example exception shown below). Client threads should only 
be responsible for sending ACK/NACK messages in response to requests, and they 
should do so by reuseing the channel from which they received the request. In 
the current implementation, client threads may try to send unsent/cached 
requests in credit-based flow control. Sending such requests should be 
delegated to other threads.

WARN 2017-03-08 06:06:22,104 [netty-client-worker-3] 
io.netty.util.concurrent.BlockingOperationException: 
DefaultChannelPromise@2c455378(incomplete)
at 
io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:383)
at 
io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:157)
at io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:343)
at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:259)
at 
org.apache.giraph.utils.ProgressableUtils$ChannelFutureWaitable.waitFor(ProgressableUtils.java:461)
at 
org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:214)
at 
org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:180)
at 
org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:165)
at 
org.apache.giraph.utils.ProgressableUtils.awaitChannelFuture(ProgressableUtils.java:132)
at 
org.apache.giraph.comm.netty.NettyClient.getNextChannel(NettyClient.java:715)
at 
org.apache.giraph.comm.netty.NettyClient.writeRequestToChannel(NettyClient.java:799)
at org.apache.giraph.comm.netty.NettyClient.doSend(NettyClient.java:789)
at 
org.apache.giraph.comm.flow_control.CreditBasedFlowControl.trySendCachedRequests(CreditBasedFlowControl.java:515)
at 
org.apache.giraph.comm.flow_control.CreditBasedFlowControl.messageAckReceived(CreditBasedFlowControl.java:485)
at 
org.apache.giraph.comm.netty.NettyClient.messageReceived(NettyClient.java:840)
at 
org.apache.giraph.comm.netty.handler.ResponseClientHandler.channelRead(ResponseClientHandler.java:87)
at 
io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
at 
io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153)
at 
io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
at 
io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
at 
org.apache.giraph.comm.netty.InboundByteCounter.channelRead(InboundByteCounter.java:89)
at 
io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
at 
io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785)
at 
io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:126)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
at 
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
at 
io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
at java.lang.Thread.run(Thread.java:745)


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/heslami/giraph fix-credit-based

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/26.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #26


commit 4c8186cc8097877d5af20ef054630a629caaa026
Author: Hassan Eslami 
Date:   2017-03-16T19:52:12Z

Remove channel probing from Netty worker thread for credit-based 
flow-control

Closes GIRAPH-1137




> Remove channel probing from Netty worker thread for credit-based flow-control
> -
>
>

[jira] [Commented] (GIRAPH-1134) Track number of input splits in command line

2017-03-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930356#comment-15930356
 ] 

ASF GitHub Bot commented on GIRAPH-1134:


Github user asfgit closed the pull request at:

https://github.com/apache/giraph/pull/24


> Track number of input splits in command line
> 
>
> Key: GIRAPH-1134
> URL: https://issues.apache.org/jira/browse/GIRAPH-1134
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> The progress we track during input reports how much data have we read, but 
> not how much data there is to read.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1137) Remove channel probing from Netty worker thread for credit-based flow-control

2017-03-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15933077#comment-15933077
 ] 

ASF GitHub Bot commented on GIRAPH-1137:


Github user dlogothetis commented on the issue:

https://github.com/apache/giraph/pull/26
  
There are a couple of checkstyle errors:


giraph/giraph-core/target/munged/main/org/apache/giraph/comm/flow_control/CreditBasedFlowControl.java:48:8:
 Unused import - java.util.concurrent.LinkedBlockingQueue.

giraph/giraph-core/target/munged/main/org/apache/giraph/comm/flow_control/CreditBasedFlowControl.java:172:3:
 Missing a Javadoc comment.




> Remove channel probing from Netty worker thread for credit-based flow-control
> -
>
> Key: GIRAPH-1137
> URL: https://issues.apache.org/jira/browse/GIRAPH-1137
> Project: Giraph
>  Issue Type: Bug
>Reporter: Hassan Eslami
>Assignee: Hassan Eslami
>
> In credit-based flow-control, sometimes, client threads (one type of Netty 
> worker threads used in Giraph) try to send requests to other workers. This is 
> bad practice for Netty and can cause Netty to mark the execution as 
> deadlock-prone (an example exception shown below). Client threads should only 
> be responsible for sending ACK/NACK messages in response to requests, and 
> they should do so by reuseing the channel from which they received the 
> request. In the current implementation, client threads may try to send 
> unsent/cached requests in credit-based flow control. Sending such requests 
> should be delegated to other threads.
> WARN 2017-03-08 06:06:22,104 [netty-client-worker-3] 
> io.netty.util.concurrent.BlockingOperationException: 
> DefaultChannelPromise@2c455378(incomplete)
> at 
> io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:383)
> at 
> io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:157)
> at io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:343)
> at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:259)
> at 
> org.apache.giraph.utils.ProgressableUtils$ChannelFutureWaitable.waitFor(ProgressableUtils.java:461)
> at 
> org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:214)
> at 
> org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:180)
> at 
> org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:165)
> at 
> org.apache.giraph.utils.ProgressableUtils.awaitChannelFuture(ProgressableUtils.java:132)
> at 
> org.apache.giraph.comm.netty.NettyClient.getNextChannel(NettyClient.java:715)
> at 
> org.apache.giraph.comm.netty.NettyClient.writeRequestToChannel(NettyClient.java:799)
> at org.apache.giraph.comm.netty.NettyClient.doSend(NettyClient.java:789)
> at 
> org.apache.giraph.comm.flow_control.CreditBasedFlowControl.trySendCachedRequests(CreditBasedFlowControl.java:515)
> at 
> org.apache.giraph.comm.flow_control.CreditBasedFlowControl.messageAckReceived(CreditBasedFlowControl.java:485)
> at 
> org.apache.giraph.comm.netty.NettyClient.messageReceived(NettyClient.java:840)
> at 
> org.apache.giraph.comm.netty.handler.ResponseClientHandler.channelRead(ResponseClientHandler.java:87)
> at 
> io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
> at 
> io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153)
> at 
> io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
> at 
> io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
> at 
> org.apache.giraph.comm.netty.InboundByteCounter.channelRead(InboundByteCounter.java:89)
> at 
> io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
> at 
> io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:126)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
> at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1137) Remove channel probing from Netty worker thread for credit-based flow-control

2017-03-20 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15933911#comment-15933911
 ] 

ASF GitHub Bot commented on GIRAPH-1137:


Github user dlogothetis commented on the issue:

https://github.com/apache/giraph/pull/26
  
Tests done:
- Snapshot tests.
- Performance on large PageRank benchmark remains the same.
- Performance on internal prod job remains the same.


> Remove channel probing from Netty worker thread for credit-based flow-control
> -
>
> Key: GIRAPH-1137
> URL: https://issues.apache.org/jira/browse/GIRAPH-1137
> Project: Giraph
>  Issue Type: Bug
>Reporter: Hassan Eslami
>Assignee: Hassan Eslami
>
> In credit-based flow-control, sometimes, client threads (one type of Netty 
> worker threads used in Giraph) try to send requests to other workers. This is 
> bad practice for Netty and can cause Netty to mark the execution as 
> deadlock-prone (an example exception shown below). Client threads should only 
> be responsible for sending ACK/NACK messages in response to requests, and 
> they should do so by reuseing the channel from which they received the 
> request. In the current implementation, client threads may try to send 
> unsent/cached requests in credit-based flow control. Sending such requests 
> should be delegated to other threads.
> WARN 2017-03-08 06:06:22,104 [netty-client-worker-3] 
> io.netty.util.concurrent.BlockingOperationException: 
> DefaultChannelPromise@2c455378(incomplete)
> at 
> io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:383)
> at 
> io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:157)
> at io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:343)
> at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:259)
> at 
> org.apache.giraph.utils.ProgressableUtils$ChannelFutureWaitable.waitFor(ProgressableUtils.java:461)
> at 
> org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:214)
> at 
> org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:180)
> at 
> org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:165)
> at 
> org.apache.giraph.utils.ProgressableUtils.awaitChannelFuture(ProgressableUtils.java:132)
> at 
> org.apache.giraph.comm.netty.NettyClient.getNextChannel(NettyClient.java:715)
> at 
> org.apache.giraph.comm.netty.NettyClient.writeRequestToChannel(NettyClient.java:799)
> at org.apache.giraph.comm.netty.NettyClient.doSend(NettyClient.java:789)
> at 
> org.apache.giraph.comm.flow_control.CreditBasedFlowControl.trySendCachedRequests(CreditBasedFlowControl.java:515)
> at 
> org.apache.giraph.comm.flow_control.CreditBasedFlowControl.messageAckReceived(CreditBasedFlowControl.java:485)
> at 
> org.apache.giraph.comm.netty.NettyClient.messageReceived(NettyClient.java:840)
> at 
> org.apache.giraph.comm.netty.handler.ResponseClientHandler.channelRead(ResponseClientHandler.java:87)
> at 
> io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
> at 
> io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153)
> at 
> io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
> at 
> io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
> at 
> org.apache.giraph.comm.netty.InboundByteCounter.channelRead(InboundByteCounter.java:89)
> at 
> io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
> at 
> io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:126)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
> at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1137) Remove channel probing from Netty worker thread for credit-based flow-control

2017-03-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15934801#comment-15934801
 ] 

ASF GitHub Bot commented on GIRAPH-1137:


Github user dlogothetis commented on the issue:

https://github.com/apache/giraph/pull/26
  
Error was reproduced by decreasing the netty client threads.


> Remove channel probing from Netty worker thread for credit-based flow-control
> -
>
> Key: GIRAPH-1137
> URL: https://issues.apache.org/jira/browse/GIRAPH-1137
> Project: Giraph
>  Issue Type: Bug
>Reporter: Hassan Eslami
>Assignee: Hassan Eslami
>
> In credit-based flow-control, sometimes, client threads (one type of Netty 
> worker threads used in Giraph) try to send requests to other workers. This is 
> bad practice for Netty and can cause Netty to mark the execution as 
> deadlock-prone (an example exception shown below). Client threads should only 
> be responsible for sending ACK/NACK messages in response to requests, and 
> they should do so by reuseing the channel from which they received the 
> request. In the current implementation, client threads may try to send 
> unsent/cached requests in credit-based flow control. Sending such requests 
> should be delegated to other threads.
> WARN 2017-03-08 06:06:22,104 [netty-client-worker-3] 
> io.netty.util.concurrent.BlockingOperationException: 
> DefaultChannelPromise@2c455378(incomplete)
> at 
> io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:383)
> at 
> io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:157)
> at io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:343)
> at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:259)
> at 
> org.apache.giraph.utils.ProgressableUtils$ChannelFutureWaitable.waitFor(ProgressableUtils.java:461)
> at 
> org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:214)
> at 
> org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:180)
> at 
> org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:165)
> at 
> org.apache.giraph.utils.ProgressableUtils.awaitChannelFuture(ProgressableUtils.java:132)
> at 
> org.apache.giraph.comm.netty.NettyClient.getNextChannel(NettyClient.java:715)
> at 
> org.apache.giraph.comm.netty.NettyClient.writeRequestToChannel(NettyClient.java:799)
> at org.apache.giraph.comm.netty.NettyClient.doSend(NettyClient.java:789)
> at 
> org.apache.giraph.comm.flow_control.CreditBasedFlowControl.trySendCachedRequests(CreditBasedFlowControl.java:515)
> at 
> org.apache.giraph.comm.flow_control.CreditBasedFlowControl.messageAckReceived(CreditBasedFlowControl.java:485)
> at 
> org.apache.giraph.comm.netty.NettyClient.messageReceived(NettyClient.java:840)
> at 
> org.apache.giraph.comm.netty.handler.ResponseClientHandler.channelRead(ResponseClientHandler.java:87)
> at 
> io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
> at 
> io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153)
> at 
> io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
> at 
> io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
> at 
> org.apache.giraph.comm.netty.InboundByteCounter.channelRead(InboundByteCounter.java:89)
> at 
> io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
> at 
> io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:126)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
> at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1137) Remove channel probing from Netty worker thread for credit-based flow-control

2017-03-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15934816#comment-15934816
 ] 

ASF GitHub Bot commented on GIRAPH-1137:


Github user majakabiljo commented on a diff in the pull request:

https://github.com/apache/giraph/pull/26#discussion_r107200544
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/comm/flow_control/CreditBasedFlowControl.java
 ---
@@ -215,10 +231,38 @@ public void run() {
 }
   }
 });
-thread.setUncaughtExceptionHandler(exceptionHandler);
-thread.setName("resume-sender");
-thread.setDaemon(true);
-thread.start();
+resumeHandlerThread.setUncaughtExceptionHandler(exceptionHandler);
+resumeHandlerThread.setName("resume-sender");
+resumeHandlerThread.setDaemon(true);
+resumeHandlerThread.start();
+
+// Thread to handle/send cached requests
+Thread cachedRequestHandlerThread = new Thread(new Runnable() {
+  @Override
+  public void run() {
+while (true) {
+  Pair pair = null;
+  try {
+pair = toBeSent.take();
+  } catch (InterruptedException e) {
+throw new IllegalStateException("run: failed while waiting to 
" +
+"take an element from the request queue!", e);
+  }
+  int taskId = pair.getLeft();
+  WritableRequest request = pair.getRight();
+  nettyClient.doSend(taskId, request);
+  if (aggregateUnsentRequests.decrementAndGet() == 0) {
+synchronized (aggregateUnsentRequests) {
+  aggregateUnsentRequests.notifyAll();
+}
+  }
+}
+  }
+});
+
cachedRequestHandlerThread.setUncaughtExceptionHandler(exceptionHandler);
+cachedRequestHandlerThread.setName("cached-req-sender");
+cachedRequestHandlerThread.setDaemon(true);
+cachedRequestHandlerThread.start();
--- End diff --

You can create a utility method like ThreadUtils.startThread with exception 
handler.


> Remove channel probing from Netty worker thread for credit-based flow-control
> -
>
> Key: GIRAPH-1137
> URL: https://issues.apache.org/jira/browse/GIRAPH-1137
> Project: Giraph
>  Issue Type: Bug
>Reporter: Hassan Eslami
>Assignee: Hassan Eslami
>
> In credit-based flow-control, sometimes, client threads (one type of Netty 
> worker threads used in Giraph) try to send requests to other workers. This is 
> bad practice for Netty and can cause Netty to mark the execution as 
> deadlock-prone (an example exception shown below). Client threads should only 
> be responsible for sending ACK/NACK messages in response to requests, and 
> they should do so by reuseing the channel from which they received the 
> request. In the current implementation, client threads may try to send 
> unsent/cached requests in credit-based flow control. Sending such requests 
> should be delegated to other threads.
> WARN 2017-03-08 06:06:22,104 [netty-client-worker-3] 
> io.netty.util.concurrent.BlockingOperationException: 
> DefaultChannelPromise@2c455378(incomplete)
> at 
> io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:383)
> at 
> io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:157)
> at io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:343)
> at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:259)
> at 
> org.apache.giraph.utils.ProgressableUtils$ChannelFutureWaitable.waitFor(ProgressableUtils.java:461)
> at 
> org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:214)
> at 
> org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:180)
> at 
> org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:165)
> at 
> org.apache.giraph.utils.ProgressableUtils.awaitChannelFuture(ProgressableUtils.java:132)
> at 
> org.apache.giraph.comm.netty.NettyClient.getNextChannel(NettyClient.java:715)
> at 
> org.apache.giraph.comm.netty.NettyClient.writeRequestToChannel(NettyClient.java:799)
> at org.apache.giraph.comm.netty.NettyClient.doSend(NettyClient.java:789)
> at 
> org.apache.giraph.comm.flow_control.CreditBasedFlowControl.trySendCachedRequests(CreditBasedFlowControl.java:515)
> at 
> org.apache.giraph.comm.flow_control.CreditBasedFlowControl.messageAckReceived(CreditBasedFlowControl.java:485)
> at 
> org.apache.giraph.comm.netty.NettyClient.messageReceived(NettyClient.java:840)
> at 
> org.apache.giraph.comm.netty.handler.ResponseClientHandler.channelRead(ResponseClientHandler.java:87)
> at 
> io.ne

[jira] [Commented] (GIRAPH-1138) Don't wrap exceptions from executor service

2017-03-22 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15936985#comment-15936985
 ] 

ASF GitHub Bot commented on GIRAPH-1138:


GitHub user majakabiljo opened a pull request:

https://github.com/apache/giraph/pull/27

[GIRAPH-1138] Don't wrap exceptions from executor service

Summary: In ProgressableUtils.getResultsWithNCallables we wrap exceptions 
from underlying threads, making logs hard to read. We should re-throw original 
exception when possible.

Test Plan: Ran a job which fails in one of input threads before and after 
change, verified exception is clear now

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majakabiljo/giraph exceptions

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/27.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #27


commit a460f582d9b8aa875cb32ca4f8dae1c320e9ac23
Author: Maja Kabiljo 
Date:   2017-03-22T19:39:19Z

[GIRAPH-1138] Don't wrap exceptions from executor service

Summary: In ProgressableUtils.getResultsWithNCallables we wrap exceptions 
from underlying threads, making logs hard to read. We should re-throw original 
exception when possible.

Test Plan: Ran a job which fails in one of input threads before and after 
change, verified exception is clear now




> Don't wrap exceptions from executor service
> ---
>
> Key: GIRAPH-1138
> URL: https://issues.apache.org/jira/browse/GIRAPH-1138
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> In ProgressableUtils.getResultsWithNCallables we wrap exceptions from 
> underlying threads, making logs hard to read. We should re-throw original 
> exception when possible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1137) Remove channel probing from Netty worker thread for credit-based flow-control

2017-03-27 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15943777#comment-15943777
 ] 

ASF GitHub Bot commented on GIRAPH-1137:


Github user asfgit closed the pull request at:

https://github.com/apache/giraph/pull/26


> Remove channel probing from Netty worker thread for credit-based flow-control
> -
>
> Key: GIRAPH-1137
> URL: https://issues.apache.org/jira/browse/GIRAPH-1137
> Project: Giraph
>  Issue Type: Bug
>Reporter: Hassan Eslami
>Assignee: Hassan Eslami
>
> In credit-based flow-control, sometimes, client threads (one type of Netty 
> worker threads used in Giraph) try to send requests to other workers. This is 
> bad practice for Netty and can cause Netty to mark the execution as 
> deadlock-prone (an example exception shown below). Client threads should only 
> be responsible for sending ACK/NACK messages in response to requests, and 
> they should do so by reuseing the channel from which they received the 
> request. In the current implementation, client threads may try to send 
> unsent/cached requests in credit-based flow control. Sending such requests 
> should be delegated to other threads.
> WARN 2017-03-08 06:06:22,104 [netty-client-worker-3] 
> io.netty.util.concurrent.BlockingOperationException: 
> DefaultChannelPromise@2c455378(incomplete)
> at 
> io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:383)
> at 
> io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:157)
> at io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:343)
> at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:259)
> at 
> org.apache.giraph.utils.ProgressableUtils$ChannelFutureWaitable.waitFor(ProgressableUtils.java:461)
> at 
> org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:214)
> at 
> org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:180)
> at 
> org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:165)
> at 
> org.apache.giraph.utils.ProgressableUtils.awaitChannelFuture(ProgressableUtils.java:132)
> at 
> org.apache.giraph.comm.netty.NettyClient.getNextChannel(NettyClient.java:715)
> at 
> org.apache.giraph.comm.netty.NettyClient.writeRequestToChannel(NettyClient.java:799)
> at org.apache.giraph.comm.netty.NettyClient.doSend(NettyClient.java:789)
> at 
> org.apache.giraph.comm.flow_control.CreditBasedFlowControl.trySendCachedRequests(CreditBasedFlowControl.java:515)
> at 
> org.apache.giraph.comm.flow_control.CreditBasedFlowControl.messageAckReceived(CreditBasedFlowControl.java:485)
> at 
> org.apache.giraph.comm.netty.NettyClient.messageReceived(NettyClient.java:840)
> at 
> org.apache.giraph.comm.netty.handler.ResponseClientHandler.channelRead(ResponseClientHandler.java:87)
> at 
> io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
> at 
> io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153)
> at 
> io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
> at 
> io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
> at 
> org.apache.giraph.comm.netty.InboundByteCounter.channelRead(InboundByteCounter.java:89)
> at 
> io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
> at 
> io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:126)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
> at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-03-27 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15943980#comment-15943980
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


GitHub user neggert opened a pull request:

https://github.com/apache/giraph/pull/30

[GIRAPH-1139] Fix resuming from checkpoint

A couple of fixes that get resuming from checkpoint working.

* Set checkpointStatus to NONE in master when restarting from checkpoint.

Workers already do this, so the job hangs when restarting from checkpoint
while the master waits for workers to create checkpoints they're never
going to create.

* Set unique task id for each worker attempt

Previously, a worker would reuse the task id from the prior attempt. This
gets propagated to the Netty client id, which makes the master think it has
already processed any requests that come from that client, causing it to
discard them. This obviously causes problems.

And also a fix for GIRAPH-1136. We will now checkpoint on superstep 0 if 
checkpointing is enabled. Let me know if you'd rather I sent a separate PR for 
this.

Testing:
Ran custom Label Propagation implementation with checkpointing on a ~5b 
node graph. Manually killed workers (by logging in to worker node and running 
`kill -9 `. Ensured that Giraph successfully resumed from most recent 
checkpoint.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/neggert/giraph trunk_resume_fixes

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/30.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #30


commit 6462d48e46d84cd6aa5ecd5817b0d057ce3a6c1f
Author: NicEggert 
Date:   2017-03-23T20:23:03Z

Set checkpointStatus to NONE in master when restarting from checkpoint.

Workers already do this, so the job hangs when restarting from checkpoint
while the master waits for workers to create checkpoints they're never
going to create.

commit 3ed8c18a3bc97c910e364bf7d48d50be25df704c
Author: NicEggert 
Date:   2017-03-23T20:26:02Z

Checkpoint on superstep 0 if checkpointing is enabled

commit 74bba4573dbb77242d81352f84969b114db1cb71
Author: NicEggert 
Date:   2017-03-23T20:26:47Z

Set unique task id for each worker attempt

Previously, a worker would reuse the task id from the prior attempt. This
gets propagated to the Netty client id, which makes the master think it has
already processed any requests that come from that client, causing it to
discard them. This obviously causes problems.




> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1137) Remove channel probing from Netty worker thread for credit-based flow-control

2017-03-29 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1137?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15947619#comment-15947619
 ] 

ASF GitHub Bot commented on GIRAPH-1137:


Github user dlogothetis commented on the issue:

https://github.com/apache/giraph/pull/26
  
@heslami, findbugs reports this:

[INFO] [INFO] Synchronization performed on 
java.util.concurrent.atomic.AtomicInteger in 
org.apache.giraph.comm.flow_control.CreditBasedFlowControl$2.run() 
["org.apache.giraph.comm.flow_control.CreditBasedFlowControl$2"] At 
CreditBasedFlowControl.java:[lines 237-256]


> Remove channel probing from Netty worker thread for credit-based flow-control
> -
>
> Key: GIRAPH-1137
> URL: https://issues.apache.org/jira/browse/GIRAPH-1137
> Project: Giraph
>  Issue Type: Bug
>Reporter: Hassan Eslami
>Assignee: Hassan Eslami
>
> In credit-based flow-control, sometimes, client threads (one type of Netty 
> worker threads used in Giraph) try to send requests to other workers. This is 
> bad practice for Netty and can cause Netty to mark the execution as 
> deadlock-prone (an example exception shown below). Client threads should only 
> be responsible for sending ACK/NACK messages in response to requests, and 
> they should do so by reuseing the channel from which they received the 
> request. In the current implementation, client threads may try to send 
> unsent/cached requests in credit-based flow control. Sending such requests 
> should be delegated to other threads.
> WARN 2017-03-08 06:06:22,104 [netty-client-worker-3] 
> io.netty.util.concurrent.BlockingOperationException: 
> DefaultChannelPromise@2c455378(incomplete)
> at 
> io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:383)
> at 
> io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:157)
> at io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:343)
> at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:259)
> at 
> org.apache.giraph.utils.ProgressableUtils$ChannelFutureWaitable.waitFor(ProgressableUtils.java:461)
> at 
> org.apache.giraph.utils.ProgressableUtils.waitFor(ProgressableUtils.java:214)
> at 
> org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:180)
> at 
> org.apache.giraph.utils.ProgressableUtils.waitForever(ProgressableUtils.java:165)
> at 
> org.apache.giraph.utils.ProgressableUtils.awaitChannelFuture(ProgressableUtils.java:132)
> at 
> org.apache.giraph.comm.netty.NettyClient.getNextChannel(NettyClient.java:715)
> at 
> org.apache.giraph.comm.netty.NettyClient.writeRequestToChannel(NettyClient.java:799)
> at org.apache.giraph.comm.netty.NettyClient.doSend(NettyClient.java:789)
> at 
> org.apache.giraph.comm.flow_control.CreditBasedFlowControl.trySendCachedRequests(CreditBasedFlowControl.java:515)
> at 
> org.apache.giraph.comm.flow_control.CreditBasedFlowControl.messageAckReceived(CreditBasedFlowControl.java:485)
> at 
> org.apache.giraph.comm.netty.NettyClient.messageReceived(NettyClient.java:840)
> at 
> org.apache.giraph.comm.netty.handler.ResponseClientHandler.channelRead(ResponseClientHandler.java:87)
> at 
> io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
> at 
> io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
> at 
> io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:153)
> at 
> io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
> at 
> io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
> at 
> org.apache.giraph.comm.netty.InboundByteCounter.channelRead(InboundByteCounter.java:89)
> at 
> io.netty.channel.DefaultChannelHandlerContext.invokeChannelRead(DefaultChannelHandlerContext.java:338)
> at 
> io.netty.channel.DefaultChannelHandlerContext.fireChannelRead(DefaultChannelHandlerContext.java:324)
> at 
> io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:785)
> at 
> io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:126)
> at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:485)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:452)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:346)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:101)
> at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1140) Cleanup temp files in hdfs after job is done

2017-03-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949473#comment-15949473
 ] 

ASF GitHub Bot commented on GIRAPH-1140:


GitHub user majakabiljo opened a pull request:

https://github.com/apache/giraph/pull/32

GIRAPH-1140: Cleanup temp files in hdfs after job is done

Summary: Currently we are not cleaning up temp files we create in hdfs, fix 
it.

Test Plan: Ran a few jobs (successful, failed, killed), verified files are 
removed in all cases.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majakabiljo/giraph zkCleanup

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/32.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #32


commit d42386e8430d0d66a49b7163e9579b9dd1b88747
Author: Maja Kabiljo 
Date:   2017-03-30T17:32:35Z

GIRAPH-1140: Cleanup temp files in hdfs after job is done

Summary: Currently we are not cleaning up temp files we create in hdfs, fix 
it.

Test Plan: Ran a few jobs (successful, failed, killed), verified files are 
removed in all cases.




> Cleanup temp files in hdfs after job is done
> 
>
> Key: GIRAPH-1140
> URL: https://issues.apache.org/jira/browse/GIRAPH-1140
> Project: Giraph
>  Issue Type: Bug
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>
> Currently we are not cleaning up temp files we create in hdfs, fix it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1138) Don't wrap exceptions from executor service

2017-03-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15949737#comment-15949737
 ] 

ASF GitHub Bot commented on GIRAPH-1138:


Github user asfgit closed the pull request at:

https://github.com/apache/giraph/pull/27


> Don't wrap exceptions from executor service
> ---
>
> Key: GIRAPH-1138
> URL: https://issues.apache.org/jira/browse/GIRAPH-1138
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> In ProgressableUtils.getResultsWithNCallables we wrap exceptions from 
> underlying threads, making logs hard to read. We should re-throw original 
> exception when possible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1140) Cleanup temp files in hdfs after job is done

2017-03-31 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951611#comment-15951611
 ] 

ASF GitHub Bot commented on GIRAPH-1140:


Github user majakabiljo closed the pull request at:

https://github.com/apache/giraph/pull/32


> Cleanup temp files in hdfs after job is done
> 
>
> Key: GIRAPH-1140
> URL: https://issues.apache.org/jira/browse/GIRAPH-1140
> Project: Giraph
>  Issue Type: Bug
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>
> Currently we are not cleaning up temp files we create in hdfs, fix it.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1141) Kill the job if no progress is being made

2017-03-31 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951612#comment-15951612
 ] 

ASF GitHub Bot commented on GIRAPH-1141:


GitHub user majakabiljo opened a pull request:

https://github.com/apache/giraph/pull/33

GIRAPH-1141: Kill the job if no progress is being made

Summary: Sometimes jobs can get stuck for various reasons, it's better to 
have an option to kill them then to keep them running holding resources.

Test Plan: Ran a large job with shorter limit and verified it gets killed. 
Also ran normal successful job. mvn verify

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majakabiljo/giraph progress

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/33.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #33


commit 7b571bbbefade67b3b77273e6b291318399d810e
Author: Maja Kabiljo 
Date:   2017-03-31T20:40:48Z

GIRAPH-1141: Kill the job if no progress is being made

Summary: Sometimes jobs can get stuck for various reasons, it's better to 
have an option to kill them then to keep them running holding resources.

Test Plan: Ran a large job with shorter limit and verified it gets killed. 
Also ran normal successful job. mvn verify




> Kill the job if no progress is being made
> -
>
> Key: GIRAPH-1141
> URL: https://issues.apache.org/jira/browse/GIRAPH-1141
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> Sometimes jobs can get stuck for various reasons, it's better to have an 
> option to kill them then to keep them running holding resources.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1141) Kill the job if no progress is being made

2017-03-31 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951648#comment-15951648
 ] 

ASF GitHub Bot commented on GIRAPH-1141:


Github user dlogothetis commented on a diff in the pull request:

https://github.com/apache/giraph/pull/33#discussion_r109250592
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/job/CombinedWorkerProgress.java ---
@@ -204,4 +215,19 @@ public String toString() {
 }
 return sb.toString();
   }
+
+  /**
+   * Check if this instance made progress from another instance
+   *
+   * @param lastProgress Instance to compare with
+   * @return True iff progress was made
+   */
+  public boolean madeProgressFrom(CombinedWorkerProgress lastProgress) {
--- End diff --

Why not use the underlying raw numbers instead of the string? For instance, 
small changes in memory may not really mean progress.


> Kill the job if no progress is being made
> -
>
> Key: GIRAPH-1141
> URL: https://issues.apache.org/jira/browse/GIRAPH-1141
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> Sometimes jobs can get stuck for various reasons, it's better to have an 
> option to kill them then to keep them running holding resources.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1141) Kill the job if no progress is being made

2017-03-31 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15951657#comment-15951657
 ] 

ASF GitHub Bot commented on GIRAPH-1141:


Github user majakabiljo commented on a diff in the pull request:

https://github.com/apache/giraph/pull/33#discussion_r109251665
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/job/CombinedWorkerProgress.java ---
@@ -204,4 +215,19 @@ public String toString() {
 }
 return sb.toString();
   }
+
+  /**
+   * Check if this instance made progress from another instance
+   *
+   * @param lastProgress Instance to compare with
+   * @return True iff progress was made
+   */
+  public boolean madeProgressFrom(CombinedWorkerProgress lastProgress) {
--- End diff --

That's why I separated getProgressString from toString, to only contain 
actual progress. For different supersteps we are looking at different numbers 
so this seemed the easiest to compare instead of having all the if-s.


> Kill the job if no progress is being made
> -
>
> Key: GIRAPH-1141
> URL: https://issues.apache.org/jira/browse/GIRAPH-1141
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> Sometimes jobs can get stuck for various reasons, it's better to have an 
> option to kill them then to keep them running holding resources.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1141) Kill the job if no progress is being made

2017-04-03 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953731#comment-15953731
 ] 

ASF GitHub Bot commented on GIRAPH-1141:


Github user dlogothetis commented on the issue:

https://github.com/apache/giraph/pull/33
  
i missed this detail. looks ok to me.


> Kill the job if no progress is being made
> -
>
> Key: GIRAPH-1141
> URL: https://issues.apache.org/jira/browse/GIRAPH-1141
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> Sometimes jobs can get stuck for various reasons, it's better to have an 
> option to kill them then to keep them running holding resources.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1141) Kill the job if no progress is being made

2017-04-04 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955369#comment-15955369
 ] 

ASF GitHub Bot commented on GIRAPH-1141:


Github user asfgit closed the pull request at:

https://github.com/apache/giraph/pull/33


> Kill the job if no progress is being made
> -
>
> Key: GIRAPH-1141
> URL: https://issues.apache.org/jira/browse/GIRAPH-1141
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> Sometimes jobs can get stuck for various reasons, it's better to have an 
> option to kill them then to keep them running holding resources.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-04-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963189#comment-15963189
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user neggert commented on the issue:

https://github.com/apache/giraph/pull/30
  
Could someone please take a look at this? @majakabiljo @edunov 


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-04-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969747#comment-15969747
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user edunov commented on the issue:

https://github.com/apache/giraph/pull/30
  
Hi Nic, thank you for fixing this and sorry for the delay. 
I'll take a look at this diff other the weekend


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-04-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971306#comment-15971306
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user edunov commented on a diff in the pull request:

https://github.com/apache/giraph/pull/30#discussion_r111767933
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java ---
@@ -1734,7 +1735,7 @@ private CheckpointStatus getCheckpointStatus(long 
superstep) {
 if (checkpointFrequency == 0) {
   return CheckpointStatus.NONE;
 }
-long firstCheckpoint = INPUT_SUPERSTEP + 1 + checkpointFrequency;
+long firstCheckpoint = INPUT_SUPERSTEP + 1;
--- End diff --

What is the reason for changing this? Do you want it to always do 
checkpoint after the first superstep?


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-04-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971315#comment-15971315
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user neggert commented on the issue:

https://github.com/apache/giraph/pull/30
  
I'm finding that roughly half of my ~2 hour job time is spent just loading 
the graph. Resuming from checkpoint lets me skip that step if something fails. 
This is also why I want to make sure that Giraph checkpoints before starting 
superstep 0. (It used to work this way, and it's documented in the Giraph book.)


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-04-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971326#comment-15971326
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user neggert commented on a diff in the pull request:

https://github.com/apache/giraph/pull/30#discussion_r111771953
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java ---
@@ -1734,7 +1735,7 @@ private CheckpointStatus getCheckpointStatus(long 
superstep) {
 if (checkpointFrequency == 0) {
   return CheckpointStatus.NONE;
 }
-long firstCheckpoint = INPUT_SUPERSTEP + 1 + checkpointFrequency;
+long firstCheckpoint = INPUT_SUPERSTEP + 1;
--- End diff --

This will actually checkpoint before running superstep 0. You're basically 
checkpointing the input loading work.


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1143) org.apache.hadoop.mapreduce.JobID.forName job id parsing fails on non legacy Hadoop clusters

2017-04-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971460#comment-15971460
 ] 

ASF GitHub Bot commented on GIRAPH-1143:


GitHub user aching opened a pull request:

https://github.com/apache/giraph/pull/35

GIRAPH-1143

Handle non-legacy Hadoop job errors nicely.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aching/giraph giraph-1143

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/35.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #35


commit cad037526c7b41ff46c4de824dac205c349a20ba
Author: Avery Ching 
Date:   2017-04-17T18:30:17Z

GIRAPH-1143




>  org.apache.hadoop.mapreduce.JobID.forName job id parsing fails on non legacy 
> Hadoop clusters
> -
>
> Key: GIRAPH-1143
> URL: https://issues.apache.org/jira/browse/GIRAPH-1143
> Project: Giraph
>  Issue Type: Bug
>Reporter: Avery Ching
>Assignee: Avery Ching
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1143) org.apache.hadoop.mapreduce.JobID.forName job id parsing fails on non legacy Hadoop clusters

2017-04-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971506#comment-15971506
 ] 

ASF GitHub Bot commented on GIRAPH-1143:


Github user aching commented on the issue:

https://github.com/apache/giraph/pull/35
  
Merged.


>  org.apache.hadoop.mapreduce.JobID.forName job id parsing fails on non legacy 
> Hadoop clusters
> -
>
> Key: GIRAPH-1143
> URL: https://issues.apache.org/jira/browse/GIRAPH-1143
> Project: Giraph
>  Issue Type: Bug
>Reporter: Avery Ching
>Assignee: Avery Ching
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1143) org.apache.hadoop.mapreduce.JobID.forName job id parsing fails on non legacy Hadoop clusters

2017-04-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15971505#comment-15971505
 ] 

ASF GitHub Bot commented on GIRAPH-1143:


Github user aching closed the pull request at:

https://github.com/apache/giraph/pull/35


>  org.apache.hadoop.mapreduce.JobID.forName job id parsing fails on non legacy 
> Hadoop clusters
> -
>
> Key: GIRAPH-1143
> URL: https://issues.apache.org/jira/browse/GIRAPH-1143
> Project: Giraph
>  Issue Type: Bug
>Reporter: Avery Ching
>Assignee: Avery Ching
>Priority: Trivial
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-04-21 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15979370#comment-15979370
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user majakabiljo commented on the issue:

https://github.com/apache/giraph/pull/30
  
Can we get rid of getHostnamePartitionId() to avoid incorrectly using it in 
the future? I see various other places where taskPartition is used for 
identifier, do any of them need to be updated too?


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-04-23 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15980471#comment-15980471
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user neggert commented on the issue:

https://github.com/apache/giraph/pull/30
  
There actually is [one instance][1] where I think it's okay to use 
`getHostnamePartitionId`. This happens in `BspServiceMaster.becomeMaster` when 
creating a master bid in ZK, before any `TaskInfo` instance is created.

This does make me realize, though, that I need to make the same change to 
how the task id is set in `BspServiceMaster`.

What about just changing how `taskPartition` is set in `BspService`, like 
so?

this.taskPartition = (int)getApplicationAttempt() * 
conf.getMaxWorkers() + getTaskPartition();

I think this is actually the minimal code change to fix the issue. I don't 
see anywhere in the code that actually cares about the task partition as 
anything other than a unique identifier.

[1]: 
https://github.com/apache/giraph/blob/trunk/giraph-core/src/main/java/org/apache/giraph/master/BspServiceMaster.java#L805


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-04-24 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981327#comment-15981327
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user neggert commented on the issue:

https://github.com/apache/giraph/pull/30
  
This is ready for another look. I've replaced partition id with task id. 
The only place that actually needs a partition id is logging missing workers in 
`BspServiceMaster`. In that case, it's easy enough to recover the partition id 
by taking the task id modulo the number of workers.


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1146) Keep track of number of supersteps when possible

2017-05-04 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997364#comment-15997364
 ] 

ASF GitHub Bot commented on GIRAPH-1146:


GitHub user majakabiljo opened a pull request:

https://github.com/apache/giraph/pull/36

GIRAPH-1146: Keep track of number of supersteps when possible

Summary: In many cases we know how many supersteps are there going to be. 
We can keep track of it and log it with progress.

Test Plan: Ran a job, example log line:
 Data from 3 workers - Compute superstep 5 (out of 6): 171824 out of 
1304814 vertices computed; 19 out of 252 partitions computed

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majakabiljo/giraph numSupersteps

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/36.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #36


commit 4a4d69b674d327f3cd092036a7e3fec4ffb4c2ce
Author: Maja Kabiljo 
Date:   2017-05-04T20:07:22Z

GIRAPH-1146: Keep track of number of supersteps when possible

Summary: In many cases we know how many supersteps are there going to be. 
We can keep track of it and log it with progress.

Test Plan: Ran a job, example log line:
 Data from 3 workers - Compute superstep 5 (out of 6): 171824 out of 
1304814 vertices computed; 19 out of 252 partitions computed




> Keep track of number of supersteps when possible
> 
>
> Key: GIRAPH-1146
> URL: https://issues.apache.org/jira/browse/GIRAPH-1146
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> In many cases we know how many supersteps are there going to be. We can keep 
> track of it and log it with progress.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1138) Don't wrap exceptions from executor service

2017-05-04 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997373#comment-15997373
 ] 

ASF GitHub Bot commented on GIRAPH-1138:


GitHub user majakabiljo opened a pull request:

https://github.com/apache/giraph/pull/37

[GIRAPH-1138] Don't wrap exceptions from executor service

Summary: In ProgressableUtils.getResultsWithNCallables we wrap exceptions 
from underlying threads, making logs hard to read. We should re-throw original 
exception when possible. (accidentally closed #27)

Test Plan: Ran a job which fails in one of input threads before and after 
change, verified exception is clear now

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majakabiljo/giraph exceptions2

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/37.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #37


commit c859833de5a28e580b38dc46a8b371cd82b6b6d0
Author: Maja Kabiljo 
Date:   2017-05-04T20:15:44Z

[GIRAPH-1138] Don't wrap exceptions from executor service

Summary: In ProgressableUtils.getResultsWithNCallables we wrap exceptions 
from underlying threads, making logs hard to read. We should re-throw original 
exception when possible. (accidentally closed #27)

Test Plan: Ran a job which fails in one of input threads before and after 
change, verified exception is clear now




> Don't wrap exceptions from executor service
> ---
>
> Key: GIRAPH-1138
> URL: https://issues.apache.org/jira/browse/GIRAPH-1138
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> In ProgressableUtils.getResultsWithNCallables we wrap exceptions from 
> underlying threads, making logs hard to read. We should re-throw original 
> exception when possible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1146) Keep track of number of supersteps when possible

2017-05-05 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998634#comment-15998634
 ] 

ASF GitHub Bot commented on GIRAPH-1146:


Github user ikabiljo commented on a diff in the pull request:

https://github.com/apache/giraph/pull/36#discussion_r115049370
  
--- Diff: 
giraph-block-app/src/main/java/org/apache/giraph/block_app/framework/block/PieceCount.java
 ---
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.giraph.block_app.framework.block;
+
+import com.google.common.base.Objects;
+
+/**
+ * Number of pieces
+ */
+public class PieceCount {
+  private boolean known;
+  private int count;
+
+  public PieceCount(int count) {
+known = true;
+this.count = count;
+  }
+
+  private PieceCount() {
+known = false;
+  }
+
+  public static PieceCount createUnknownCount() {
+return new PieceCount();
+  }
+
+
+  public PieceCount add(PieceCount other) {
+if (!this.known || !other.known) {
+  known = false;
+} else {
+  count += other.count;
+}
+return this;
+  }
+
+  public PieceCount multiply(int value) {
+count *= value;
+return this;
+  }
+
+  public int getCount() {
+return known ? count : Integer.MAX_VALUE;
--- End diff --

this might easily lead to overflow if anything is done with this number.

You should either fatal (better), or return 1M or something here.


> Keep track of number of supersteps when possible
> 
>
> Key: GIRAPH-1146
> URL: https://issues.apache.org/jira/browse/GIRAPH-1146
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> In many cases we know how many supersteps are there going to be. We can keep 
> track of it and log it with progress.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1146) Keep track of number of supersteps when possible

2017-05-05 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998637#comment-15998637
 ] 

ASF GitHub Bot commented on GIRAPH-1146:


Github user ikabiljo commented on a diff in the pull request:

https://github.com/apache/giraph/pull/36#discussion_r115049666
  
--- Diff: 
giraph-block-app/src/main/java/org/apache/giraph/block_app/framework/piece/delegate/DelegatePiece.java
 ---
@@ -249,6 +250,15 @@ public void 
forAllPossiblePieces(Consumer consumer) {
 }
   }
 
+  @Override
+  public PieceCount getPieceCount() {
+PieceCount ret = new PieceCount(0);
--- End diff --

this should be 1, this executes all pieces simultaneously.


> Keep track of number of supersteps when possible
> 
>
> Key: GIRAPH-1146
> URL: https://issues.apache.org/jira/browse/GIRAPH-1146
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> In many cases we know how many supersteps are there going to be. We can keep 
> track of it and log it with progress.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1146) Keep track of number of supersteps when possible

2017-05-05 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998635#comment-15998635
 ] 

ASF GitHub Bot commented on GIRAPH-1146:


Github user ikabiljo commented on a diff in the pull request:

https://github.com/apache/giraph/pull/36#discussion_r115049981
  
--- Diff: 
giraph-block-app/src/main/java/org/apache/giraph/block_app/framework/BlockUtils.java
 ---
@@ -147,6 +148,11 @@ public static void 
initAndCheckConfig(GiraphConfiguration conf) {
 checkBlockTypes(
 executionBlock, blockFactory.createExecutionStage(immConf), 
immConf);
 
+PieceCount pieceCount = executionBlock.getPieceCount();
+if (pieceCount.isKnown()) {
+  GiraphConstants.SUPERSTEP_COUNT.set(conf, pieceCount.getCount());
--- End diff --

shouldn't it be pieceCount.getCount() + 1 ? 


> Keep track of number of supersteps when possible
> 
>
> Key: GIRAPH-1146
> URL: https://issues.apache.org/jira/browse/GIRAPH-1146
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> In many cases we know how many supersteps are there going to be. We can keep 
> track of it and log it with progress.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1146) Keep track of number of supersteps when possible

2017-05-05 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998636#comment-15998636
 ] 

ASF GitHub Bot commented on GIRAPH-1146:


Github user ikabiljo commented on a diff in the pull request:

https://github.com/apache/giraph/pull/36#discussion_r115049146
  
--- Diff: 
giraph-block-app/src/main/java/org/apache/giraph/block_app/framework/block/EmptyBlock.java
 ---
@@ -36,4 +36,9 @@
   @Override
   public void forAllPossiblePieces(Consumer consumer) {
   }
+
+  @Override
+  public PieceCount getPieceCount() {
+return new PieceCount(1);
--- End diff --

should be 0


> Keep track of number of supersteps when possible
> 
>
> Key: GIRAPH-1146
> URL: https://issues.apache.org/jira/browse/GIRAPH-1146
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> In many cases we know how many supersteps are there going to be. We can keep 
> track of it and log it with progress.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1146) Keep track of number of supersteps when possible

2017-05-05 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998670#comment-15998670
 ] 

ASF GitHub Bot commented on GIRAPH-1146:


Github user majakabiljo commented on a diff in the pull request:

https://github.com/apache/giraph/pull/36#discussion_r115053845
  
--- Diff: 
giraph-block-app/src/main/java/org/apache/giraph/block_app/framework/block/PieceCount.java
 ---
@@ -0,0 +1,86 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.giraph.block_app.framework.block;
+
+import com.google.common.base.Objects;
+
+/**
+ * Number of pieces
+ */
+public class PieceCount {
+  private boolean known;
+  private int count;
+
+  public PieceCount(int count) {
+known = true;
+this.count = count;
+  }
+
+  private PieceCount() {
+known = false;
+  }
+
+  public static PieceCount createUnknownCount() {
+return new PieceCount();
+  }
+
+
+  public PieceCount add(PieceCount other) {
+if (!this.known || !other.known) {
+  known = false;
+} else {
+  count += other.count;
+}
+return this;
+  }
+
+  public PieceCount multiply(int value) {
+count *= value;
+return this;
+  }
+
+  public int getCount() {
+return known ? count : Integer.MAX_VALUE;
--- End diff --

Good point, I'll throw instead


> Keep track of number of supersteps when possible
> 
>
> Key: GIRAPH-1146
> URL: https://issues.apache.org/jira/browse/GIRAPH-1146
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> In many cases we know how many supersteps are there going to be. We can keep 
> track of it and log it with progress.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1146) Keep track of number of supersteps when possible

2017-05-05 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15998668#comment-15998668
 ] 

ASF GitHub Bot commented on GIRAPH-1146:


Github user majakabiljo commented on a diff in the pull request:

https://github.com/apache/giraph/pull/36#discussion_r115053732
  
--- Diff: 
giraph-block-app/src/main/java/org/apache/giraph/block_app/framework/BlockUtils.java
 ---
@@ -147,6 +148,11 @@ public static void 
initAndCheckConfig(GiraphConfiguration conf) {
 checkBlockTypes(
 executionBlock, blockFactory.createExecutionStage(immConf), 
immConf);
 
+PieceCount pieceCount = executionBlock.getPieceCount();
+if (pieceCount.isKnown()) {
+  GiraphConstants.SUPERSTEP_COUNT.set(conf, pieceCount.getCount());
--- End diff --

There will be X+1 supersteps, but they are going to be supersteps 0..X. 
Actually I can make +1 here and then -1 in the logging part to keep it clear.


> Keep track of number of supersteps when possible
> 
>
> Key: GIRAPH-1146
> URL: https://issues.apache.org/jira/browse/GIRAPH-1146
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> In many cases we know how many supersteps are there going to be. We can keep 
> track of it and log it with progress.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1146) Keep track of number of supersteps when possible

2017-05-09 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16003732#comment-16003732
 ] 

ASF GitHub Bot commented on GIRAPH-1146:


Github user ikabiljo commented on a diff in the pull request:

https://github.com/apache/giraph/pull/36#discussion_r115624215
  
--- Diff: 
giraph-block-app/src/main/java/org/apache/giraph/block_app/framework/piece/delegate/DelegatePiece.java
 ---
@@ -249,6 +250,11 @@ public void 
forAllPossiblePieces(Consumer consumer) {
 }
   }
 
+  @Override
--- End diff --

you don't need to extend, default is 1 for any piece :)


> Keep track of number of supersteps when possible
> 
>
> Key: GIRAPH-1146
> URL: https://issues.apache.org/jira/browse/GIRAPH-1146
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> In many cases we know how many supersteps are there going to be. We can keep 
> track of it and log it with progress.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1146) Keep track of number of supersteps when possible

2017-05-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004770#comment-16004770
 ] 

ASF GitHub Bot commented on GIRAPH-1146:


Github user asfgit closed the pull request at:

https://github.com/apache/giraph/pull/36


> Keep track of number of supersteps when possible
> 
>
> Key: GIRAPH-1146
> URL: https://issues.apache.org/jira/browse/GIRAPH-1146
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> In many cases we know how many supersteps are there going to be. We can keep 
> track of it and log it with progress.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1138) Don't wrap exceptions from executor service

2017-05-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004955#comment-16004955
 ] 

ASF GitHub Bot commented on GIRAPH-1138:


Github user dlogothetis commented on a diff in the pull request:

https://github.com/apache/giraph/pull/37#discussion_r115788093
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/utils/ProgressableUtils.java ---
@@ -270,8 +270,16 @@ public static void awaitSemaphorePermits(final 
Semaphore semaphore,
   // Try to get result from the future
   result = entry.getValue().get(
   MSEC_TO_WAIT_ON_EACH_FUTURE, TimeUnit.MILLISECONDS);
-} catch (InterruptedException | ExecutionException e) {
-  throw new IllegalStateException("Exception occurred", e);
+} catch (InterruptedException e) {
+  throw new IllegalStateException("Interrupted", e);
+} catch (ExecutionException e) {
+  // Execution exception wraps the actual cause
+  if (e.getCause() instanceof RuntimeException) {
--- End diff --

Is it ever possible that e.getCause() is null?


> Don't wrap exceptions from executor service
> ---
>
> Key: GIRAPH-1138
> URL: https://issues.apache.org/jira/browse/GIRAPH-1138
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> In ProgressableUtils.getResultsWithNCallables we wrap exceptions from 
> underlying threads, making logs hard to read. We should re-throw original 
> exception when possible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1138) Don't wrap exceptions from executor service

2017-05-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004968#comment-16004968
 ] 

ASF GitHub Bot commented on GIRAPH-1138:


Github user dlogothetis commented on a diff in the pull request:

https://github.com/apache/giraph/pull/37#discussion_r115789355
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/utils/ProgressableUtils.java ---
@@ -270,8 +270,16 @@ public static void awaitSemaphorePermits(final 
Semaphore semaphore,
   // Try to get result from the future
   result = entry.getValue().get(
   MSEC_TO_WAIT_ON_EACH_FUTURE, TimeUnit.MILLISECONDS);
-} catch (InterruptedException | ExecutionException e) {
-  throw new IllegalStateException("Exception occurred", e);
+} catch (InterruptedException e) {
+  throw new IllegalStateException("Interrupted", e);
+} catch (ExecutionException e) {
+  // Execution exception wraps the actual cause
+  if (e.getCause() instanceof RuntimeException) {
--- End diff --

Apparently a null check is not needed, 
http://stackoverflow.com/questions/2950319/is-null-check-needed-before-calling-instanceof


> Don't wrap exceptions from executor service
> ---
>
> Key: GIRAPH-1138
> URL: https://issues.apache.org/jira/browse/GIRAPH-1138
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> In ProgressableUtils.getResultsWithNCallables we wrap exceptions from 
> underlying threads, making logs hard to read. We should re-throw original 
> exception when possible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1138) Don't wrap exceptions from executor service

2017-05-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004970#comment-16004970
 ] 

ASF GitHub Bot commented on GIRAPH-1138:


Github user dlogothetis commented on the issue:

https://github.com/apache/giraph/pull/37
  
Looks good to me.


> Don't wrap exceptions from executor service
> ---
>
> Key: GIRAPH-1138
> URL: https://issues.apache.org/jira/browse/GIRAPH-1138
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> In ProgressableUtils.getResultsWithNCallables we wrap exceptions from 
> underlying threads, making logs hard to read. We should re-throw original 
> exception when possible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1138) Don't wrap exceptions from executor service

2017-05-10 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16004999#comment-16004999
 ] 

ASF GitHub Bot commented on GIRAPH-1138:


Github user asfgit closed the pull request at:

https://github.com/apache/giraph/pull/37


> Don't wrap exceptions from executor service
> ---
>
> Key: GIRAPH-1138
> URL: https://issues.apache.org/jira/browse/GIRAPH-1138
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> In ProgressableUtils.getResultsWithNCallables we wrap exceptions from 
> underlying threads, making logs hard to read. We should re-throw original 
> exception when possible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1147) Store timestamps when various fractions of input were done

2017-05-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015991#comment-16015991
 ] 

ASF GitHub Bot commented on GIRAPH-1147:


GitHub user majakabiljo opened a pull request:

https://github.com/apache/giraph/pull/38

[GIRAPH-1147] Store timestamps when various fractions of input were done

Summary: In order to evaluate how read stragglers affect job performance, 
add a way to expose timestamps when various fractions of input were done 
reading through counters.

Test Plan: Ran a big job and verified counters are set correctly.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majakabiljo/giraph inputCounters

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/38.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #38


commit 56305ecfeb223313fc63e331cb815bfdc2430731
Author: Maja Kabiljo 
Date:   2017-05-18T15:30:05Z

[GIRAPH-1147] Store timestamps when various fractions of input were done

Summary: In order to evaluate how read stragglers affect job performance, 
add a way to expose timestamps when various fractions of input were done 
reading through counters.

Test Plan: Ran a big job and verified counters are set correctly.




> Store timestamps when various fractions of input were done
> --
>
> Key: GIRAPH-1147
> URL: https://issues.apache.org/jira/browse/GIRAPH-1147
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> In order to evaluate how read stragglers affect job performance, add a way to 
> expose timestamps when various fractions of input were done reading through 
> counters.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1147) Store timestamps when various fractions of input were done

2017-05-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016531#comment-16016531
 ] 

ASF GitHub Bot commented on GIRAPH-1147:


Github user dlogothetis commented on the issue:

https://github.com/apache/giraph/pull/38
  
As opposed to timestamps why not set the counters to the time passed 
between the different fractions? That's going to be easier to parse quickly.


> Store timestamps when various fractions of input were done
> --
>
> Key: GIRAPH-1147
> URL: https://issues.apache.org/jira/browse/GIRAPH-1147
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> In order to evaluate how read stragglers affect job performance, add a way to 
> expose timestamps when various fractions of input were done reading through 
> counters.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1147) Store timestamps when various fractions of input were done

2017-05-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016535#comment-16016535
 ] 

ASF GitHub Bot commented on GIRAPH-1147:


Github user dlogothetis commented on a diff in the pull request:

https://github.com/apache/giraph/pull/38#discussion_r117364342
  
--- Diff: 
giraph-core/src/main/java/org/apache/giraph/master/input/MasterInputSplitsHandler.java
 ---
@@ -56,16 +69,39 @@
   /** Latches to say when one input splits type is ready to be accessed */
   private Map latchesMap =
   new EnumMap<>(InputType.class);
+  /** Context for accessing counters */
+  private final Mapper.Context context;
+  /** How many splits per type are there total */
+  private final Map numSplitsPerType =
+  new EnumMap<>(InputType.class);
+  /** How many splits per type have been read so far */
+  private final Map numSplitsReadPerType =
+  new EnumMap<>(InputType.class);
+  /**
+   * Store in counters timestamps when we finished reading
+   * these fractions of input
+   */
+  private final double[] doneFractionsToStoreInCoutners;
--- End diff --

Typo in field name.


> Store timestamps when various fractions of input were done
> --
>
> Key: GIRAPH-1147
> URL: https://issues.apache.org/jira/browse/GIRAPH-1147
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> In order to evaluate how read stragglers affect job performance, add a way to 
> expose timestamps when various fractions of input were done reading through 
> counters.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1147) Store timestamps when various fractions of input were done

2017-05-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016604#comment-16016604
 ] 

ASF GitHub Bot commented on GIRAPH-1147:


Github user majakabiljo commented on the issue:

https://github.com/apache/giraph/pull/38
  
Updated with comments and tested again


> Store timestamps when various fractions of input were done
> --
>
> Key: GIRAPH-1147
> URL: https://issues.apache.org/jira/browse/GIRAPH-1147
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> In order to evaluate how read stragglers affect job performance, add a way to 
> expose timestamps when various fractions of input were done reading through 
> counters.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1147) Store timestamps when various fractions of input were done

2017-05-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016675#comment-16016675
 ] 

ASF GitHub Bot commented on GIRAPH-1147:


Github user dlogothetis commented on the issue:

https://github.com/apache/giraph/pull/38
  
Looks ok to me.

Btw, did you squash the commits? You don't need to, they are squashed 
automatically when the pull request is merged. And we don't loose the diff 
between updates.


> Store timestamps when various fractions of input were done
> --
>
> Key: GIRAPH-1147
> URL: https://issues.apache.org/jira/browse/GIRAPH-1147
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> In order to evaluate how read stragglers affect job performance, add a way to 
> expose timestamps when various fractions of input were done reading through 
> counters.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1147) Store timestamps when various fractions of input were done

2017-05-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016742#comment-16016742
 ] 

ASF GitHub Bot commented on GIRAPH-1147:


Github user majakabiljo commented on the issue:

https://github.com/apache/giraph/pull/38
  
Ah didn't know, will do in the future


> Store timestamps when various fractions of input were done
> --
>
> Key: GIRAPH-1147
> URL: https://issues.apache.org/jira/browse/GIRAPH-1147
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> In order to evaluate how read stragglers affect job performance, add a way to 
> expose timestamps when various fractions of input were done reading through 
> counters.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1147) Store timestamps when various fractions of input were done

2017-05-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16016758#comment-16016758
 ] 

ASF GitHub Bot commented on GIRAPH-1147:


Github user asfgit closed the pull request at:

https://github.com/apache/giraph/pull/38


> Store timestamps when various fractions of input were done
> --
>
> Key: GIRAPH-1147
> URL: https://issues.apache.org/jira/browse/GIRAPH-1147
> Project: Giraph
>  Issue Type: New Feature
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>Priority: Minor
>
> In order to evaluate how read stragglers affect job performance, add a way to 
> expose timestamps when various fractions of input were done reading through 
> counters.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components

2017-05-30 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029759#comment-16029759
 ] 

ASF GitHub Bot commented on GIRAPH-1148:


GitHub user majakabiljo opened a pull request:

https://github.com/apache/giraph/pull/39

[GIRAPH-1148] Connected components - make calculate sizes work with l…

…arge number of components

Summary: Currently if we have a graph with large number of connected 
components, calculating connected components sizes fails because reducer 
becomes too large. Use array of handles instead.

Test Plan: Successfully ran the job which was failing without this change

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/majakabiljo/giraph cc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/39.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #39


commit 0b9ef9415a67737122e6fb6caeca5307df9632a1
Author: Maja Kabiljo 
Date:   2017-05-30T17:10:17Z

[GIRAPH-1148] Connected components - make calculate sizes work with large 
number of components

Summary: Currently if we have a graph with large number of connected 
components, calculating connected components sizes fails because reducer 
becomes too large. Use array of handles instead.

Test Plan: Successfully ran the job which was failing without this change




> Connected components - make calculate sizes work with large number of 
> components
> 
>
> Key: GIRAPH-1148
> URL: https://issues.apache.org/jira/browse/GIRAPH-1148
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>
> Currently if we have a graph with large number of connected components, 
> calculating connected components sizes fails because reducer becomes too 
> large. Use array of handles instead.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-05-31 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16031358#comment-16031358
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user neggert commented on the issue:

https://github.com/apache/giraph/pull/30
  
ping @edunov @majakabiljo 


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components

2017-06-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033356#comment-16033356
 ] 

ASF GitHub Bot commented on GIRAPH-1148:


Github user dlogothetis commented on a diff in the pull request:

https://github.com/apache/giraph/pull/39#discussion_r119680240
  
--- Diff: 
giraph-block-app/src/main/java/org/apache/giraph/block_app/library/Pieces.java 
---
@@ -320,6 +325,91 @@ public String toString() {
   }
 
   /**
+   * Like reduceAndBroadcast, but uses array of handles for reducers and
+   * broadcasts, to make it feasible and performant when values are large.
+   * Each supplied value to reduce will be reduced in the handle defined by
+   * handleHashSupplier%numHandles
+   *
+   * @param  Single value type, objects passed on workers
+   * @param  Reduced value type
+   * @param  Vertex id type
+   * @param  Vertex value type
+   * @param  Edge value type
+   */
+  public static
+  
+  Piece reduceAndBroadcastWithArrayOfHandles(
+  final String name,
+  final int numHandles,
+  final ReduceOperation reduceOp,
+  final SupplierFromVertex handleHashSupplier,
+  final SupplierFromVertex valueSupplier,
+  final ConsumerWithVertex reducedValueConsumer) {
+return new Piece() {
+  protected ArrayOfHandles.ArrayOfReducers reducers;
+  protected BroadcastArrayHandle broadcasts;
+
+  private int getHandleIndex(Vertex vertex) {
+return (int) Math.abs(handleHashSupplier.get(vertex) % numHandles);
+  }
+
+  @Override
+  public void registerReducers(
+  final CreateReducersApi reduceApi, Object executionStage) {
+reducers = new ArrayOfHandles.ArrayOfReducers<>(
+numHandles,
+new Supplier>() {
+  @Override
+  public ReducerHandle get() {
+return reduceApi.createLocalReducer(reduceOp);
--- End diff --

This means that the same ReduceOperation instance is going to be shared 
across the different handles. Not entirely sure how the (de)serialization will 
work here. 


> Connected components - make calculate sizes work with large number of 
> components
> 
>
> Key: GIRAPH-1148
> URL: https://issues.apache.org/jira/browse/GIRAPH-1148
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>
> Currently if we have a graph with large number of connected components, 
> calculating connected components sizes fails because reducer becomes too 
> large. Use array of handles instead.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components

2017-06-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033358#comment-16033358
 ] 

ASF GitHub Bot commented on GIRAPH-1148:


Github user dlogothetis commented on a diff in the pull request:

https://github.com/apache/giraph/pull/39#discussion_r119681042
  
--- Diff: 
giraph-block-app/src/main/java/org/apache/giraph/block_app/library/Pieces.java 
---
@@ -320,6 +325,91 @@ public String toString() {
   }
 
   /**
+   * Like reduceAndBroadcast, but uses array of handles for reducers and
+   * broadcasts, to make it feasible and performant when values are large.
+   * Each supplied value to reduce will be reduced in the handle defined by
+   * handleHashSupplier%numHandles
+   *
+   * @param  Single value type, objects passed on workers
+   * @param  Reduced value type
+   * @param  Vertex id type
+   * @param  Vertex value type
+   * @param  Edge value type
+   */
+  public static
+  
+  Piece reduceAndBroadcastWithArrayOfHandles(
+  final String name,
+  final int numHandles,
+  final ReduceOperation reduceOp,
+  final SupplierFromVertex handleHashSupplier,
+  final SupplierFromVertex valueSupplier,
+  final ConsumerWithVertex reducedValueConsumer) {
+return new Piece() {
+  protected ArrayOfHandles.ArrayOfReducers reducers;
+  protected BroadcastArrayHandle broadcasts;
+
+  private int getHandleIndex(Vertex vertex) {
+return (int) Math.abs(handleHashSupplier.get(vertex) % numHandles);
+  }
+
+  @Override
+  public void registerReducers(
+  final CreateReducersApi reduceApi, Object executionStage) {
+reducers = new ArrayOfHandles.ArrayOfReducers<>(
+numHandles,
+new Supplier>() {
+  @Override
+  public ReducerHandle get() {
+return reduceApi.createLocalReducer(reduceOp);
--- End diff --

Actually, because you're using an ArrayOfHandles, this is going to be 
serialized-deserialized as a whole. This should be ok.

Though, this also assumes that reduce operations are stateless. In general, 
they are but there's nothing that prevents from writing a stateful one. Maybe 
add a comment about this.


> Connected components - make calculate sizes work with large number of 
> components
> 
>
> Key: GIRAPH-1148
> URL: https://issues.apache.org/jira/browse/GIRAPH-1148
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>
> Currently if we have a graph with large number of connected components, 
> calculating connected components sizes fails because reducer becomes too 
> large. Use array of handles instead.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-06-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033370#comment-16033370
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user majakabiljo commented on a diff in the pull request:

https://github.com/apache/giraph/pull/30#discussion_r119682732
  
--- Diff: giraph-core/src/main/java/org/apache/giraph/bsp/BspService.java 
---
@@ -288,6 +289,9 @@ public BspService(
   throw new RuntimeException(e);
 }
 
+this.taskId = (int)getApplicationAttempt() * conf.getMaxWorkers() + 
conf.getTaskPartition();
--- End diff --

Please fix checkstyle (make sure 'mvn verify' passes)


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-06-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033371#comment-16033371
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user majakabiljo commented on a diff in the pull request:

https://github.com/apache/giraph/pull/30#discussion_r119682567
  
--- Diff: giraph-core/src/main/java/org/apache/giraph/bsp/BspService.java 
---
@@ -272,7 +273,7 @@ public BspService(
 }
 if (LOG.isInfoEnabled()) {
   LOG.info("BspService: Connecting to ZooKeeper with job " + jobId +
-  ", " + getTaskPartition() + " on " + serverPortList);
+  ", " + getTaskId() + " on " + serverPortList);
--- End diff --

We need to set taskId before this


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components

2017-06-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033404#comment-16033404
 ] 

ASF GitHub Bot commented on GIRAPH-1148:


Github user majakabiljo commented on a diff in the pull request:

https://github.com/apache/giraph/pull/39#discussion_r119686989
  
--- Diff: 
giraph-block-app/src/main/java/org/apache/giraph/block_app/library/Pieces.java 
---
@@ -320,6 +325,91 @@ public String toString() {
   }
 
   /**
+   * Like reduceAndBroadcast, but uses array of handles for reducers and
+   * broadcasts, to make it feasible and performant when values are large.
+   * Each supplied value to reduce will be reduced in the handle defined by
+   * handleHashSupplier%numHandles
+   *
+   * @param  Single value type, objects passed on workers
+   * @param  Reduced value type
+   * @param  Vertex id type
+   * @param  Vertex value type
+   * @param  Edge value type
+   */
+  public static
+  
+  Piece reduceAndBroadcastWithArrayOfHandles(
+  final String name,
+  final int numHandles,
+  final ReduceOperation reduceOp,
+  final SupplierFromVertex handleHashSupplier,
+  final SupplierFromVertex valueSupplier,
+  final ConsumerWithVertex reducedValueConsumer) {
+return new Piece() {
+  protected ArrayOfHandles.ArrayOfReducers reducers;
+  protected BroadcastArrayHandle broadcasts;
+
+  private int getHandleIndex(Vertex vertex) {
+return (int) Math.abs(handleHashSupplier.get(vertex) % numHandles);
+  }
+
+  @Override
+  public void registerReducers(
+  final CreateReducersApi reduceApi, Object executionStage) {
+reducers = new ArrayOfHandles.ArrayOfReducers<>(
+numHandles,
+new Supplier>() {
+  @Override
+  public ReducerHandle get() {
+return reduceApi.createLocalReducer(reduceOp);
--- End diff --

Good catch, it didn't occur to me. I'll fix it not to reuse the same 
ReduceOperation object.


> Connected components - make calculate sizes work with large number of 
> components
> 
>
> Key: GIRAPH-1148
> URL: https://issues.apache.org/jira/browse/GIRAPH-1148
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>
> Currently if we have a graph with large number of connected components, 
> calculating connected components sizes fails because reducer becomes too 
> large. Use array of handles instead.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components

2017-06-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033539#comment-16033539
 ] 

ASF GitHub Bot commented on GIRAPH-1148:


Github user dlogothetis commented on a diff in the pull request:

https://github.com/apache/giraph/pull/39#discussion_r119708329
  
--- Diff: 
giraph-block-app-8/src/main/java/org/apache/giraph/block_app/library/prepare_graph/UndirectedConnectedComponents.java
 ---
@@ -352,10 +352,15 @@ Block calculateConnectedComponentSizes(
 Pair componentToReducePair = Pair.of(
 new LongWritable(), new LongWritable(1));
 LongWritable reusableLong = new LongWritable();
-return Pieces.reduceAndBroadcast(
-"CalcConnectedComponentSizes",
+// This reduce operation is stateless so we can use a single instance
+BasicMapReduce 
reduceOperation =
 new BasicMapReduce<>(
-LongTypeOps.INSTANCE, LongTypeOps.INSTANCE, SumReduce.LONG),
+LongTypeOps.INSTANCE, LongTypeOps.INSTANCE, SumReduce.LONG);
+return Pieces.reduceAndBroadcastWithArrayOfHandles(
+"CalcConnectedComponentSizes",
+3137, /* Just using some large prime number */
--- End diff --

Should this be configurable?


> Connected components - make calculate sizes work with large number of 
> components
> 
>
> Key: GIRAPH-1148
> URL: https://issues.apache.org/jira/browse/GIRAPH-1148
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>
> Currently if we have a graph with large number of connected components, 
> calculating connected components sizes fails because reducer becomes too 
> large. Use array of handles instead.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components

2017-06-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033817#comment-16033817
 ] 

ASF GitHub Bot commented on GIRAPH-1148:


Github user majakabiljo commented on a diff in the pull request:

https://github.com/apache/giraph/pull/39#discussion_r119744463
  
--- Diff: 
giraph-block-app-8/src/main/java/org/apache/giraph/block_app/library/prepare_graph/UndirectedConnectedComponents.java
 ---
@@ -352,10 +352,15 @@ Block calculateConnectedComponentSizes(
 Pair componentToReducePair = Pair.of(
 new LongWritable(), new LongWritable(1));
 LongWritable reusableLong = new LongWritable();
-return Pieces.reduceAndBroadcast(
-"CalcConnectedComponentSizes",
+// This reduce operation is stateless so we can use a single instance
+BasicMapReduce 
reduceOperation =
 new BasicMapReduce<>(
-LongTypeOps.INSTANCE, LongTypeOps.INSTANCE, SumReduce.LONG),
+LongTypeOps.INSTANCE, LongTypeOps.INSTANCE, SumReduce.LONG);
+return Pieces.reduceAndBroadcastWithArrayOfHandles(
+"CalcConnectedComponentSizes",
+3137, /* Just using some large prime number */
--- End diff --

I can't come up with a reason why someone would want to change it. This can 
start having problems only at trillion components which wouldn't work for many 
other reasons, for tiny ones this few reducers won't add any overhead, and for 
larger ones which were currently working this is still improvement since 
reducers are processed on many machines now.


> Connected components - make calculate sizes work with large number of 
> components
> 
>
> Key: GIRAPH-1148
> URL: https://issues.apache.org/jira/browse/GIRAPH-1148
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>
> Currently if we have a graph with large number of connected components, 
> calculating connected components sizes fails because reducer becomes too 
> large. Use array of handles instead.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components

2017-06-01 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16033823#comment-16033823
 ] 

ASF GitHub Bot commented on GIRAPH-1148:


Github user dlogothetis commented on a diff in the pull request:

https://github.com/apache/giraph/pull/39#discussion_r119745185
  
--- Diff: 
giraph-block-app-8/src/main/java/org/apache/giraph/block_app/library/prepare_graph/UndirectedConnectedComponents.java
 ---
@@ -352,10 +352,15 @@ Block calculateConnectedComponentSizes(
 Pair componentToReducePair = Pair.of(
 new LongWritable(), new LongWritable(1));
 LongWritable reusableLong = new LongWritable();
-return Pieces.reduceAndBroadcast(
-"CalcConnectedComponentSizes",
+// This reduce operation is stateless so we can use a single instance
+BasicMapReduce 
reduceOperation =
 new BasicMapReduce<>(
-LongTypeOps.INSTANCE, LongTypeOps.INSTANCE, SumReduce.LONG),
+LongTypeOps.INSTANCE, LongTypeOps.INSTANCE, SumReduce.LONG);
+return Pieces.reduceAndBroadcastWithArrayOfHandles(
+"CalcConnectedComponentSizes",
+3137, /* Just using some large prime number */
--- End diff --

Sounds good. Looks good then.


> Connected components - make calculate sizes work with large number of 
> components
> 
>
> Key: GIRAPH-1148
> URL: https://issues.apache.org/jira/browse/GIRAPH-1148
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>
> Currently if we have a graph with large number of connected components, 
> calculating connected components sizes fails because reducer becomes too 
> large. Use array of handles instead.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1148) Connected components - make calculate sizes work with large number of components

2017-06-02 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16035065#comment-16035065
 ] 

ASF GitHub Bot commented on GIRAPH-1148:


Github user asfgit closed the pull request at:

https://github.com/apache/giraph/pull/39


> Connected components - make calculate sizes work with large number of 
> components
> 
>
> Key: GIRAPH-1148
> URL: https://issues.apache.org/jira/browse/GIRAPH-1148
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Maja Kabiljo
>Assignee: Maja Kabiljo
>
> Currently if we have a graph with large number of connected components, 
> calculating connected components sizes fails because reducer becomes too 
> large. Use array of handles instead.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-06-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049379#comment-16049379
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user neggert commented on a diff in the pull request:

https://github.com/apache/giraph/pull/30#discussion_r122006839
  
--- Diff: giraph-core/src/main/java/org/apache/giraph/bsp/BspService.java 
---
@@ -272,7 +273,7 @@ public BspService(
 }
 if (LOG.isInfoEnabled()) {
   LOG.info("BspService: Connecting to ZooKeeper with job " + jobId +
-  ", " + getTaskPartition() + " on " + serverPortList);
+  ", " + getTaskId() + " on " + serverPortList);
--- End diff --

We can't actually set `taskId` before creating the ZK client, since we need 
ZK to get the get the application attempt. I changed this to log the partition 
instead, which we can get from `conf`.


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-06-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16049382#comment-16049382
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user neggert commented on the issue:

https://github.com/apache/giraph/pull/30
  
Fixed @majakabiljo's comments. @edunov 


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (GIRAPH-1151) Avoid message value factory initialization in Apache Giraph

2017-08-17 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16130051#comment-16130051
 ] 

ASF GitHub Bot commented on GIRAPH-1151:


GitHub user yukselakinci opened a pull request:

https://github.com/apache/giraph/pull/42

Avoid message value factory initialization in Apache Giraph

Jira ticket id: https://issues.apache.org/jira/browse/GIRAPH-1151

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yukselakinci/giraph avoidfactoryinit

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/giraph/pull/42.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #42


commit 1216cde8326753c5a6fafa315cab2d59c7c7d7d0
Author: Yuksel Akinci 
Date:   2017-08-16T23:31:51Z

Avoid message value factory initialization in Apache Giraph
Jira ticket id: https://issues.apache.org/jira/browse/GIRAPH-1151




> Avoid message value factory initialization in Apache Giraph
> ---
>
> Key: GIRAPH-1151
> URL: https://issues.apache.org/jira/browse/GIRAPH-1151
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Yuksel Akinci
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Messages in Giraph are instantiated using a "message value factory" class. 
> Currently the message value factory gets instantiated every time a message is 
> sent, which is unnecessary and can cause high overhead if the message value 
> factory constructor contains expensive operation. Factory objects are  be 
> saved to avoid repeated expensive object creations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (GIRAPH-1151) Avoid message value factory initialization in Apache Giraph

2017-08-22 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16137400#comment-16137400
 ] 

ASF GitHub Bot commented on GIRAPH-1151:


Github user yukselakinci closed the pull request at:

https://github.com/apache/giraph/pull/42


> Avoid message value factory initialization in Apache Giraph
> ---
>
> Key: GIRAPH-1151
> URL: https://issues.apache.org/jira/browse/GIRAPH-1151
> Project: Giraph
>  Issue Type: Improvement
>Reporter: Yuksel Akinci
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Messages in Giraph are instantiated using a "message value factory" class. 
> Currently the message value factory gets instantiated every time a message is 
> sent, which is unnecessary and can cause high overhead if the message value 
> factory constructor contains expensive operation. Factory objects are  be 
> saved to avoid repeated expensive object creations.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (GIRAPH-1139) Resuming from checkpoint doesn't work

2017-08-23 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/GIRAPH-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138529#comment-16138529
 ] 

ASF GitHub Bot commented on GIRAPH-1139:


Github user neggert commented on the issue:

https://github.com/apache/giraph/pull/30
  
Ping @edunov 


> Resuming from checkpoint doesn't work
> -
>
> Key: GIRAPH-1139
> URL: https://issues.apache.org/jira/browse/GIRAPH-1139
> Project: Giraph
>  Issue Type: Bug
>  Components: bsp
>Affects Versions: 1.2.0
>Reporter: Nic Eggert
>
> I ran into a couple of issues when trying to get Giraph to resume from 
> checkpoints (using mapreduce.max.attempts rather than GiraphJobRetryChecker).
> * If we just wrote a checkpoint, the master expects the workers to checkpoint 
> again, while the workers (correctly) clear the checkpointing flag.
> * When workers restart, they take their task id from the partition number, 
> which stays the same across multiple attempts. This gets transferred to the 
> Netty clientId, and the server starts ignoring messages from restarted 
> workers because it thinks it processed them already.
> I believe I've fixed these issues. I'll send a GitHub PR shortly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

1 2 3 >

1 - 100 of 260 matches

Mail list logo