RE: MultiVertexInputFormat

2013-08-28 Thread Yasser Altowim
Thanks, Maja, for your response. That works, but as I mentioned, I had to
modify the implementation of MultiVertexInputFormat. I am posting my fix here
in case someone runs into a similar problem.

  @Override
  public VertexReader<I, V, E> createVertexReader(InputSplit inputSplit,
      TaskAttemptContext context) throws IOException {
    if (inputSplit instanceof InputSplitWithInputFormatIndex) {
      // When multithreaded input is used we need to make sure other threads
      // don't change context's configuration while we use it
      synchronized (context) {
        InputSplitWithInputFormatIndex split =
            (InputSplitWithInputFormatIndex) inputSplit;
        VertexInputFormat<I, V, E> vertexInputFormat =
            vertexInputFormats.get(split.getInputFormatIndex());
        VertexReader<I, V, E> vertexReader =
            vertexInputFormat.createVertexReader(split.getSplit(), context);
        return new WrappedVertexReader<I, V, E>(
            vertexReader, vertexInputFormat.getConf()) {
          @Override
          public void initialize(InputSplit inputSplit,
              TaskAttemptContext context) throws IOException,
              InterruptedException {
            // When multithreaded input is used we need to make sure other
            // threads don't change context's configuration while we use it
            synchronized (context) {
              super.initialize(inputSplit, context);
            }
          }
        };
      }
    } else {
      throw new IllegalStateException("createVertexReader: Got InputSplit " +
          "which was not created by this class: " +
          inputSplit.getClass().getName());
    }
  }

I changed the call super.initialize(inputSplit, context); above (highlighted
in red in the original message) to the following:
super.initialize(((InputSplitWithInputFormatIndex) inputSplit).getSplit(),
    context);
Without the cast, the wrapped reader's initialize is handed the outer
InputSplitWithInputFormatIndex rather than the underlying split it was
created from.
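Spliced back into the wrapper, the corrected override reads as follows (the
same code as above, with only that one call changed):

```java
@Override
public void initialize(InputSplit inputSplit,
    TaskAttemptContext context) throws IOException, InterruptedException {
  // Serialize on the context while multithreaded input is in use.
  synchronized (context) {
    // Unwrap so the underlying reader is initialized with the split it
    // was created from, not the InputSplitWithInputFormatIndex wrapper.
    super.initialize(((InputSplitWithInputFormatIndex) inputSplit).getSplit(),
        context);
  }
}
```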


Best,
Yasser

From: Maja Kabiljo [mailto:majakabi...@fb.com]
Sent: Wednesday, August 21, 2013 8:24 PM
To: user@giraph.apache.org
Subject: Re: MultiVertexInputFormat

Hi Yasser,

You can do this through the Configuration parameters. You should call:
description1.addParameter("myApplication.vertexInputPath", "file1.txt");
and
description2.addParameter("myApplication.vertexInputPath", "file2.txt");
Then from the code of your InputFormat class you can read this parameter from
the Configuration. If it doesn't already, make sure your InputFormat implements
ImmutableClassesGiraphConfigurable, and the configuration will be set on it
automatically.
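On the reading side, a minimal sketch of what that might look like inside the
input format (the base class, type parameters, and names here are placeholders
chosen for illustration, not anything Giraph mandates beyond what is described
above):

```java
import org.apache.giraph.io.formats.TextVertexInputFormat;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;

// Sketch: a vertex input format that reads its own input path back out of
// the configuration. "myApplication.vertexInputPath" is the example key
// from above; the class and member names are hypothetical.
public abstract class UseCase1FirstVertexInputFormat extends
    TextVertexInputFormat<LongWritable, DoubleWritable, NullWritable> {

  private String vertexInputPath;

  // Called from wherever the format first needs the path, e.g. when
  // computing splits. getConf() is available because the format is
  // ImmutableClassesGiraphConfigurable, so Giraph injects the
  // configuration into it automatically.
  protected String getVertexInputPath() {
    if (vertexInputPath == null) {
      vertexInputPath = getConf().get("myApplication.vertexInputPath");
    }
    return vertexInputPath;
  }
  // ... createVertexReader(...) etc. unchanged ...
}
```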

You can also take a look at HiveGiraphRunner, which uses multiple inputs and
sets the parameters the user passes on the command line.

Hope this helps,
Maja

From: Yasser Altowim yasser.alto...@ericsson.com
Reply-To: user@giraph.apache.org
Date: Monday, August 19, 2013 9:16 AM
To: user@giraph.apache.org
Subject: RE: MultiVertexInputFormat

Hi Guys,

Any help on this would be appreciated. I am repeating my question and my
code below:


I am implementing an algorithm in Giraph that reads the vertex values from two
input files, each with its own format. I am not using any EdgeInputFormatClass.
I am now using VertexInputFormatDescription along with MultiVertexInputFormat,
but I still could not figure out how to set the vertex input path for each
input format class. Can you please take a look at my code below and show me how
to set the vertex input path? I have taken a look at HiveGiraphRunner but still
no luck. Thanks.

if (null == getConf()) {
  conf = new Configuration();
}

GiraphConfiguration gconf = new GiraphConfiguration(getConf());
int workers = Integer.parseInt(arg0[2]);
gconf.setWorkerConfiguration(workers, workers, 100.0f);

List<VertexInputFormatDescription> vertexInputDescriptions =
    Lists.newArrayList();

// Input one
VertexInputFormatDescription description1 =
    new VertexInputFormatDescription(UseCase1FirstVertexInputFormat.class);
// how to set the vertex input path? i.e. how to say that I want to read
// file1.txt using this input format class
vertexInputDescriptions.add(description1);

// Input two
VertexInputFormatDescription description2 =
    new VertexInputFormatDescription(UseCase1SecondVertexInputFormat.class);
// how to set the vertex input path?
vertexInputDescriptions.add(description2);

GiraphConstants.VERTEX_INPUT_FORMAT_CLASS.set(gconf,
    MultiVertexInputFormat.class);

VertexInputFormatDescription.VERTEX_INPUT_FORMAT_DESCRIPTIONS.set(gconf,
    InputFormatDescription.toJsonString(vertexInputDescriptions));

gconf.setVertexOutputFormatClass(UseCase1OutputFormat.class);
gconf.setComputationClass(UseCase1Vertex.class);
GiraphJob job = new GiraphJob(gconf, "Use Case 1");
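Following Maja's reply earlier in the thread, the answer to the inline
questions is to attach each path to its description as a parameter and have
each input format read it back from its Configuration. A sketch of how that
slots into the code above (the key name is the hypothetical one from her
example):

```java
// Input one: tell this description which file its format should read.
VertexInputFormatDescription description1 =
    new VertexInputFormatDescription(UseCase1FirstVertexInputFormat.class);
description1.addParameter("myApplication.vertexInputPath", "file1.txt");
vertexInputDescriptions.add(description1);

// Input two: same pattern, with its own path.
VertexInputFormatDescription description2 =
    new VertexInputFormatDescription(UseCase1SecondVertexInputFormat.class);
description2.addParameter("myApplication.vertexInputPath", "file2.txt");
vertexInputDescriptions.add(description2);
```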

Out of memory with giraph-release-1.0.0-RC3, used to work on old Giraph

2013-08-28 Thread Jeff Peters
I am tasked with updating our ancient (circa 7/10/2012) Giraph to
giraph-release-1.0.0-RC3. Most jobs run fine but our largest job now runs
out of memory using the same AWS elastic-mapreduce configuration we have
always used. I have never tried to configure either Giraph or the AWS
Hadoop. We build for Hadoop 1.0.2 because that's closest to the 1.0.3 AWS
provides us. The 8 X m2.4xlarge cluster we use seems to provide 8*14=112
map tasks fitted out with 2GB heap each. Our code is completely unchanged
except as required to adapt to the new Giraph APIs. Our vertex, edge, and
message data are completely unchanged. On smaller jobs that do complete, the
aggregate heap usage high-water mark seems about the same as before, but the
committed heap runs higher. I can't even make it work on a cluster of 12; in
that case, one map task ends up with nearly twice as many messages as most of
the others, so it runs out of memory anyway, and it only takes one failed task
to fail the job. Am I missing something here? Should I be configuring the new
Giraph in some way I didn't need to with the old one?


Re: Out of memory with giraph-release-1.0.0-RC3, used to work on old Giraph

2013-08-28 Thread Avery Ching
Try dumping a histogram of memory usage from a running JVM and see where 
the memory is going.  I can't think of anything in particular that 
changed...
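For anyone else hitting this: one common way to get such a histogram from a
live map-task JVM is the JDK's jmap tool, assuming a full JDK is installed on
the task hosts (the pid placeholder below is whatever jps reports for the
child task process):

```shell
# Identify the map task's JVM, then dump a class histogram of its heap.
jps -lv                          # list Java processes with their main classes
jmap -histo <pid> | head -n 30   # top classes by instance count and bytes
jmap -histo:live <pid>           # same, but counts only live objects (forces a full GC)
```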


On 8/28/13 4:39 PM, Jeff Peters wrote:



Re: Out of memory with giraph-release-1.0.0-RC3, used to work on old Giraph

2013-08-28 Thread Jeff Peters
OK, thanks Avery, I'll try it. I'm not sure how I would do that on a running
AWS EMR instance, but I can do it on a local stand-alone Hadoop running a
smaller version of the job and see if anything jumps out.


On Wed, Aug 28, 2013 at 4:57 PM, Avery Ching ach...@apache.org wrote:
