Re: Hadoop MRUnit

2012-08-06 Thread Jim Donofrio
I think you mean: does MRUnit support the old mapred API? cdh3 includes
both the old mapred API and the new mapreduce API. Yes, MRUnit supports both
mapred and mapreduce. Use the classes in
org.apache.hadoop.mrunit.{MapDriver, ReduceDriver, MapReduceDriver} for
the old mapred API and the classes in
org.apache.hadoop.mrunit.mapreduce.{MapDriver, ReduceDriver,
MapReduceDriver} for the new mapreduce API.
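For example, a minimal new-API test could look like the sketch below (assuming MRUnit 0.9-style builder methods and a hypothetical WordCountMapper that emits (word, 1); for the old mapred API you would swap the MapDriver import as noted in the comments):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;  // driver for the new mapreduce api
// import org.apache.hadoop.mrunit.MapDriver;         // driver for the old mapred api
import org.junit.Test;

public class WordCountMapperTest {
  @Test
  public void emitsOneCountPerWord() throws Exception {
    // WordCountMapper is hypothetical: a new-API Mapper<LongWritable, Text, Text, IntWritable>
    MapDriver<LongWritable, Text, Text, IntWritable> driver =
        new MapDriver<LongWritable, Text, Text, IntWritable>()
            .withMapper(new WordCountMapper());
    driver.withInput(new LongWritable(0), new Text("hadoop"))
          .withOutput(new Text("hadoop"), new IntWritable(1))
          .runTest();
  }
}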


BCCing hadoop user list.

On 08/06/2012 08:03 PM, Mohit Anchlia wrote:

Does MRUnit also work with the cdh3 version? We are currently using cdh3, but
the examples I see all use the new Context class.





Re: Exec hadoop from Java, reuse JVM (client-side)?

2012-08-01 Thread Jim Donofrio
Why would you call the hadoop script? Why not just call the part of the
hadoop shell API you are trying to use directly from Java?
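As a rough illustration of what I mean, a minimal sketch (assuming you were shelling out to something like hadoop fs -copyFromLocal; both paths here are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
  public static void main(String[] args) throws Exception {
    // picks up core-site.xml / hdfs-site.xml from the classpath, same as the hadoop script
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("/tmp/local.txt"),          // hypothetical local path
                         new Path("/user/me/remote.txt"));    // hypothetical HDFS path
    fs.close();
  }
}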



On 08/01/2012 07:37 PM, Keith Wiley wrote:

Hmmm, at first glance that does appear to be similar to my situation.  I'll 
have to delve through it in detail to see if it squarely addresses (and fixes) 
my problem.  Mine is sporadic and I suspect dependent on the current memory 
situation (it isn't a deterministic and guaranteed failure).  I am not sure if 
that is true of the stackoverflow question you referenced...but it is certainly 
worth reading over.

Thanks.

On Aug 1, 2012, at 15:34 , Dhruv wrote:


Is this related?

http://stackoverflow.com/questions/1124771/how-to-solve-java-io-ioexception-error-12-cannot-allocate-memory-calling-run



Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

"It's a fine line between meticulous and obsessive-compulsive and a slippery
rope between obsessive-compulsive and debilitatingly slow."
                                           --  Keith Wiley







how to increment counters inside of InputFormat/RecordReader in mapreduce api?

2012-07-30 Thread Jim Donofrio
In the mapred API, getRecordReader was passed a Reporter, which could then
be passed to the RecordReader to let it increment counters for different
types of records, bad records, etc. In the new mapreduce API,
createRecordReader only gets the InputSplit and the TaskAttemptContext,
neither of which has access to counters.


Is there really no way to increment counters inside of a RecordReader or
InputFormat in the mapreduce API?
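For reference, this is roughly the old-API pattern I mean; the sketch below is made up (CountingInputFormat, the counter names, and the empty-line check are just examples), but the point is that the Reporter handed to getRecordReader is the hook for counters:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class CountingInputFormat extends FileInputFormat<LongWritable, Text> {
  @Override
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf job, final Reporter reporter) throws IOException {
    final RecordReader<LongWritable, Text> inner =
        new LineRecordReader(job, (FileSplit) split);
    // wrap the real reader so it can bump counters through the Reporter
    return new RecordReader<LongWritable, Text>() {
      public boolean next(LongWritable key, Text value) throws IOException {
        boolean more = inner.next(key, value);
        if (more && value.getLength() == 0) {
          // the Reporter is the only handle on counters in the old mapred api
          reporter.incrCounter("MyRecords", "EMPTY_LINES", 1);
        }
        return more;
      }
      public LongWritable createKey() { return inner.createKey(); }
      public Text createValue() { return inner.createValue(); }
      public long getPos() throws IOException { return inner.getPos(); }
      public float getProgress() throws IOException { return inner.getProgress(); }
      public void close() throws IOException { inner.close(); }
    };
  }
}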


int read(byte buf[], int off, int len) violates api level contract when length is 0 at the end of a stream

2012-07-23 Thread Jim Donofrio

API contract for Java's public int read(byte[] b, int off, int len):
If len is zero, then no bytes are read and 0 is returned; otherwise, 
there is an attempt to read at least one byte. If no byte is available 
because the stream is at end of file, the value -1 is returned; 
otherwise, at least one byte is read and stored into b.


DFSInputStream in hadoop 1 and 2 returns 0 when len is 0 as long as the
position is not currently at the end of the stream, but if the position is
at the end of the stream and len is 0, read returns -1 instead of 0 because
pos == getFileLength().


/**
 * Read the entire buffer.
 */
@Override
public synchronized int read(byte buf[], int off, int len) throws IOException {
  checkOpen();
  if (closed) {
    throw new IOException("Stream closed");
  }
  failures = 0;

  if (pos < getFileLength()) {
    int retries = 2;
    while (retries > 0) {
      try {
        if (pos > blockEnd) {
          currentNode = blockSeekTo(pos);
        }
        int realLen = Math.min(len, (int) (blockEnd - pos + 1));
        int result = readBuffer(buf, off, realLen);

        if (result >= 0) {
          pos += result;
        } else {
          // got a EOS from reader though we expect more data on it.
          throw new IOException("Unexpected EOS from the reader");
        }
        if (stats != null && result != -1) {
          stats.incrementBytesRead(result);
        }
        return result;
      } catch (ChecksumException ce) {
        throw ce;
      } catch (IOException e) {
        if (retries == 1) {
          LOG.warn("DFS Read: " + StringUtils.stringifyException(e));
        }
        blockEnd = -1;
        if (currentNode != null) { addToDeadNodes(currentNode); }
        if (--retries == 0) {
          throw e;
        }
      }
    }
  }
  return -1;
}
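
For concreteness, a driver like the sketch below shows the mismatch (the path and FileSystem setup are hypothetical): per the java.io.InputStream contract, a zero-length read should return 0 even at end of stream, while the code above falls through to return -1 once pos == getFileLength().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ZeroLengthReadAtEof {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataInputStream in = fs.open(new Path("/user/me/some-file")); // hypothetical path
    byte[] buf = new byte[4096];
    while (in.read(buf, 0, buf.length) != -1) { } // drain the stream to end of file
    // the InputStream javadoc says this should print 0; DFSInputStream returns -1 here
    System.out.println("read(buf, 0, 0) at EOF returned " + in.read(buf, 0, 0));
    in.close();
  }
}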


mapreduce.job.max.split.locations just a warning in hadoop 1.0.3 but not in 2.0.1-alpha?

2012-06-05 Thread Jim Donofrio

final int max_loc = conf.getInt(MAX_SPLIT_LOCATIONS, 10);
if (locations.length > max_loc) {
  LOG.warn("Max block location exceeded for split: "
      + split + " splitsize: " + locations.length +
      " maxsize: " + max_loc);
  locations = Arrays.copyOf(locations, max_loc);
}

I was wondering about the above code in JobSplitWriter in hadoop 1.0.3.
The commit comment below is somewhat vague. I saw MAPREDUCE-1943 about
setting limits to save memory on the jobtracker. I wanted to confirm that
the above fix only serves as a warning and saves memory on the jobtracker,
and does not cap the input at all, since most InputFormats seem to ignore
the locations. Is that right?


I also wanted to know why the recent MAPREDUCE-4146 added this cap to
2.0.1-alpha, but with the original capping behavior of failing the job by
throwing an IOException, instead of just warning the user as the current
1.0.3 code does.
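If the 2.0.1-alpha behavior really does fail the job, the only per-job knob I can see is raising the limit; a minimal sketch (Hadoop 2-style Job.getInstance, and the value 30 is arbitrary):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RaiseSplitLocationLimit {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // property name comes from the commit message quoted below; 30 is just an example
    conf.setInt("mapreduce.job.max.split.locations", 30);
    Job job = Job.getInstance(conf, "example-job");
    // ... set input/output formats and paths, then submit as usual
  }
}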



commit 51be5c3d61cbc7960174493428fbaa41d5fbe84d
Author: Chris Douglas cdoug...@apache.org
Date:   Fri Oct 1 01:49:51 2010 -0700

    Change client-side enforcement of limit on locations per split to be
    advisory. Truncate on client, optionally fail job at JobTracker if
    exceeded. Added mapreduce.job.max.split.locations property.

+++ b/YAHOO-CHANGES.txt
+    Change client-side enforcement of limit on locations per split to be
+    advisory. Truncate on client, optionally fail job at JobTracker if
+    exceeded. Added mapreduce.job.max.split.locations property. (cdouglas)
+



how to rebalance individual data node?

2012-05-18 Thread Jim Donofrio
Let's say that every node in your cluster has 2 same-sized disks, and on each
node one disk is 50% full and the other is 100% full. According to my
understanding of the balancer documentation, all data nodes will be at the
average utilization of 75%, so no balancing will occur, yet one hard drive in
each node is struggling at capacity. Is there any way to run the balancer
just on a datanode to force each disk to be 75% full?


Thanks


Re: cannot use a map side join to merge the output of multiple map side joins

2012-05-07 Thread Jim Donofrio
I ended up just using a MultiNamedMultipleOutput, with the dynamic part of
the output name set to the partition number from one of the FileSplits
inside the CompositeInputSplit.
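In case it helps anyone else, the shape of it is roughly the sketch below. This is a reconstruction, not the exact code: the key/value types, the "merge" output name, the part-file naming convention, and the job-setup call to MultipleOutputs.addMultiNamedOutput are all assumptions.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.join.CompositeInputSplit;
import org.apache.hadoop.mapred.join.TupleWritable;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class MergeJoinMapper extends MapReduceBase
    implements Mapper<Text, TupleWritable, Text, Text> {

  private MultipleOutputs mos;

  @Override
  public void configure(JobConf job) {
    // job setup elsewhere: MultipleOutputs.addMultiNamedOutput(job, "merge", ...)
    mos = new MultipleOutputs(job);
  }

  @Override
  public void map(Text key, TupleWritable value,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    // the split for a map side join is a CompositeInputSplit; grab one of its FileSplits
    CompositeInputSplit composite = (CompositeInputSplit) reporter.getInputSplit();
    FileSplit first = (FileSplit) composite.get(0);
    // e.g. "part-00003" -> "00003"; the naming convention is an assumption
    String partition = first.getPath().getName().replaceAll("\\D", "");
    // use the incoming partition number as the dynamic part of the multi named output
    mos.getCollector("merge", partition, reporter)
       .collect(key, new Text(value.get(0).toString()));
  }

  @Override
  public void close() throws IOException {
    mos.close();
  }
}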


On 05/07/2012 11:19 AM, Robert Evans wrote:

I believe that you are correct about the split processing. It orders the
splits by size so that the largest splits are processed first. This allows
the smaller splits to potentially fill in the gaps. As far as a fix is
concerned, I think overriding the file name in the file output committer is
a much more straightforward solution to the issue.

--Bobby Evans

On 5/5/12 10:50 AM, Jim Donofriodonofrio...@gmail.com  wrote:

I am trying to use a map side join to merge the output of multiple map
side joins. This is failing because of the code below in
JobClient.writeOldSplits, which reorders the splits from largest to
smallest. Why is that done? Is it so that the largest split, which will
take the longest, gets processed first?

Each map side join then fails to name its part-* files with the same
number as the incoming partition, so files named part-0 that go into the
first map side join get output to part-00010, while another of the
first-level map side joins sends files named part-0 to part-5. The
second-level map side join then does not get the input splits in
partitioner order from each first-level map side join's output directory.

I can think of only 2 fixes: add some conf property to allow turning off
the sorting below, OR extend FileOutputCommitter to rename the outputs of
the first-level map side join to merge_part-<the original partition
number>. Any other solutions?

      // sort the splits into order based on size, so that the biggest
      // go first
      Arrays.sort(splits, new Comparator<org.apache.hadoop.mapred.InputSplit>() {
        public int compare(org.apache.hadoop.mapred.InputSplit a,
                           org.apache.hadoop.mapred.InputSplit b) {
          try {
            long left = a.getLength();
            long right = b.getLength();
            if (left == right) {
              return 0;
            } else if (left < right) {
              return 1;
            } else {
              return -1;
            }




can HADOOP-6546: BloomMapFile can return false negatives get backported to branch-1?

2012-05-07 Thread Jim Donofrio
Can someone backport HADOOP-6546 (BloomMapFile can return false negatives)
to branch-1 for the next 1+ release?


Without this fix, BloomMapFile is somewhat useless, because having no
false negatives is a core feature of Bloom filters. I am surprised that
both hadoop 1.0.2 and cdh3u3 still lack this fix from over 2 years ago.
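To make the impact concrete, a typical lookup pattern looks something like the sketch below (the path and key are hypothetical): probablyHasKey() is what lets callers skip the full get(), so a false negative silently hides a key that is actually present.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.BloomMapFile;
import org.apache.hadoop.io.Text;

public class BloomLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    BloomMapFile.Reader reader =
        new BloomMapFile.Reader(fs, "/user/me/bloommap", conf); // hypothetical path
    Text key = new Text("some-key");
    Text value = new Text();
    // if probablyHasKey() can return a false negative, this guard skips keys
    // that are actually present, which is exactly what HADOOP-6546 fixes
    if (reader.probablyHasKey(key)) {
      reader.get(key, value);
    }
    reader.close();
  }
}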


cannot use a map side join to merge the output of multiple map side joins

2012-05-05 Thread Jim Donofrio
I am trying to use a map side join to merge the output of multiple map
side joins. This is failing because of the code below in
JobClient.writeOldSplits, which reorders the splits from largest to
smallest. Why is that done? Is it so that the largest split, which will
take the longest, gets processed first?


Each map side join then fails to name its part-* files with the same
number as the incoming partition, so files named part-0 that go into the
first map side join get output to part-00010, while another of the
first-level map side joins sends files named part-0 to part-5. The
second-level map side join then does not get the input splits in
partitioner order from each first-level map side join's output directory.


I can think of only 2 fixes: add some conf property to allow turning off
the sorting below, OR extend FileOutputCommitter to rename the outputs of
the first-level map side join to merge_part-<the original partition
number>. Any other solutions?


    // sort the splits into order based on size, so that the biggest
    // go first
    Arrays.sort(splits, new Comparator<org.apache.hadoop.mapred.InputSplit>() {
      public int compare(org.apache.hadoop.mapred.InputSplit a,
                         org.apache.hadoop.mapred.InputSplit b) {
        try {
          long left = a.getLength();
          long right = b.getLength();
          if (left == right) {
            return 0;
          } else if (left < right) {
            return 1;
          } else {
            return -1;
          }


why does Text.setCapacity not double the array size as in most dynamic array implementations?

2012-04-24 Thread Jim Donofrio

  private void setCapacity(int len, boolean keepData) {
    if (bytes == null || bytes.length < len) {
      byte[] newBytes = new byte[len];
      if (bytes != null && keepData) {
        System.arraycopy(bytes, 0, newBytes, 0, length);
      }
      bytes = newBytes;
    }
  }

Why does Text.setCapacity only expand the array to the length of the new
data? Why not instead set the capacity to double the requested length, or to
3/2 times the existing length as in ArrayList, so that the array grows
exponentially as in most dynamic array implementations?
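For comparison, a geometric growth policy, in a standalone sketch of the general idea (this is not Hadoop's actual code), would look something like this:

public class GrowableBuffer {
  private byte[] bytes;
  private int length;  // number of valid bytes, mirroring Text.length

  private void setCapacity(int len, boolean keepData) {
    if (bytes == null || bytes.length < len) {
      // grow to at least len, but never by less than 2x the current capacity,
      // so repeated appends copy O(n) bytes total instead of O(n^2)
      int newCapacity = (bytes == null) ? len : Math.max(len, bytes.length << 1);
      byte[] newBytes = new byte[newCapacity];
      if (bytes != null && keepData) {
        System.arraycopy(bytes, 0, newBytes, 0, length);
      }
      bytes = newBytes;
    }
  }

  public void append(byte[] src, int off, int len) {
    setCapacity(length + len, true);
    System.arraycopy(src, off, bytes, length, len);
    length += len;
  }
}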




Re: why does Text.setCapacity not double the array size as in most dynamic array implementations?

2012-04-24 Thread Jim Donofrio
Sorry, I just stumbled across HADOOP-6109, which made this change in trunk;
I was looking at the Text in 1.0.2. Can this fix get backported to the
Hadoop 1 versions?


On 04/24/2012 11:01 PM, Jim Donofrio wrote:

  private void setCapacity(int len, boolean keepData) {
    if (bytes == null || bytes.length < len) {
      byte[] newBytes = new byte[len];
      if (bytes != null && keepData) {
        System.arraycopy(bytes, 0, newBytes, 0, length);
      }
      bytes = newBytes;
    }
  }

Why does Text.setCapacity only expand the array to the length of the new
data? Why not instead set the capacity to double the requested length, or to
3/2 times the existing length as in ArrayList, so that the array grows
exponentially as in most dynamic array implementations?