Re: Hadoop MRUnit
I think you mean: does MRUnit support the old mapred api? cdh3 includes both the old mapred api and the new mapreduce api. Yes, MRUnit supports both mapred and mapreduce. Use the classes in org.apache.hadoop.mrunit.{MapDriver, ReduceDriver, MapReduceDriver} for the old mapred api and the classes in org.apache.hadoop.mrunit.mapreduce.{MapDriver, ReduceDriver, MapReduceDriver} for the new mapreduce api (a minimal example against the old api is sketched at the end of this message). BCCing the hadoop user list.

On 08/06/2012 08:03 PM, Mohit Anchlia wrote:
Does MRUnit also work with the cdh3 version? We are currently using cdh3, but in the examples I see it's all using the new Context class.
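For example, a minimal test against the old mapred api might look like the sketch below (the mapper and inputs are arbitrary, and exact driver method names can differ slightly between MRUnit releases):

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mrunit.MapDriver;
    import org.junit.Test;

    public class IdentityMapperTest {

      @Test
      public void testIdentityMapper() throws Exception {
        // old mapred api: use org.apache.hadoop.mrunit.MapDriver, not
        // org.apache.hadoop.mrunit.mapreduce.MapDriver
        MapDriver<LongWritable, Text, LongWritable, Text> driver =
            new MapDriver<LongWritable, Text, LongWritable, Text>(
                new IdentityMapper<LongWritable, Text>());

        driver.withInput(new LongWritable(1), new Text("some line"))
              .withOutput(new LongWritable(1), new Text("some line"))
              .runTest();
      }
    }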
Re: Exec hadoop from Java, reuse JVM (client-side)?
Why would you call the hadoop script at all? Why not just call the part of the hadoop shell api you are trying to use directly from java (rough sketch at the end of this message)?

On 08/01/2012 07:37 PM, Keith Wiley wrote:
Hmmm, at first glance that does appear to be similar to my situation. I'll have to delve through it in detail to see if it squarely addresses (and fixes) my problem. Mine is sporadic and I suspect dependent on the current memory situation (it isn't a deterministic and guaranteed failure). I am not sure if that is true of the stackoverflow question you referenced... but it is certainly worth reading over. Thanks.

On Aug 1, 2012, at 15:34, Dhruv wrote:
Is this related? http://stackoverflow.com/questions/1124771/how-to-solve-java-io-ioexception-error-12-cannot-allocate-memory-calling-run

Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com
"It's a fine line between meticulous and obsessive-compulsive and a slippery rope between obsessive-compulsive and debilitatingly slow."
-- Keith Wiley
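As a concrete illustration of the suggestion above (the paths are made up), the same work the hadoop script does can usually be done in-process, either through the FileSystem api or by driving FsShell directly:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FsShell;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.util.ToolRunner;

    public class InProcessHadoop {

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // option 1: call the FileSystem api directly
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path("/tmp/local.txt"),
                             new Path("/user/someuser/remote.txt"));

        // option 2: run the same code "hadoop fs" would run, without a new JVM
        int rc = ToolRunner.run(new FsShell(conf),
                                new String[] { "-ls", "/user/someuser" });
        System.exit(rc);
      }
    }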
how to increment counters inside of InputFormat/RecordReader in mapreduce api?
In the old mapred api, getRecordReader was passed a Reporter which could then be passed on to the RecordReader, allowing a RecordReader to increment counters for different types of records, bad records, etc. In the new mapreduce api, createRecordReader only gets the InputSplit and the TaskAttemptContext, neither of which gives access to counters. Is there really no way to increment counters inside of a RecordReader or InputFormat in the mapreduce api?
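For comparison, this is roughly the pattern the old mapred api made possible (a sketch only; the counter names and the empty-record check here are invented for illustration):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.LineRecordReader;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class CountingTextInputFormat extends FileInputFormat<LongWritable, Text> {

      public RecordReader<LongWritable, Text> getRecordReader(
          InputSplit split, JobConf job, final Reporter reporter) throws IOException {
        final LineRecordReader lines = new LineRecordReader(job, (FileSplit) split);

        return new RecordReader<LongWritable, Text>() {
          public boolean next(LongWritable key, Text value) throws IOException {
            boolean more = lines.next(key, value);
            if (more) {
              // the Reporter captured from getRecordReader lets the
              // RecordReader increment counters as it reads
              reporter.incrCounter("CountingTextInputFormat", "RECORDS_READ", 1);
              if (value.getLength() == 0) {
                reporter.incrCounter("CountingTextInputFormat", "EMPTY_RECORDS", 1);
              }
            }
            return more;
          }
          public LongWritable createKey() { return lines.createKey(); }
          public Text createValue() { return lines.createValue(); }
          public long getPos() throws IOException { return lines.getPos(); }
          public float getProgress() throws IOException { return lines.getProgress(); }
          public void close() throws IOException { lines.close(); }
        };
      }
    }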
int read(byte buf[], int off, int len) violates api level contract when length is 0 at the end of a stream
The api contract on java.io.InputStream's public int read(byte[] b, int off, int len) says: "If len is zero, then no bytes are read and 0 is returned; otherwise, there is an attempt to read at least one byte. If no byte is available because the stream is at end of file, the value -1 is returned; otherwise, at least one byte is read and stored into b."

DFSInputStream in hadoop 1 and 2 returns 0 if len is 0 as long as the position is not currently at the end of the stream, but if the position is at the end of the stream and len is 0, read returns -1 instead of 0 because pos == getFileLength():

    /**
     * Read the entire buffer.
     */
    @Override
    public synchronized int read(byte buf[], int off, int len) throws IOException {
      checkOpen();
      if (closed) {
        throw new IOException("Stream closed");
      }
      failures = 0;
      if (pos < getFileLength()) {
        int retries = 2;
        while (retries > 0) {
          try {
            if (pos > blockEnd) {
              currentNode = blockSeekTo(pos);
            }
            int realLen = Math.min(len, (int) (blockEnd - pos + 1));
            int result = readBuffer(buf, off, realLen);

            if (result >= 0) {
              pos += result;
            } else {
              // got a EOS from reader though we expect more data on it.
              throw new IOException("Unexpected EOS from the reader");
            }
            if (stats != null && result != -1) {
              stats.incrementBytesRead(result);
            }
            return result;
          } catch (ChecksumException ce) {
            throw ce;
          } catch (IOException e) {
            if (retries == 1) {
              LOG.warn("DFS Read: " + StringUtils.stringifyException(e));
            }
            blockEnd = -1;
            if (currentNode != null) {
              addToDeadNodes(currentNode);
            }
            if (--retries == 0) {
              throw e;
            }
          }
        }
      }
      return -1;
    }
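A small sketch of how the difference shows up from the client side (the path is hypothetical and the file is assumed to exist and be non-empty):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ZeroLengthReadAtEof {

      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/tmp/some-existing-file");
        long fileLen = fs.getFileStatus(file).getLen();

        FSDataInputStream in = fs.open(file);
        byte[] buf = new byte[8];

        // not at end of stream: a zero-length read returns 0, per the contract
        System.out.println(in.read(buf, 0, 0));

        // at end of stream: the contract still calls for 0, but DFSInputStream
        // returns -1 because pos == getFileLength()
        in.seek(fileLen);
        System.out.println(in.read(buf, 0, 0));

        in.close();
      }
    }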
mapreduce.job.max.split.locations just a warning in hadoop 1.0.3 but not in 2.0.1-alpha?
    final int max_loc = conf.getInt(MAX_SPLIT_LOCATIONS, 10);
    if (locations.length > max_loc) {
      LOG.warn("Max block location exceeded for split: " + split +
               " splitsize: " + locations.length + " maxsize: " + max_loc);
      locations = Arrays.copyOf(locations, max_loc);
    }

I was wondering about the above code in JobSplitWriter in hadoop 1.0.3. The commit comment below is somewhat vague. I saw MAPREDUCE-1943 about setting limits to save memory on the jobtracker. I wanted to confirm that the above fix just serves as a warning and saves memory on the jobtracker, and does not cap the input at all, since most inputformats seem to ignore the locations. I also wanted to know why the recent MAPREDUCE-4146 added this cap to 2.0.1-alpha but with the original capping behavior of causing the job to fail by throwing an IOException, instead of just warning the user as the current code does.

commit 51be5c3d61cbc7960174493428fbaa41d5fbe84d
Author: Chris Douglas <cdoug...@apache.org>
Date: Fri Oct 1 01:49:51 2010 -0700

    Change client-side enforcement of limit on locations per split to be
    advisory. Truncate on client, optionally fail job at JobTracker if
    exceeded. Added mapreduce.job.max.split.locations property.

+++ b/YAHOO-CHANGES.txt
+    Change client-side enforcement of limit on locations per split
+    to be advisory. Truncate on client, optionally fail job at JobTracker if
+    exceeded. Added mapreduce.job.max.split.locations property. (cdouglas)
+
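For what it's worth, a job whose splits legitimately span many locations could presumably just raise the limit via the property named in the commit message (the value 50 below is arbitrary):

    import org.apache.hadoop.mapred.JobConf;

    public class RaiseSplitLocationLimit {

      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // default is 10 in the JobSplitWriter code quoted above
        conf.setInt("mapreduce.job.max.split.locations", 50);
        System.out.println(conf.getInt("mapreduce.job.max.split.locations", 10));
      }
    }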
how to rebalance individual data node?
Let's say that every node in your cluster has two same-sized disks, and on each node one disk is 50% full and the other is 100% full. According to my understanding of the balancer documentation, every data node will be at the average utilization of 75%, so no balancing will occur, yet one hard drive in each node is struggling at capacity. Is there any way to run the balancer just on a datanode to force each disk to be 75% full? Thanks
Re: cannot use a map side join to merge the output of multiple map side joins
I ended up just using a MultiNamedMultipleOutput with the dynamic part of the multi output name set to the partition number from one of the FileSplits inside of the CompositeInputSplit (rough sketch at the end of this message).

On 05/07/2012 11:19 AM, Robert Evans wrote:
I believe that you are correct about the split processing. It orders the splits by size so that the largest splits are processed first. This allows the smaller splits to potentially fill in the gaps. As far as a fix is concerned, I think overriding the file name in the file output committer is a much more straightforward solution to the issue.

--Bobby Evans

On 5/5/12 10:50 AM, Jim Donofrio <donofrio...@gmail.com> wrote:
I am trying to use a map side join to merge the output of multiple map side joins. This is failing because of the below code in JobClient.writeOldSplits which reorders the splits from largest to smallest. Why is that done? Is it so that the largest split, which will take the longest, gets processed first?

Each map side join then fails to name its part-* files with the same number as the incoming partition, so files named part-0 that go into the first map side join get output to part-00010, while another one of the first level map side joins sends files named part-0 to part-5. The second level map side join then does not get the input splits in partitioner order from each first level map side join output directory.

I can think of only 2 fixes: add some conf property to allow turning off the below sorting, OR extend FileOutputCommitter to rename the outputs of the first level map side join to merge_part-<the original partition number>. Any other solutions?

    // sort the splits into order based on size, so that the biggest
    // go first
    Arrays.sort(splits, new Comparator<org.apache.hadoop.mapred.InputSplit>() {
      public int compare(org.apache.hadoop.mapred.InputSplit a,
                         org.apache.hadoop.mapred.InputSplit b) {
        try {
          long left = a.getLength();
          long right = b.getLength();
          if (left == right) {
            return 0;
          } else if (left < right) {
            return 1;
          } else {
            return -1;
          }
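For reference, the MultipleOutputs workaround mentioned at the top of this message looks roughly like the sketch below (the class name is mine, and how the original partition number is recovered from the CompositeInputSplit is hidden behind a made-up "join.partition" property):

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.hadoop.mapred.lib.MultipleOutputs;

    public class MergeJoinMapper extends MapReduceBase
        implements Mapper<Text, Text, Text, Text> {

      private MultipleOutputs mos;
      private String partName;

      // in the driver: MultipleOutputs.addMultiNamedOutput(conf, "merge",
      //     org.apache.hadoop.mapred.TextOutputFormat.class, Text.class, Text.class);
      public void configure(JobConf job) {
        mos = new MultipleOutputs(job);
        // stand-in for whatever pulls the original partition number out of the
        // CompositeInputSplit's FileSplit
        partName = String.format("%05d", job.getInt("join.partition", 0));
      }

      @SuppressWarnings("unchecked")
      public void map(Text key, Text value, OutputCollector<Text, Text> output,
          Reporter reporter) throws IOException {
        // write to merge_<partition>-* so the second level join sees files
        // named in partitioner order instead of the default part-* names
        mos.getCollector("merge", partName, reporter).collect(key, value);
      }

      public void close() throws IOException {
        mos.close();
      }
    }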
can HADOOP-6546: BloomMapFile can return false negatives get backported to branch-1?
Can someone backport HADOOP-6546 (BloomMapFile can return false negatives) to branch-1 for the next 1.x release? Without this fix BloomMapFile is somewhat useless, because having no false negatives is a core feature of Bloom filters. I am surprised that both hadoop 1.0.2 and cdh3u3 do not have this fix from over 2 years ago.
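To make the impact concrete, here is a rough sketch (the path is made up) of the lookup pattern that the bug breaks: BloomMapFile.Reader.get consults the bloom filter before touching the index, so a false negative makes it return null for a key that was actually written.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.BloomMapFile;
    import org.apache.hadoop.io.Text;

    public class BloomMapFileCheck {

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        String dir = "/tmp/bloom-map-file-test";

        BloomMapFile.Writer writer =
            new BloomMapFile.Writer(conf, fs, dir, Text.class, Text.class);
        writer.append(new Text("key1"), new Text("value1"));
        writer.close();

        BloomMapFile.Reader reader = new BloomMapFile.Reader(fs, dir, conf);
        // without the HADOOP-6546 fix this can print null even though "key1"
        // was just written above
        System.out.println(reader.get(new Text("key1"), new Text()));
        reader.close();
      }
    }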
cannot use a map side join to merge the output of multiple map side joins
I am trying to use a map side join to merge the output of multiple map side joins. This is failing because of the below code in JobClient.writeOldSplits which reorders the splits from largest to smallest. Why is that done? Is it so that the largest split, which will take the longest, gets processed first?

Each map side join then fails to name its part-* files with the same number as the incoming partition, so files named part-0 that go into the first map side join get output to part-00010, while another one of the first level map side joins sends files named part-0 to part-5. The second level map side join then does not get the input splits in partitioner order from each first level map side join output directory.

I can think of only 2 fixes: add some conf property to allow turning off the below sorting, OR extend FileOutputCommitter to rename the outputs of the first level map side join to merge_part-<the original partition number>. Any other solutions?

    // sort the splits into order based on size, so that the biggest
    // go first
    Arrays.sort(splits, new Comparator<org.apache.hadoop.mapred.InputSplit>() {
      public int compare(org.apache.hadoop.mapred.InputSplit a,
                         org.apache.hadoop.mapred.InputSplit b) {
        try {
          long left = a.getLength();
          long right = b.getLength();
          if (left == right) {
            return 0;
          } else if (left < right) {
            return 1;
          } else {
            return -1;
          }
why does Text.setCapacity not double the array size as in most dynamic array implementations?
    private void setCapacity(int len, boolean keepData) {
      if (bytes == null || bytes.length < len) {
        byte[] newBytes = new byte[len];
        if (bytes != null && keepData) {
          System.arraycopy(bytes, 0, newBytes, 0, length);
        }
        bytes = newBytes;
      }
    }

Why does Text.setCapacity only expand the array to the length of the new data? Why not instead set the length to double the newly requested length, or to 3/2 times the existing length as in ArrayList, so that the array grows exponentially as in most dynamic array implementations?
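For comparison, a growth strategy along the lines of what I am asking about might look like the sketch below (this is just an illustration, not the actual HADOOP-6109 patch):

    private void setCapacity(int len, boolean keepData) {
      if (bytes == null || bytes.length < len) {
        // grow geometrically instead of to the exact requested length so that
        // repeated appends do not reallocate and copy the array on every call
        int newLength = Math.max(len, bytes == null ? 0 : bytes.length * 3 / 2);
        byte[] newBytes = new byte[newLength];
        if (bytes != null && keepData) {
          System.arraycopy(bytes, 0, newBytes, 0, length);
        }
        bytes = newBytes;
      }
    }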
Re: why does Text.setCapacity not double the array size as in most dynamic array implementations?
Sorry, I just stumbled across HADOOP-6109 which made this change in trunk; I was looking at the Text in 1.0.2. Can't this fix get backported to the Hadoop 1 versions?

On 04/24/2012 11:01 PM, Jim Donofrio wrote:

    private void setCapacity(int len, boolean keepData) {
      if (bytes == null || bytes.length < len) {
        byte[] newBytes = new byte[len];
        if (bytes != null && keepData) {
          System.arraycopy(bytes, 0, newBytes, 0, length);
        }
        bytes = newBytes;
      }
    }

Why does Text.setCapacity only expand the array to the length of the new data? Why not instead set the length to double the newly requested length, or to 3/2 times the existing length as in ArrayList, so that the array grows exponentially as in most dynamic array implementations?