Hello.

I'm unsure whether this is a bug or an oversight, but since I haven't found 
any reference to it anywhere, I figured I'd bring it to light.

I've been using MultipleInputs for several of my MapReduce jobs, where I am 
joining together different forms of data. However, I have encountered the 
following exception with some uses of MultipleInputs in Hadoop 0.20.2:

java.lang.ArrayIndexOutOfBoundsException: 1
 at org.apache.hadoop.mapred.lib.MultipleInputs.getInputFormatMap(MultipleInputs.java:94)
 at org.apache.hadoop.mapred.lib.DelegatingInputFormat.getSplits(DelegatingInputFormat.java:51)
 at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
 at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
 at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
 at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)

After tracing through the source code, it appears that this occurs when the 
input path passed to MultipleInputs#addInputPath() contains a comma, which 
most often happens when using globs (for example, 
"/months/{March,April,May}.txt"). The comma is one of the two special 
delimiters used in MultipleInputs#getInputFormatMap(), so when the input 
format map is being built, the path-to-InputFormat data is parsed incorrectly.
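
To illustrate the failure, here is a minimal sketch in plain Java that imitates 
the encoding as I understand it. It is not the actual Hadoop source: the class 
and method names are hypothetical, and I am assuming that mappings are joined 
with ',' and that each path is separated from its format with ';'.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for the path/format encoding; NOT the real Hadoop code.
class DelimiterCollisionDemo {

    // Assumption: ',' separates mappings, ';' separates a path from its format.
    static String addMapping(String existing, String path, String format) {
        String mapping = path + ";" + format;
        return existing == null ? mapping : existing + "," + mapping;
    }

    // Splits the encoded string back into path -> format entries.
    static Map<String, String> parse(String encoded) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String mapping : encoded.split(",")) {
            String[] parts = mapping.split(";");
            // If a fragment has no ';', parts has length 1 and parts[1] throws.
            result.put(parts[0], parts[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        // A path without commas round-trips as expected.
        String ok = addMapping(null, "/months/March.txt", "TextInputFormat");
        System.out.println(parse(ok));

        // A glob with commas is cut into fragments before the ';' split.
        String bad = addMapping(null, "/months/{March,April,May}.txt",
                "TextInputFormat");
        try {
            parse(bad);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("Failed to parse: " + bad);
        }
    }
}
```

In this sketch the glob is split into "/months/{March", "April" and 
"May}.txt;TextInputFormat". The first fragment contains no ';', so reading 
its second element fails, which is consistent with the exception above.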

Could someone verify this behavior in other versions of Hadoop? And, possibly 
the more important question: should this actually be considered a bug in 
MultipleInputs?

Thanks.
