[ https://issues.apache.org/jira/browse/MAPREDUCE-6635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Junping Du updated MAPREDUCE-6635:
----------------------------------
    Status: Patch Available  (was: Open)

> Unsafe long to int conversion in UncompressedSplitLineReader and IndexOutOfBoundsException
> ------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6635
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6635
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Junping Du
>         Attachments: MAPREDUCE-6635.patch
>
>
> LineRecordReader creates the unsplittable reader like so:
> {noformat}
> in = new UncompressedSplitLineReader(
>     fileIn, job, recordDelimiter, split.getLength());
> {noformat}
> The split length goes into
> {noformat}
> private long splitLength;
> {noformat}
> At some point while reading the first line, fillBuffer does this:
> {noformat}
> @Override
> protected int fillBuffer(InputStream in, byte[] buffer, boolean inDelimiter)
>     throws IOException {
>   int maxBytesToRead = buffer.length;
>   if (totalBytesRead < splitLength) {
>     maxBytesToRead = Math.min(maxBytesToRead,
>         (int)(splitLength - totalBytesRead));
> {noformat}
> For large splits the remaining length exceeds Integer.MAX_VALUE, so the int cast wraps to a negative number and the subsequent DFS read fails its bounds check:
> {noformat}
> java.lang.IndexOutOfBoundsException
>     at java.nio.Buffer.checkBounds(Buffer.java:559)
>     at java.nio.ByteBuffer.get(ByteBuffer.java:668)
>     at java.nio.DirectByteBuffer.get(DirectByteBuffer.java:279)
>     at org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:172)
>     at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:744)
>     at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:800)
>     at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:860)
>     at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)
>     at java.io.DataInputStream.read(DataInputStream.java:149)
>     at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.fillBuffer(UncompressedSplitLineReader.java:59)
>     at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:216)
>     at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
>     at org.apache.hadoop.mapreduce.lib.input.UncompressedSplitLineReader.readLine(UncompressedSplitLineReader.java:91)
>     at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.skipUtfByteOrderMark(LineRecordReader.java:144)
>     at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:184)
> {noformat}
> This has been reported here: https://issues.streamsets.com/browse/SDC-2229. It also happens in Hive when very large text files are forced to be read in a single split (e.g. via the header-skipping feature, or via set mapred.min.split.size=9999999999999999).
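For illustration only (this is not the attached MAPREDUCE-6635.patch, and the class name SplitLengthCastDemo is just a throwaway), here is a minimal standalone Java sketch of the overflow described above, assuming only that the remaining split length exceeds Integer.MAX_VALUE. Taking the min in long arithmetic and casting the already-bounded result keeps the value non-negative:

{noformat}
public class SplitLengthCastDemo {
  public static void main(String[] args) {
    long splitLength = 3L * 1024 * 1024 * 1024; // ~3 GB split, larger than Integer.MAX_VALUE
    long totalBytesRead = 0;                    // nothing consumed yet
    int bufferLen = 64 * 1024;                  // typical buffer size

    // Pattern from fillBuffer(): cast the long remainder to int, then take the min.
    // ~3 GB does not fit in an int, so the cast wraps to a negative value.
    int buggy = Math.min(bufferLen, (int) (splitLength - totalBytesRead));
    System.out.println("buggy maxBytesToRead   = " + buggy);   // prints a negative number

    // One possible guard: take the min in long arithmetic, cast only the bounded result.
    int guarded = (int) Math.min((long) bufferLen, splitLength - totalBytesRead);
    System.out.println("guarded maxBytesToRead = " + guarded); // prints 65536
  }
}
{noformat}

With the guarded form the cast can never see a value larger than buffer.length, so maxBytesToRead stays non-negative and the downstream bounds-check failure in the DFS read cannot be triggered this way.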