[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794084#action_12794084 ]
BitsOfInfo commented on MAPREDUCE-1176:
---------------------------------------

Chris, thanks for the comments. To address these:

>> This should offer fixed key/value bytes, not just value bytes.
>> The value type should be BytesWritable, not Text.

To clarify: I'll modify this to change the KEY and VALUE to be <BytesWritable, BytesWritable>, then propose adding a config property to FixedLengthInputFormat to allow someone to configure the number of "prefix" bytes of each value (record) to use as the KEY, such as "mapreduce.input.fixedlengthinputformat.keyprefixcount" or "...keyprefixbytes", etc. Please send over any suggestions or other ideas. (A rough sketch of this idea follows at the end of this comment.)

>> The double arithmetic should be replaced by modular arithmetic.

Will change to:

{code}
// Determine the split size. It should be as close as possible to the
// default size, but should NOT split within a record... each split
// should contain a complete set of records, with the first record
// starting at the first byte in the split and the last record ending
// with the last byte in the split.
long splitSize = (defaultSize / recordLength) * recordLength;
{code}

Since both operands are longs, the integer division floors automatically; for example, with a 64MB default split size (67108864 bytes) and 100-byte records, splitSize comes out to 67108800.

>> isSplitable need only verify that the file is not compressed, not that
>> recordLength is sane.

Moving the record length config property validation to getSplits() instead, so isSplitable() only verifies that the file is not compressed (sketched below).

>> Reuse the key/value types - reading directly into them - rather than
>> allocating a new byte array for each record.

Will do (also sketched below).

>> Please clean up unused/overly general imports.

Will do.

>> Remove member fields that are not used after initialization.

I identified two of these in FixedLengthRecordReader; will remove them.
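For the key-prefix proposal above, here is a rough sketch of what I have in mind. This is only a sketch pending feedback: the "keyprefixbytes" property name, the keyPrefixBytes field, and the readRecordInto() helper (itself sketched further below) are proposals, not committed code.

{code}
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;

// Proposed (not final) name for the new config property:
public static final String KEY_PREFIX_BYTES =
    "mapreduce.input.fixedlengthinputformat.keyprefixbytes";

// One key object and one value object, reused for every record:
private final BytesWritable key = new BytesWritable();
private final BytesWritable value = new BytesWritable();
private int keyPrefixBytes; // read from the config in initialize()

public boolean nextKeyValue() throws IOException {
  if (!readRecordInto(value)) { // fills 'value' with one whole record
    return false;
  }
  // The key is the first keyPrefixBytes of the record (0 = empty key).
  key.set(value.getBytes(), 0, keyPrefixBytes);
  return true;
}
{code}

A job would then set it alongside the record length, e.g. myJobConf.setInt("mapreduce.input.fixedlengthinputformat.keyprefixbytes", 10);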
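For the isSplitable() change, I'm thinking something along these lines (compression check only, using the new mapreduce API's FileInputFormat signature):

{code}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.JobContext;

@Override
protected boolean isSplitable(JobContext context, Path file) {
  // A compressed stream cannot be split at arbitrary byte boundaries,
  // so only verify the file is not compressed; the record length
  // validation now happens in getSplits().
  CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
  return (codec == null);
}
{code}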
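And for reusing the value object rather than allocating a new byte array per record, the readRecordInto() helper referenced above might look roughly like this (fileIn, pos, and end are the usual record reader bookkeeping fields; the names are placeholders):

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;

private FSDataInputStream fileIn; // opened on the split's file
private long pos;                 // current byte offset in the file
private long end;                 // first byte past the end of this split
private int recordLength;         // from the record.length config property

private boolean readRecordInto(BytesWritable record) throws IOException {
  if (pos >= end) {
    return false; // splits always end on a record boundary, so we're done
  }
  // setSize() grows the backing buffer only when needed; after the first
  // record, every read goes directly into the same byte array.
  record.setSize(recordLength);
  IOUtils.readFully(fileIn, record.getBytes(), 0, recordLength);
  pos += recordLength;
  return true;
}
{code}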
> Contribution: FixedLengthInputFormat and FixedLengthRecordReader
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-1176
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1176
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>    Affects Versions: 0.20.1, 0.20.2
>         Environment: Any
>            Reporter: BitsOfInfo
>            Priority: Minor
>         Attachments: MAPREDUCE-1176-v1.patch, MAPREDUCE-1176-v2.patch, MAPREDUCE-1176-v3.patch
>
>
> Hello,
> I would like to contribute the following two classes for incorporation into the mapreduce.lib.input package. These two classes can be used when you need to read data from files containing fixed-length (fixed-width) records. Such files have no CR/LF (or any combination thereof) and no delimiters; each record is a fixed length, extra data is padded with spaces, and the data is one gigantic line within a file.
> Provided are two classes: FixedLengthInputFormat and its corresponding FixedLengthRecordReader. When creating a job that specifies this input format, the job must have the "mapreduce.input.fixedlengthinputformat.record.length" property set, as follows:
> myJobConf.setInt("mapreduce.input.fixedlengthinputformat.record.length", [myFixedRecordLength]);
> OR
> myJobConf.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, [myFixedRecordLength]);
> This input format overrides computeSplitSize() to ensure that InputSplits do not contain any partial records, since with fixed-length records there is no way to determine where a record begins if that were to occur. Each InputSplit passed to the FixedLengthRecordReader will start at the beginning of a record, and the last byte in the InputSplit will be the last byte of a record.
> The override of computeSplitSize() delegates to FileInputFormat's compute method, and then adjusts the returned split size as follows: (Math.floor(fileInputFormatsComputedSplitSize / fixedRecordLength) * fixedRecordLength)
> This suite of fixed-length input format classes does not support compressed files.