[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794084#action_12794084 ]
BitsOfInfo commented on MAPREDUCE-1176:
---------------------------------------

Chris, thanks for the comments. To address these:

>> This should offer fixed key/value bytes, not just value bytes.
>> The value type should be BytesWritable, not Text.

To clarify: I'll modify this to change the KEY and VALUE to be <BytesWritable, BytesWritable>, then propose adding a config property to FixedLengthInputFormat to allow someone to configure the number of "prefix" bytes of each value (record) to use as the KEY, such as "mapreduce.input.fixedlengthinputformat.keyprefixcount" or "...keyprefixbytes", etc. Please send over any suggestions or other ideas. (A rough sketch of this idea follows at the end of this comment.)

>> The double arithmetic should be replaced by modular arithmetic.

Will change to:

{code}
// Determine the split size. It should be as close as possible to the
// default size, but should NOT split within a record... each split
// should contain a complete set of records, with the first record
// starting at the first byte in the split and the last record ending
// with the last byte in the split.
long splitSize = (defaultSize / recordLength) * recordLength;
{code}

Since both operands are longs, the integer division floors automatically; for example, with a 64MB default split size (67108864 bytes) and 100-byte records, splitSize comes out to 67108800.

>> isSplitable need only verify that the file is not compressed, not that
>> recordLength is sane.

Moving the record length config property validation to getSplits() instead, so isSplitable() only verifies that the file is not compressed (sketched below).

>> Reuse the key/value types - reading directly into them - rather than
>> allocating a new byte array for each record.

Will do (also sketched below).

>> Please clean up unused/overly general imports.

Will do.

>> Remove member fields that are not used after initialization.

I identified two of these in FixedLengthRecordReader; will remove them.
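For the key-prefix proposal above, here is a rough sketch of what I have in mind. This is only a sketch pending feedback: the "keyprefixbytes" property name, the keyPrefixBytes field, and the readRecordInto() helper (itself sketched further below) are proposals, not committed code.

{code}
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;

// Proposed (not final) name for the new config property:
public static final String KEY_PREFIX_BYTES =
    "mapreduce.input.fixedlengthinputformat.keyprefixbytes";

// One key object and one value object, reused for every record:
private final BytesWritable key = new BytesWritable();
private final BytesWritable value = new BytesWritable();
private int keyPrefixBytes; // read from the config in initialize()

public boolean nextKeyValue() throws IOException {
  if (!readRecordInto(value)) { // fills 'value' with one whole record
    return false;
  }
  // The key is the first keyPrefixBytes of the record (0 = empty key).
  key.set(value.getBytes(), 0, keyPrefixBytes);
  return true;
}
{code}

A job would then set it alongside the record length, e.g. myJobConf.setInt("mapreduce.input.fixedlengthinputformat.keyprefixbytes", 10);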
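For the isSplitable() change, I'm thinking something along these lines (compression check only, using the new mapreduce API's FileInputFormat signature):

{code}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.JobContext;

@Override
protected boolean isSplitable(JobContext context, Path file) {
  // A compressed stream cannot be split at arbitrary byte boundaries,
  // so only verify the file is not compressed; the record length
  // validation now happens in getSplits().
  CompressionCodec codec =
      new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
  return (codec == null);
}
{code}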
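And for reusing the value object rather than allocating a new byte array per record, the readRecordInto() helper referenced above might look roughly like this (fileIn, pos, and end are the usual record reader bookkeeping fields; the names are placeholders):

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;

private FSDataInputStream fileIn; // opened on the split's file
private long pos;                 // current byte offset in the file
private long end;                 // first byte past the end of this split
private int recordLength;         // from the record.length config property

private boolean readRecordInto(BytesWritable record) throws IOException {
  if (pos >= end) {
    return false; // splits always end on a record boundary, so we're done
  }
  // setSize() grows the backing buffer only when needed; after the first
  // record, every read goes directly into the same byte array.
  record.setSize(recordLength);
  IOUtils.readFully(fileIn, record.getBytes(), 0, recordLength);
  pos += recordLength;
  return true;
}
{code}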
> Contribution: FixedLengthInputFormat and FixedLengthRecordReader
> ----------------------------------------------------------------
>
>                 Key: MAPREDUCE-1176
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1176
>             Project: Hadoop Map/Reduce
>          Issue Type: New Feature
>    Affects Versions: 0.20.1, 0.20.2
>         Environment: Any
>            Reporter: BitsOfInfo
>            Priority: Minor
>         Attachments: MAPREDUCE-1176-v1.patch, MAPREDUCE-1176-v2.patch, MAPREDUCE-1176-v3.patch
>
>
> Hello,
> I would like to contribute the following two classes for incorporation into the mapreduce.lib.input package. These two classes can be used when you need to read data from files containing fixed-length (fixed-width) records. Such files have no CR/LF (or any combination thereof) and no delimiters; each record is a fixed length, extra data is padded with spaces, and the data is one gigantic line within a file.
> Provided are two classes: FixedLengthInputFormat and its corresponding FixedLengthRecordReader. When creating a job that specifies this input format, the job must have the "mapreduce.input.fixedlengthinputformat.record.length" property set, as follows:
> myJobConf.setInt("mapreduce.input.fixedlengthinputformat.record.length", [myFixedRecordLength]);
> OR
> myJobConf.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, [myFixedRecordLength]);
> This input format overrides computeSplitSize() to ensure that InputSplits do not contain any partial records, since with fixed-length records there is no way to determine where a record begins if that were to occur. Each InputSplit passed to the FixedLengthRecordReader will start at the beginning of a record, and the last byte in the InputSplit will be the last byte of a record.
> The override of computeSplitSize() delegates to FileInputFormat's compute method, and then adjusts the returned split size as follows: (Math.floor(fileInputFormatsComputedSplitSize / fixedRecordLength) * fixedRecordLength)
> This suite of fixed-length input format classes does not support compressed files.