[jira] [Commented] (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819061#comment-13819061 ]

Mariappan Asokan commented on MAPREDUCE-1176:
---------------------------------------------

Hi Sandy,
  Thanks for reviewing the patch. I have followed all your suggestions and uploaded a new patch. Please review it. By the way, the seed for the random number generator is already logged. Am I missing something? Please let me know.

-- Asokan

Contribution: FixedLengthInputFormat and FixedLengthRecordReader
----------------------------------------------------------------

                Key: MAPREDUCE-1176
                URL: https://issues.apache.org/jira/browse/MAPREDUCE-1176
            Project: Hadoop Map/Reduce
         Issue Type: New Feature
   Affects Versions: 2.1.0-beta, 2.0.5-alpha
        Environment: Any
           Reporter: BitsOfInfo
           Assignee: Mariappan Asokan
        Attachments: MAPREDUCE-1176-v1.patch, MAPREDUCE-1176-v2.patch, MAPREDUCE-1176-v3.patch, MAPREDUCE-1176-v4.patch, mapreduce-1176_v1.patch, mapreduce-1176_v2.patch

Hello,
I would like to contribute the following two classes for incorporation into the mapreduce.lib.input package. These two classes can be used when you need to read data from files containing fixed length (fixed width) records. Such files have no CR/LF (or any combination thereof) and no delimiters: each record is a fixed length, extra data is padded with spaces, and the data is one gigantic line within a file.

Provided are two classes: the first is FixedLengthInputFormat, and the second is its corresponding FixedLengthRecordReader. When creating a job that specifies this input format, the job must have the mapreduce.input.fixedlengthinputformat.record.length property set, as follows:

myJobConf.setInt("mapreduce.input.fixedlengthinputformat.record.length", [myFixedRecordLength]);

OR

myJobConf.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, [myFixedRecordLength]);

This input format overrides computeSplitSize() in order to ensure that InputSplits do not contain any partial records, since with fixed records there is no way to determine where a record begins if that were to occur. Each InputSplit passed to the FixedLengthRecordReader will start at the beginning of a record, and the last byte in the InputSplit will be the last byte of a record. The override of computeSplitSize() delegates to FileInputFormat's compute method, and then adjusts the returned split size by doing the following:

(Math.floor(fileInputFormatsComputedSplitSize / fixedRecordLength) * fixedRecordLength)

This suite of fixed length input format classes does not support compressed files.
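To make the configuration step and the split-size adjustment above concrete, here is a minimal sketch; the class name, job name, and the 128 MB base split size are illustrative, not taken from the patch:

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FixedLengthJobSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    int recordLength = 100;  // hypothetical fixed record width

    // Either form described above; this string is the property name that
    // FixedLengthInputFormat.FIXED_RECORD_LENGTH refers to.
    conf.setInt("mapreduce.input.fixedlengthinputformat.record.length",
        recordLength);

    Job job = Job.getInstance(conf, "fixed-length-read");

    // The computeSplitSize() adjustment from the description: round the
    // split size computed by FileInputFormat down to a whole number of
    // records (long integer division performs the Math.floor step).
    long fileInputFormatsComputedSplitSize = 128L * 1024 * 1024;
    long adjustedSplitSize =
        (fileInputFormatsComputedSplitSize / recordLength) * recordLength;
    System.out.println("Adjusted split size: " + adjustedSplitSize);
  }
}
{code}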
[jira] [Commented] (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819148#comment-13819148 ]

Hadoop QA commented on MAPREDUCE-1176:
--------------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12613169/mapreduce-1176_v3.patch
  against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:red}-1 core tests{color}. The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:
        org.apache.hadoop.mapred.TestJobCleanup
    The following test timeouts occurred in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:
        org.apache.hadoop.mapreduce.v2.TestUberAM
    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4188//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/4188//console

This message is automatically generated.
[jira] [Commented] (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13819299#comment-13819299 ]

Sandy Ryza commented on MAPREDUCE-1176:
---------------------------------------

The test failures are unrelated - we're seeing them on other JIRAs as well. +1. Will commit this later today or tomorrow unless anybody has additional concerns.
[jira] [Commented] (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817879#comment-13817879 ]

Sandy Ryza commented on MAPREDUCE-1176:
---------------------------------------

Thanks for picking this up [~masokan]. A few minor stylistic nits:

{code}
+  private static final Log LOG
+    = LogFactory.getLog(FixedLengthRecordReader.class);
{code}
{code}
+  CompressionInputStream cIn
+    = codec.createInputStream(fileIn, decompressor);
{code}
{code}
+  public static final String FIXED_RECORD_LENGTH =
+    "fixedlengthinputformat.record.length";
{code}

The second line should only be indented four spaces past where the text on the first line starts. There might be a couple more of these to fix.

{code}
+    while(numBytesToRead > 0) {
{code}

Need a space after "while".

{code}
+      if (numBytesRead == -1) // EOF
+        break;
{code}

Curly braces should be used even in one-line if blocks. This applies to a couple other places in the patch as well.

{code}
+    if (! isCompressedInput) {
{code}

No space needed between the exclamation point and the variable name.

{code}
+    return(null == codec);
{code}

Add a space after "return".

Can we rename numRecordsInSplit to numRecordsRemainingInSplit to emphasize that it gets decremented as we read?

If we're going to include random tests, we should log the seed loudly so that failures can be reproduced.

If possible, the static block at the beginning of the test class should be moved to a static method with the JUnit BeforeClass annotation.

Other than these, the patch looks good to me.
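For illustration only, here is a small self-contained fragment showing the nits above applied together; the class, fields, and method are stand-ins, not the patch's actual code:

{code}
import java.io.IOException;
import java.io.InputStream;

class StyleNitsApplied {
  // Continuation line indented four spaces past where the first line's
  // text starts:
  public static final String FIXED_RECORD_LENGTH =
      "fixedlengthinputformat.record.length";

  int readRecord(InputStream in, byte[] buffer) throws IOException {
    int offset = 0;
    int numBytesToRead = buffer.length;
    while (numBytesToRead > 0) {  // space after "while"
      int numBytesRead = in.read(buffer, offset, numBytesToRead);
      if (numBytesRead == -1) {   // braces even for a one-line block
        break;
      }
      offset += numBytesRead;
      numBytesToRead -= numBytesRead;
    }
    return offset;
  }

  boolean isPlainInput(Object codec, boolean isCompressedInput) {
    if (!isCompressedInput) {  // no space after the exclamation point
      return true;
    }
    return (null == codec);    // space after "return"
  }
}
{code}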
[jira] [Commented] (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13762009#comment-13762009 ]

Mariappan Asokan commented on MAPREDUCE-1176:
---------------------------------------------

TestUberAM times out for a different reason, not related to this patch. See MAPREDUCE-5481.

-- Asokan
[jira] [Commented] (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756652#comment-13756652 ]

Mariappan Asokan commented on MAPREDUCE-1176:
---------------------------------------------

I am posting a patch with implementations of both the old and new APIs for FixedLengthInputFormat. Compressed input is supported. I thought of supporting splittable compressed input as multiple splits; however, the uncompressed size of a split is not available for splittable compressed input, so any compressed input is treated as one split.
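A minimal sketch of how "one split per compressed file" might be expressed, assuming the patch follows the usual FileInputFormat pattern of disabling splitting when a codec matches the file; the class name here is hypothetical:

{code}
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public abstract class FixedLengthInputFormatSketch
    extends FileInputFormat<LongWritable, BytesWritable> {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    CompressionCodec codec = new CompressionCodecFactory(
        context.getConfiguration()).getCodec(file);
    // With no codec the file splits normally; a compressed file becomes a
    // single split because its uncompressed size (and hence its record
    // boundaries) cannot be computed up front.
    return codec == null;
  }
}
{code}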
[jira] [Commented] (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756736#comment-13756736 ]

Hadoop QA commented on MAPREDUCE-1176:
--------------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12601168/mapreduce-1176_v1.patch
  against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
    {color:red}-1 javac{color}. The applied patch generated 1150 javac compiler warnings (more than the trunk's current 1148 warnings).
    {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:
        org.apache.hadoop.mapreduce.v2.TestUberAM
    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3977//testReport/
Javac warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3977//artifact/trunk/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3977//console

This message is automatically generated.
[jira] [Commented] (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13756864#comment-13756864 ]

Mariappan Asokan commented on MAPREDUCE-1176:
---------------------------------------------

Got rid of the javac warnings and uploaded a new patch.
[jira] [Commented] (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13757004#comment-13757004 ]

Hadoop QA commented on MAPREDUCE-1176:
--------------------------------------

{color:red}-1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12601195/mapreduce-1176_v2.patch
  against trunk revision .

    {color:green}+1 @author{color}. The patch does not contain any @author tags.
    {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.
    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.
    {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages.
    {color:green}+1 eclipse:eclipse{color}. The patch built with eclipse:eclipse.
    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.
    {color:red}-1 core tests{color}. The following test timeouts occurred in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient:
        org.apache.hadoop.mapreduce.v2.TestUberAM
    {color:green}+1 contrib tests{color}. The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3978//testReport/
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/3978//console

This message is automatically generated.
[jira] [Commented] (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13736956#comment-13736956 ]

Mariappan Asokan commented on MAPREDUCE-1176:
---------------------------------------------

I went over the code for TextInputFormat. Here are my conclusions:

* I think we can make FixedLengthInputFormat look very similar to TextInputFormat. Specifically, the key can be LongWritable (which indicates the position of the record in the file) and the value can be BytesWritable (since the records can contain arbitrary binary data). The implementation can also be simpler and similar to TextInputFormat. There is no need for custom key and value settings. (A sketch of a consuming mapper follows this comment.)
* Splittable compressed input can be supported.
* Since the start location of each split is available (in a FileSplit object), it is easy to compute the number of bytes to skip at the beginning of each split.

I will proceed with the implementation and post a patch. Also, I raised MAPREDUCE-5455 to support the complement, namely FixedLengthOutputFormat. With these, I think we can cut down some CPU time in the TeraSort benchmark, since the records are 100 bytes long and have fixed lengths; there is no need for byte-by-byte scanning to identify records.
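For illustration, a mapper consuming the proposed LongWritable/BytesWritable key and value types might look like the following; the mapper class and its body are hypothetical, not part of any attached patch:

{code}
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Mapper;

public class FixedRecordMapper
    extends Mapper<LongWritable, BytesWritable, LongWritable, BytesWritable> {
  @Override
  protected void map(LongWritable key, BytesWritable value, Context context)
      throws IOException, InterruptedException {
    // key   = byte position of the record within the file
    // value = the record's bytes; getLength() gives the record length,
    //         and the contents may be arbitrary binary data
    context.write(key, value);
  }
}
{code}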
[jira] [Commented] (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13733518#comment-13733518 ]

BitsOfInfo commented on MAPREDUCE-1176:
---------------------------------------

Thanks. Anyway, the implementation I originally threw together worked great for our needs. If this is an optimization over just computing the split size, then go ahead.
[jira] [Commented] (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13731036#comment-13731036 ]

Mariappan Asokan commented on MAPREDUCE-1176:
---------------------------------------------

I was looking for an implementation of this record format as well. I agree with the following comment by Todd:

{quote}
As a general note, I'm not sure I agree with the design here. Rather than forcing the split to lie on record boundaries, I think it would be simpler to simply let FileInputFormat compute its own splits, and then when you first open the record reader, skip forward to the next record boundary and begin reading from there. Then for the last record of the file, over-read your split into the beginning of the next one. This is the strategy that other input formats take, and should be compatible with the splittable compression codecs (see TextInputFormat for example).
{quote}

I think we should support fixed length records spanning across HDFS blocks. BitsOfInfo, do you mind if I pick up your patch, enhance it to take care of the above case, and post a patch for the trunk? I would appreciate it if a committer can come forward to review the patch and commit it to the trunk. Thanks.

-- Asokan
[jira] [Commented] (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13731076#comment-13731076 ]

Debashis Saha commented on MAPREDUCE-1176:
------------------------------------------

The reason other input formats take that approach is that they don't have any other way to figure out the exact boundary. With a fixed format you know the boundary exactly, and in my opinion you should take advantage of it.

-- Deba
[jira] [Commented] (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13731089#comment-13731089 ]

Mariappan Asokan commented on MAPREDUCE-1176:
---------------------------------------------

Hi Debashis,
  You are correct. It is easy to identify records spanning across HDFS blocks.

-- Asokan
[jira] [Commented] (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13731149#comment-13731149 ]

BitsOfInfo commented on MAPREDUCE-1176:
---------------------------------------

Asokan: Sure, go ahead and make whatever changes are necessary, as I have no time to work on this anymore; yet I would like to see this put into the project, as I had a use for it when I created it and I'm sure others do as well.

BTW: I never had my original question answered from a few years ago in regards to the design; maybe I was missing something.

bq. Hmm, ok, do you have a suggestion on how I detect where one record begins and one record ends when records are not identifiable by any sort of consistent start character or end character boundary but just flow together?

I could see the RecordReader detecting that it read fewer than RECORD_LENGTH bytes upon hitting the end of the split and discarding it. But I am not sure how it would detect the start of a record in a split that has partial data at the start of it, especially if there is no consistent boundary/char marker that identifies the start of a record.
[jira] [Commented] (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13731239#comment-13731239 ]

Mariappan Asokan commented on MAPREDUCE-1176:
---------------------------------------------

BitsOfInfo,
  For each split, you need to compute how many bytes to skip (to account for a partial record that spans the previous and current splits). Let us say we are processing split N (where N is a 0-based number) in the record reader, Z is the cumulative total of split sizes for splits 0 through N-1, L is the record length, and S is the number of bytes to skip at the beginning of split N. Then:

When N = 0, S = 0; for all other N, S = L - (Z mod L).

The record reader should account for the last record in a split by reading additional bytes from the next split if necessary. Hope I clarified the logic.

-- Asokan
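A minimal sketch of the skip computation, using the variables defined above. One hedged addition: a final "mod L" so that a split starting exactly on a record boundary (Z mod L == 0) skips zero bytes rather than a whole record; the stated formula leaves that aligned case implicit.

{code}
public class SplitSkipMath {
  // S: bytes to skip at the start of split N, given Z (cumulative size of
  // splits 0..N-1) and L (the fixed record length).
  static long bytesToSkip(int n, long z, long l) {
    if (n == 0) {
      return 0;
    }
    // The comment's formula is S = L - (Z mod L); the outer "% l" handles
    // the case where split N happens to start on a record boundary.
    return (l - (z % l)) % l;
  }

  public static void main(String[] args) {
    // Example: 100-byte records; split 1 starts after 250 bytes, so the
    // first 50 bytes belong to a record begun in split 0 and are skipped.
    System.out.println(bytesToSkip(1, 250, 100));  // prints 50
    System.out.println(bytesToSkip(1, 300, 100));  // aligned: prints 0
  }
}
{code}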
[jira] [Commented] (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13720189#comment-13720189 ]

karuth sanker commented on MAPREDUCE-1176:
------------------------------------------

I agree with Jonathan; this feature is very critical for a lot of the files I deal with. Chris and others, can you please allow this feature to be included?
[jira] [Commented] (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13591931#comment-13591931 ]

Jonathan Clark commented on MAPREDUCE-1176:
-------------------------------------------

This patch appeared to be within inches of being incorporated 3 years ago, and this feature is still sorely missing from Hadoop. Where do things stand now?
[jira] Commented: (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12841730#action_12841730 ] Hadoop QA commented on MAPREDUCE-1176: -- +1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12434480/MAPREDUCE-1176-v4.patch against trunk revision 919277. +1 @author. The patch does not contain any @author tags. +1 tests included. The patch appears to include 3 new or modified tests. +1 javadoc. The javadoc tool did not generate any warning messages. +1 javac. The applied patch does not increase the total number of javac compiler warnings. +1 findbugs. The patch does not introduce any new Findbugs warnings. +1 release audit. The applied patch does not increase the total number of release audit warnings. +1 core tests. The patch passed core unit tests. +1 contrib tests. The patch passed contrib unit tests. Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/22/testReport/ Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/22/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/22/artifact/trunk/build/test/checkstyle-errors.html Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h4.grid.sp2.yahoo.net/22/console This message is automatically generated. Contribution: FixedLengthInputFormat and FixedLengthRecordReader Key: MAPREDUCE-1176 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1176 Project: Hadoop Map/Reduce Issue Type: New Feature Affects Versions: 0.20.1, 0.20.2 Environment: Any Reporter: BitsOfInfo Priority: Minor Attachments: MAPREDUCE-1176-v1.patch, MAPREDUCE-1176-v2.patch, MAPREDUCE-1176-v3.patch, MAPREDUCE-1176-v4.patch Hello, I would like to contribute the following two classes for incorporation into the mapreduce.lib.input package. These two classes can be used when you need to read data from files containing fixed length (fixed width) records. Such files have no CR/LF (or any combination thereof), no delimiters etc, but each record is a fixed length, and extra data is padded with spaces. The data is one gigantic line within a file. Provided are two classes first is the FixedLengthInputFormat and its corresponding FixedLengthRecordReader. When creating a job that specifies this input format, the job must have the mapreduce.input.fixedlengthinputformat.record.length property set as follows myJobConf.setInt(mapreduce.input.fixedlengthinputformat.record.length,[myFixedRecordLength]); OR myJobConf.setInt(FixedLengthInputFormat.FIXED_RECORD_LENGTH, [myFixedRecordLength]); This input format overrides computeSplitSize() in order to ensure that InputSplits do not contain any partial records since with fixed records there is no way to determine where a record begins if that were to occur. Each InputSplit passed to the FixedLengthRecordReader will start at the beginning of a record, and the last byte in the InputSplit will be the last byte of a record. The override of computeSplitSize() delegates to FileInputFormat's compute method, and then adjusts the returned split size by doing the following: (Math.floor(fileInputFormatsComputedSplitSize / fixedRecordLength) * fixedRecordLength) This suite of fixed length input format classes, does not support compressed files. -- This message is automatically generated by JIRA. 
[jira] Commented: (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789159#action_12789159 ] Hadoop QA commented on MAPREDUCE-1176: --
+1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12426619/MAPREDUCE-1176-v3.patch against trunk revision 889496.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/187/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/187/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/187/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h3.grid.sp2.yahoo.net/187/console
This message is automatically generated.
[jira] Commented: (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787905#action_12787905 ] Todd Lipcon commented on MAPREDUCE-1176:
To retrigger Hudson, hit "Cancel Patch" and then "Submit Patch" again.
[jira] Commented: (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779998#action_12779998 ] BitsOfInfo commented on MAPREDUCE-1176: ---
What is the next step for this issue? Does anything else need to be submitted?
[jira] Commented: (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780225#action_12780225 ] Todd Lipcon commented on MAPREDUCE-1176:
Hi, just had a chance to look at your patch in more detail now that the initial formatting issues have been straightened out:
- Please provide a static setter for the mapreduce.input.fixedlengthinputformat.record.length configuration property, rather than having users specify the property manually. An example of this is in o.a.h.mapreduce.lib.input.NLineInputFormat.setNumLinesPerSplit. Also update the javadoc and tests to refer to this new method.
- The following code makes me nervous:
{code}
+    long splitSize =
+      ((long)(Math.floor((double)defaultSize /
+                         (double)recordLength))) * recordLength;
{code}
Why can't you just keep defaultSize and recordLength as longs? Division of longs will give you floor-like behavior, and you won't have to worry about floating-point inaccuracies.
- In isSplitable, you catch the exception generated by getRecordLength and turn off splitting. If there is no record length specified, doesn't that mean the input format won't work at all?
- FixedLengthRecordReader: "This record reader does not support compressed files." Is this true? Or does it just not support *splitting* compressed files? I see that you've explicitly disallowed it, but I don't understand this decision.
- Throughout, you've still got 4-space indentation in the method bodies. Indentation should be by 2.
- In FixedLengthRecordReader, you hard-code a 64KB buffer. Why is this? You should let the filesystem use its default.
- In your read loop, you're not accounting for the case of read returning 0 or -1, which I believe can happen at EOF, right? Consider using o.a.h.io.IOUtils.readFully() to replace this loop.
As a general note, I'm not sure I agree with the design here. Rather than forcing the split to lie on record boundaries, I think it would be simpler to let FileInputFormat compute its own splits and then, when you first open the record reader, skip forward to the next record boundary and begin reading from there. For the last record of the split, over-read into the beginning of the next one. This is the strategy other input formats take, and it should be compatible with the splittable compression codecs (see TextInputFormat for an example). I don't want to harp too much on the compression thing, but in my experience the sorts of datasets that have these fixed-length records are very highly compressible: lots and lots of numeric fields/UPCs/zip codes/etc.
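As a concrete sketch of the first two review points above: a static setter in the style of NLineInputFormat.setNumLinesPerSplit, plus the split-size computation done entirely on longs. This is illustrative only, not the committed patch; the class name FixedLengthSketch and method name setRecordLength are made up here, though the property key and FIXED_RECORD_LENGTH constant come from the issue description.
{code}
import org.apache.hadoop.mapreduce.Job;

// Illustrative sketch only -- not the committed patch.
public class FixedLengthSketch {

  public static final String FIXED_RECORD_LENGTH =
      "mapreduce.input.fixedlengthinputformat.record.length";

  // Static setter modeled on NLineInputFormat.setNumLinesPerSplit,
  // so callers never have to spell out the property name by hand.
  public static void setRecordLength(Job job, int recordLength) {
    job.getConfiguration().setInt(FIXED_RECORD_LENGTH, recordLength);
  }

  // Division of two positive longs already floors, so the split can be
  // rounded down to a whole number of records without any cast through
  // double and without floating-point inaccuracy.
  static long computeSplitSize(long defaultSize, long recordLength) {
    return (defaultSize / recordLength) * recordLength;
  }
}
{code}
With a setter like this, job setup becomes a single call such as FixedLengthSketch.setRecordLength(job, recordLength) instead of a hand-typed property key.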
[jira] Commented: (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780367#action_12780367 ] BitsOfInfo commented on MAPREDUCE-1176: ---
> Why can't you just keep defaultSize and recordLength as longs?
Because findbugs threw warnings if they were not cast, and secondly the code works as expected. Please just shoot over how you want that calculation rewritten and I can certainly change it.
> In isSplitable, you catch the exception generated by getRecordLength and turn off splitting. If there is no record length specified, doesn't that mean the input format won't work at all?
Nope, it would still work; I have yet to see an original raw data file of fixed-width records that for some reason does not contain complete records. But that's fine, we can just exit out here to let the user know they need to configure that property. If there is a better place to check for the existence of that property, please let me know.
> FixedLengthRecordReader: "This record reader does not support compressed files." Is this true?
Correct, as stated in the docs. The reason is that when I wrote this I was not dealing with compressed files. Secondly, if an input file were compressed, I was not sure of the procedure for properly computing splits against a compressed file, since the byte lengths of the records would differ in compressed form vs. what is passed to the RecordReader.
> Throughout, you've still got 4-space indentation in the method bodies. Indentation should be by 2.
Does anyone know of an automated tool that will fix this? It's driving me nuts going line by line and hitting delete twice. When I look at this in Eclipse I am not seeing 4 spaces.
> In FixedLengthRecordReader, you hard-code a 64KB buffer. You should let the filesystem use its default.
Sure, I can get rid of that.
> In your read loop, you're not accounting for the case of read returning 0 or -1 ... Consider using o.a.h.io.IOUtils.readFully() to replace this loop.
Ditto, I can change to that.
> As a general note, I'm not sure I agree with the design here. Rather than forcing the split to lie on record boundaries, ...
OK, that's fine; I just wanted to contribute what I wrote that is working for my case.
> ... open the record reader, skip forward to the next record boundary ...
Hmm, OK, do you have a suggestion on how to detect where one record begins and ends when records are not identifiable by any consistent start or end character but just flow together? I could see the RecordReader detecting that it read only a partial record at the end of a split and discarding it. But I am not sure how it would detect the start of a record in a split that has partial data at its start, especially if there is no consistent boundary/character marker identifying where a record begins.
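One possible answer to the boundary question, sketched under the assumption stated in the issue description that the file is nothing but back-to-back fixed-length records: boundaries then sit at exact multiples of the record length, so a reader can realign from the split's byte offset alone, with no marker bytes needed. Names below are illustrative, not from any patch.
{code}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.io.IOUtils;

// Illustrative sketch only: assumes the file contains back-to-back
// fixed-length records with no header, so record boundaries fall at
// exact multiples of the record length.
public class BoundarySketch {

  // First record boundary at or after splitStart (hypothetical helper).
  static long firstRecordBoundary(long splitStart, int recordLength) {
    long remainder = splitStart % recordLength;
    return remainder == 0 ? splitStart : splitStart + (recordLength - remainder);
  }

  // One-record read via IOUtils.readFully, which loops internally and
  // throws an IOException on premature EOF -- replacing a hand-rolled
  // loop that would otherwise have to handle read() returning 0 or -1.
  static void readRecord(FSDataInputStream in, byte[] record) throws IOException {
    IOUtils.readFully(in, record, 0, record.length);
  }
}
{code}
A reader opened on a split would seek to firstRecordBoundary(start, recordLength) (with start taken from, e.g., FileSplit.getStart()) and read whole records from there, over-reading past the split end for the final record, as suggested in the review.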
[jira] Commented: (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1295#action_1295 ] Hadoop QA commented on MAPREDUCE-1176: --
-1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12424924/FixedLengthRecordReader.java against trunk revision 836063.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
-1 patch. The patch command could not apply the patch.
Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/243/console
This message is automatically generated.
[jira] Commented: (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1298#action_1298 ] Todd Lipcon commented on MAPREDUCE-1176:
Hi,
- Please *only* upload the patch to Hudson. Otherwise the QA bot gets confused and tries to apply your .java files as a patch.
- Also, the coding style guidelines for Hadoop call for an indentation level of 2 spaces, and it looks like your patch is full of tabs. There are a few other style violations. The coding style is http://java.sun.com/docs/codeconv/ with the change of 2 spaces instead of 4. It's probably easier to look through other parts of the Hadoop codebase and simply follow their example; a quick illustration follows below.
- There's a comment referring to the 0.20.1 code. Since this patch is slated for trunk, not 0.20.1, please remove that.
- There are some other bits of commented-out code. These are a no-no: either the code works and is important, in which case it should be there, or it's not important (or broken) and it shouldn't.
Thanks again for contributing to Hadoop! The review process can take a while, but it's important to maintain style consistency across the codebase.
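To make the indentation point concrete, a hypothetical before/after fragment (the getter shown is made up for illustration):
{code}
// As submitted: 4-space (or tab) indentation inside method bodies.
public int getRecordLength() {
    return recordLength;
}

// Hadoop convention: Sun style, but with 2-space indentation.
public int getRecordLength() {
  return recordLength;
}
{code}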
[jira] Commented: (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777802#action_12777802 ] BitsOfInfo commented on MAPREDUCE-1176: ---
I followed the instructions listed at http://wiki.apache.org/hadoop/HowToContribute:
> Finally, patches should be attached to an issue report in Jira via the Attach File link on the issue's Jira. When you believe that your patch is ready to be committed, select the Submit Patch link on the issue's Jira.
So are you saying to delete the 2 *.java files and only upload the .patch? The *.patch file does contain a unit test, so I am not sure why the comment above reported that no tests were included. I ran this patch file against a clean trunk copy locally on my test machine and also verified it was OK through the test-patch task on the contribute how-to page.
I'll remove the 4-space indents; when looking through other sources I found the 2-space style and the heavy line wrapping unreadable.
[jira] Commented: (MAPREDUCE-1176) Contribution: FixedLengthInputFormat and FixedLengthRecordReader
[ https://issues.apache.org/jira/browse/MAPREDUCE-1176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772397#action_12772397 ] Todd Lipcon commented on MAPREDUCE-1176:
Hi, could you please post this as a patch file against the hadoop-mapreduce trunk? This will allow Hudson to automatically test the change. Also, a couple of notes:
- Please include the Apache license header at the top of these files.
- @author tags are discouraged in Apache projects.
- Please include unit tests for this new code.
Thanks for the contribution; looking forward to seeing this in trunk!
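For reference, the license header being requested is the standard ASF block that Hadoop's Java source files open with (reproduced here for convenience):
{code}
/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
{code}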