[jira] [Commented] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records
[ https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15006294#comment-15006294 ]

zhihai xu commented on MAPREDUCE-6549:
--------------------------------------

Nice catch! But I don't think this issue is related to MAPREDUCE-6481: it would still happen without MAPREDUCE-6481. I also think the same issue may occur for compressed input; the attached patch only fixes it for uncompressed input.

> multibyte delimiters with LineRecordReader cause duplicate records
> ------------------------------------------------------------------
>
>             Key: MAPREDUCE-6549
>             URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
>         Project: Hadoop Map/Reduce
>      Issue Type: Bug
>      Components: mrv1, mrv2
> Affects Versions: 2.7.2
>        Reporter: Dustin Cote
>        Assignee: Wilfred Spiegelenburg
>     Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch
>
> LineRecordReader currently produces duplicate records under certain scenarios, such as:
> 1) input string: "abc+++def++ghi++", delimiter string: "+++": test passes with all split sizes
> 2) input string: "abc++def+++ghi++", delimiter string: "+++": test fails with a split size of 4
> 3) input string: "abc+++def++ghi++", delimiter string: "++": test fails with a split size of 5
> 4) input string: "abc+++defg++hij++", delimiter string: "++": test fails with a split size of 4
> 5) input string: "abc++def+++ghi++", delimiter string: "++": test fails with a split size of 9

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records
[ https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wilfred Spiegelenburg updated MAPREDUCE-6549:
---------------------------------------------
    Component/s: mrv2
                 mrv1
[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records
[ https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wilfred Spiegelenburg updated MAPREDUCE-6549:
---------------------------------------------
    Attachment: MAPREDUCE-6549-2.patch

The issue is related to [MAPREDUCE-6481]. That jira changed the position calculation and made sure that full records are returned by the reader as expected, but it did not anticipate the record duplication, and the junit tests did not cover the use cases that would have discovered it. As far as I can trace, the problem is limited to multi-byte delimiters only.

The junit tests for the multi-byte delimiter only take the best-case scenario into account: the input data contains the exact delimiter and no ambiguous characters. As soon as either the delimiter or the input data is changed, a failure is triggered. The failure does not clearly show when or how it happens; analysis of the test failures shows that a specific combination of input data, split size and buffer size is needed to trigger it. Based on testing, the record duplication occurs only if all of the following hold:
- the first character(s) of the delimiter are part of the record data, for example:
  1) the delimiter is {{\+=}} and the data contains a {{\+}} that is not followed by {{=}}
  2) the delimiter is {{\+=\+=}} and the data contains {{\+=\+}} that is not followed by {{=}}
- a delimiter character is found at the split boundary, as the last character before the split ends
- a fill of the buffer is triggered to finish processing the record

The underlying problem is that {{UncompressedSplitLineReader}} sets a flag called {{needAdditionalRecord}} when we fill the buffer and have encountered part of a delimiter in combination with a split; we track this in the ambiguous character count. However, it turns out that if the character(s) found after that point do not belong to a delimiter, we never unset {{needAdditionalRecord}}. This causes the next record to be read twice, and thus we see duplicated records.

The solution is to unset the flag when we detect that we are not processing a delimiter. We currently only add the ambiguous characters to the record being read and reset the count to 0; at that same point we need to unset the flag.

The patch was developed against junit tests that exercise the split and buffer settings in combination with multiple delimiter types and different inputs. All cases now give a consistent record count and the correct position inside the data.
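The bookkeeping described above can be illustrated with a small, self-contained state machine. This is not the Hadoop code; the names mirror the discussion ({{delimPos}} for the ambiguous-character count, {{needAdditionalRecord}} for the flag), and the re-test after a mismatch only checks the first delimiter character, which is enough for the delimiters discussed here but not a general backtracking implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class DelimiterScan {
    // Naive split on a multi-byte delimiter, tracking partially matched
    // ("ambiguous") delimiter characters the way the reader does.
    public static List<String> split(String data, String delim) {
        List<String> records = new ArrayList<>();
        StringBuilder record = new StringBuilder();
        int delimPos = 0;                      // ambiguous chars matched so far
        boolean needAdditionalRecord = false;  // set on a buffer refill mid-delimiter
        for (int i = 0; i < data.length(); i++) {
            char c = data.charAt(i);
            if (c == delim.charAt(delimPos)) {
                delimPos++;
                if (delimPos == delim.length()) {  // full delimiter: emit record
                    records.add(record.toString());
                    record.setLength(0);
                    delimPos = 0;
                }
            } else {
                // The ambiguous chars turned out to be ordinary data: flush
                // them into the record AND clear the flag. Forgetting this
                // flag reset is the bug behind the duplicated records.
                record.append(delim, 0, delimPos);
                delimPos = 0;
                needAdditionalRecord = false;
                if (c == delim.charAt(0)) {    // re-test against delimiter start
                    delimPos = 1;
                } else {
                    record.append(c);
                }
            }
        }
        record.append(delim, 0, delimPos);     // trailing ambiguous chars are data
        if (record.length() > 0) records.add(record.toString());
        return records;
    }
}
```

With the inputs from the description, {{split("abc++def+++ghi++", "+++")}} yields the two records {{abc++def}} and {{ghi++}}, with no duplication.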
[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records
[ https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wilfred Spiegelenburg updated MAPREDUCE-6549:
---------------------------------------------
    Status: Patch Available  (was: Open)
[jira] [Commented] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records
[ https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15006080#comment-15006080 ]

Wilfred Spiegelenburg commented on MAPREDUCE-6549:
--------------------------------------------------

[~cotedm] I have picked up the jira and have a fully tested and working patch for the issue.
[jira] [Assigned] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records
[ https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wilfred Spiegelenburg reassigned MAPREDUCE-6549:
------------------------------------------------
    Assignee: Wilfred Spiegelenburg  (was: Dustin Cote)
[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records
[ https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wilfred Spiegelenburg updated MAPREDUCE-6549:
---------------------------------------------
    Status: Open  (was: Patch Available)

I tried the change made in the patch and it fails the current tests. The patch changes one test (TestLineRecordReader.java), but we have two versions of that test: the mapred version is unchanged and now fails, and the mapreduce version works but fails again as soon as I change the delimiter back. That means the change does not fix the issue; it also brings the two tests out of sync, which is not correct.
[jira] [Commented] (MAPREDUCE-6548) Jobs executed can be configurated with specific users and time hours
[ https://issues.apache.org/jira/browse/MAPREDUCE-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005942#comment-15005942 ]

Steve Loughran commented on MAPREDUCE-6548:
-------------------------------------------

Modifying the jobclient will do nothing for any other work submitted to the YARN scheduler: Hive+Tez, Spark on YARN, etc.

What about having a special queue for the critical job which can pre-empt others' work? That way your job wins even if the cluster is full, and their stuff just gets blocked until there is capacity.

> Jobs executed can be configurated with specific users and time hours
> --------------------------------------------------------------------
>
>             Key: MAPREDUCE-6548
>             URL: https://issues.apache.org/jira/browse/MAPREDUCE-6548
>         Project: Hadoop Map/Reduce
>      Issue Type: Improvement
>      Components: job submission
> Affects Versions: 2.7.1
>        Reporter: Lin Yiqun
>        Assignee: Lin Yiqun
>     Attachments: MAPREDUCE-6548.001.patch
>
> In recent Hadoop versions the system has no limitation on users executing their jobs if no ACLs are configured, and I find that the ACLs are only checked in IPC, not during job submission. This does not cover the following case: I have a very important job that I want to execute from 0 to 9 o'clock, and to let it run quickly I do not want other users' jobs to execute during that time, so I can see the result the next morning. So maybe we can let jobs be executed by specific users during specific hours.
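Steve's dedicated-queue suggestion can be approximated with CapacityScheduler settings; a hedged sketch, where the queue names and capacity percentages are illustrative and preemption additionally requires the scheduler monitor to be enabled:

```xml
<!-- capacity-scheduler.xml: an illustrative "critical" queue beside default -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>default,critical</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.critical.capacity</name>
  <value>40</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.default.capacity</name>
  <value>60</value>
</property>

<!-- yarn-site.xml: enable the preemption monitor so the critical queue
     can reclaim its guaranteed capacity from running work -->
<property>
  <name>yarn.resourcemanager.scheduler.monitor.enable</name>
  <value>true</value>
</property>
```

The job is then submitted to the {{critical}} queue, and ACLs on that queue restrict who may use it.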
[jira] [Commented] (MAPREDUCE-6542) HistoryViewer use SimpleDateFormat,But SimpleDateFormat is not threadsafe
[ https://issues.apache.org/jira/browse/MAPREDUCE-6542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005935#comment-15005935 ]

Daniel Templeton commented on MAPREDUCE-6542:
---------------------------------------------

[~piaoyu zhang], thanks for the new patch. We're getting closer. Three more things I'd like to see:
# Mark the old {{getFormattedTimeWithDiff()}} method as {{@Deprecated}}
# Refactor both {{getFormattedTimeWithDiff()}} methods to call a new private method that contains the replicated logic, something like {{private static getFormattedTimeWithDiff(String formattedFinishTime, long startTime)}}
# Use {{StringBuilder}} instead of {{StringBuffer}}. {{StringBuffer}} is MT-safe, but the places where it's used are locally scoped and therefore necessarily single-threaded; {{StringBuilder}} is faster precisely because it is not MT-safe. (http://stackoverflow.com/questions/355089/stringbuilder-and-stringbuffer)

If you're feeling ambitious, it would also be nice to add a unit test that tries to force concurrent access to the function.

> HistoryViewer use SimpleDateFormat,But SimpleDateFormat is not threadsafe
> -------------------------------------------------------------------------
>
>             Key: MAPREDUCE-6542
>             URL: https://issues.apache.org/jira/browse/MAPREDUCE-6542
>         Project: Hadoop Map/Reduce
>      Issue Type: Bug
>      Components: jobhistoryserver
> Affects Versions: 2.2.0, 2.7.1
>     Environment: CentOS 6.5, Hadoop
>        Reporter: zhangyubiao
>        Assignee: zhangyubiao
>     Attachments: MAPREDUCE-6542-v2.patch, MAPREDUCE-6542-v3.patch, MAPREDUCE-6542.patch
>
> I used SimpleDateFormat to parse the JobHistory file before:
>
>     private static final SimpleDateFormat dateFormat =
>         new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
>
>     public static String getJobDetail(JobInfo job) {
>         StringBuffer jobDetails = new StringBuffer("");
>         SummarizedJob ts = new SummarizedJob(job);
>         jobDetails.append(job.getJobId().toString().trim()).append("\t");
>         jobDetails.append(job.getUsername()).append("\t");
>         jobDetails.append(job.getJobname().replaceAll("\\n", "")).append("\t");
>         jobDetails.append(job.getJobQueueName()).append("\t");
>         jobDetails.append(job.getPriority()).append("\t");
>         jobDetails.append(job.getJobConfPath()).append("\t");
>         jobDetails.append(job.getUberized()).append("\t");
>         jobDetails.append(dateFormat.format(job.getSubmitTime())).append("\t");
>         jobDetails.append(dateFormat.format(job.getLaunchTime())).append("\t");
>         jobDetails.append(dateFormat.format(job.getFinishTime())).append("\t");
>         return jobDetails.toString();
>     }
>
> But when I compared the SubmitTime and LaunchTime queried in Hive against the JobHistory file times, I found that the submitTime and launchTime were wrong. After changing to FastDateFormat to parse the time format, the times became correct.
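The thread-safety problem above can also be avoided with {{java.time.format.DateTimeFormatter}}, which is immutable and safe to share across threads. A minimal sketch, not the actual patch (the reporter used commons-lang {{FastDateFormat}}); UTC is pinned here only so the output is deterministic:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class SafeFormat {
    // DateTimeFormatter is immutable, so a shared static instance is safe,
    // unlike SimpleDateFormat whose internal state corrupts under
    // concurrent format() calls.
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.UTC);

    public static String format(long epochMillis) {
        // StringBuilder, not StringBuffer: the builder never escapes this
        // method, so no synchronization is needed (Daniel's point 3).
        StringBuilder sb = new StringBuilder();
        sb.append(FMT.format(Instant.ofEpochMilli(epochMillis)));
        return sb.toString();
    }
}
```

Many threads can call {{SafeFormat.format()}} concurrently without any locking, which is exactly the property {{SimpleDateFormat}} lacks.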
[jira] [Updated] (MAPREDUCE-6548) Jobs executed can be configurated with specific users and time hours
[ https://issues.apache.org/jira/browse/MAPREDUCE-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lin Yiqun updated MAPREDUCE-6548:
---------------------------------
    Affects Version/s: 2.7.1
[jira] [Commented] (MAPREDUCE-6548) Jobs executed can be configurated with specific users and time hours
[ https://issues.apache.org/jira/browse/MAPREDUCE-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005817#comment-15005817 ]

Lin Yiqun commented on MAPREDUCE-6548:
--------------------------------------

Hi [~bikassaha], I looked through YARN-1051. I think it aims to admit jobs at the RM at admission-control time and to guarantee those jobs enough resources. The main point of my idea, though, is to limit users/jobs to certain time windows with a simple check in the job client; guaranteeing important jobs enough resources is just one scenario this function covers. So there would be a user/time admission control that refuses to submit non-qualifying jobs to the cluster. This function is easier to achieve in the job client than in YARN.
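The client-side gate described above could look like the following sketch. Everything here is illustrative, the class and parameter names are invented for this example and do not come from any attached patch:

```java
import java.time.LocalTime;
import java.util.Set;

public class SubmitWindowCheck {
    // During the restricted window [start, end), only the allowed users may
    // submit; outside the window, anyone may. This is the simple job-client
    // check proposed in the comment, not an actual Hadoop API.
    public static boolean maySubmit(String user, LocalTime now,
                                    Set<String> allowedUsers,
                                    LocalTime start, LocalTime end) {
        boolean inWindow = !now.isBefore(start) && now.isBefore(end);
        return !inWindow || allowedUsers.contains(user);
    }
}
```

As Steve notes in his comment, such a check only constrains clients that use this jobclient; work submitted to YARN by other frameworks bypasses it entirely, which is the argument for enforcing the policy in the scheduler instead.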