[jira] [Commented] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records

2015-11-15 Thread zhihai xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15006294#comment-15006294
 ] 

zhihai xu commented on MAPREDUCE-6549:
--

Nice catch! But I don't think this issue is related to MAPREDUCE-6481; it 
would still happen without MAPREDUCE-6481. I also think the same issue may 
happen for compressed input, but the attached patch only fixes it for 
uncompressed input.

> multibyte delimiters with LineRecordReader cause duplicate records
> --
>
> Key: MAPREDUCE-6549
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv1, mrv2
>Affects Versions: 2.7.2
>Reporter: Dustin Cote
>Assignee: Wilfred Spiegelenburg
> Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch
>
>
> LineRecordReader currently produces duplicate records under certain 
> scenarios such as:
> 1) input string: "abc+++def++ghi++" 
> delimiter string: "+++" 
> test passes with all sizes of the split 
> 2) input string: "abc++def+++ghi++" 
> delimiter string: "+++" 
> test fails with a split size of 4 
> 3) input string: "abc+++def++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 5 
> 4) input string: "abc+++defg++hij++" 
> delimiter string: "++" 
> test fails with a split size of 4 
> 5) input string: "abc++def+++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 9 





[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records

2015-11-15 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated MAPREDUCE-6549:
-
Component/s: mrv2
 mrv1

> multibyte delimiters with LineRecordReader cause duplicate records
> --
>
> Key: MAPREDUCE-6549
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: mrv1, mrv2
>Affects Versions: 2.7.2
>Reporter: Dustin Cote
>Assignee: Wilfred Spiegelenburg
> Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch
>
>
> LineRecordReader currently produces duplicate records under certain 
> scenarios such as:
> 1) input string: "abc+++def++ghi++" 
> delimiter string: "+++" 
> test passes with all sizes of the split 
> 2) input string: "abc++def+++ghi++" 
> delimiter string: "+++" 
> test fails with a split size of 4 
> 3) input string: "abc+++def++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 5 
> 4) input string: "abc+++defg++hij++" 
> delimiter string: "++" 
> test fails with a split size of 4 
> 5) input string: "abc++def+++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 9 





[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records

2015-11-15 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated MAPREDUCE-6549:
-
Attachment: MAPREDUCE-6549-2.patch

The issue is related to [MAPREDUCE-6481]. That jira changed the position 
calculation and made sure that full records are returned by the reader as 
expected, but it did not anticipate the record duplication, and the JUnit 
tests did not cover the use cases needed to catch it. As far as I can trace, 
the problem is limited to multi-byte delimiters.

The JUnit tests for the multi-byte delimiter only cover the best-case 
scenario: the input data contains the exact delimiter and no ambiguous 
characters. As soon as either the delimiter or the input data is changed, a 
failure is triggered. The failure does not clearly show when and how it 
happens; analysis of the test failures shows that only a specific combination 
of input data, split size and buffer size triggers it.

Based on testing, the record duplication occurs only if all of the following 
hold:
- the first character(s) of the delimiter are part of the record data, for 
example: 
  1) the delimiter is {{\+=}} and the data contains a {{\+}} that is not 
followed by {{=}}
  2) the delimiter is {{\+=\+=}} and the data contains {{\+=\+}} that is not 
followed by {{=}}
- a delimiter character is found at the split boundary, i.e. it is the last 
character before the split ends
- a fill of the buffer is triggered to finish processing the record

The underlying problem is that we set a flag called {{needAdditionalRecord}} in 
the {{UncompressedSplitLineReader}} when we fill the buffer and have 
encountered part of a delimiter in combination with a split. We keep track of 
this in the ambiguous character count. However, it turns out that if the 
character(s) found after that point do not belong to a delimiter, we never 
unset {{needAdditionalRecord}}. This causes the next record to be read twice, 
which is the duplication we see.
The solution is to unset the flag when we detect that we are not processing a 
delimiter. Currently we only add the ambiguous characters back to the record 
being read and reset the count to 0; at that same point we also need to unset 
the flag.
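
For illustration, a greatly simplified sketch of the state handling described 
above. This is not the actual {{UncompressedSplitLineReader}} code; the scan 
simply restarts the delimiter match at the current byte, and the flag is set 
on every partial match only to keep the sketch short:

{code:java}
// Simplified illustration only -- this is NOT the actual Hadoop
// UncompressedSplitLineReader code. It only shows the state handling described
// above: when bytes that matched a prefix of the delimiter turn out to be
// ordinary record data, the ambiguous-byte count is reset AND the
// needAdditionalRecord flag is cleared again (the step that was missing).
import java.nio.charset.StandardCharsets;

public class AmbiguousDelimiterSketch {

  private boolean needAdditionalRecord = false;

  /** Counts records in {@code data} separated by a multi-byte delimiter. */
  public int countRecords(byte[] data, byte[] delimiter) {
    int records = 0;
    int ambiguousByteCount = 0;              // delimiter bytes matched so far
    StringBuilder record = new StringBuilder();

    for (byte b : data) {
      if (ambiguousByteCount > 0 && b != delimiter[ambiguousByteCount]) {
        // The held-back bytes were record data after all: put them back into
        // the record, reset the count AND clear the flag. Forgetting the last
        // step is what makes the reader hand out the next record twice.
        record.append(new String(delimiter, 0, ambiguousByteCount,
            StandardCharsets.UTF_8));
        ambiguousByteCount = 0;
        needAdditionalRecord = false;        // the "unset" described above
      }
      if (b == delimiter[ambiguousByteCount]) {
        ambiguousByteCount++;
        // The real reader only sets this when the partial match coincides with
        // a buffer refill at the split boundary; the sketch sets it on every
        // partial match just to keep things short.
        needAdditionalRecord = true;
        if (ambiguousByteCount == delimiter.length) {
          records++;                         // a full delimiter ends a record
          record.setLength(0);
          ambiguousByteCount = 0;
          needAdditionalRecord = false;
        }
      } else {
        record.append((char) b);
      }
    }
    if (record.length() > 0 || ambiguousByteCount > 0) {
      records++;                             // trailing data without delimiter
    }
    return records;
  }

  public static void main(String[] args) {
    AmbiguousDelimiterSketch sketch = new AmbiguousDelimiterSketch();
    byte[] data = "abc++def+++ghi++".getBytes(StandardCharsets.UTF_8);
    byte[] delimiter = "+++".getBytes(StandardCharsets.UTF_8);
    // Prints "2 / true": two records ("abc++def" and "ghi++"), with a partial
    // delimiter still pending at the end of the data.
    System.out.println(sketch.countRecords(data, delimiter)
        + " / " + sketch.needAdditionalRecord);
  }
}
{code}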

The patch was developed based on JUnit tests that exercise the split and 
buffer settings in combination with multiple delimiter types and different 
inputs. All cases now give a consistent record count and the correct position 
inside the data.
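
As a rough illustration of the kind of test described here (not the JUnit test 
from the patch; the class name, split size and buffer size are assumptions for 
this sketch, since the failing combinations vary):

{code:java}
// Rough sketch only: read one input with a multi-byte delimiter across every
// split and collect the records that come back.
import java.io.File;
import java.io.FileWriter;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptID;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl;

public class MultiByteDelimiterSplitSketch {

  public static void main(String[] args) throws Exception {
    String input = "abc++def+++ghi++";            // scenario 2 from the report
    byte[] delimiter = "+++".getBytes("UTF-8");
    int splitSize = 4;

    File file = File.createTempFile("delimiter-sketch", ".txt");
    file.deleteOnExit();
    try (FileWriter writer = new FileWriter(file)) {
      writer.write(input);
    }

    Configuration conf = new Configuration();
    conf.setInt("io.file.buffer.size", 10);       // force buffer refills
    Path path = new Path(file.toURI().toString());
    long length = input.length();
    List<String> records = new ArrayList<String>();

    // Walk every split of the file and collect all records that are returned.
    for (long offset = 0; offset < length; offset += splitSize) {
      FileSplit split = new FileSplit(path, offset,
          Math.min(splitSize, length - offset), (String[]) null);
      LineRecordReader reader = new LineRecordReader(delimiter);
      reader.initialize(split,
          new TaskAttemptContextImpl(conf, new TaskAttemptID()));
      while (reader.nextKeyValue()) {
        records.add(reader.getCurrentValue().toString());
      }
      reader.close();
    }

    // Exactly two records are expected: "abc++def" and "ghi++". With the bug
    // and an unlucky split/buffer combination, one of them comes back twice.
    System.out.println(records);
  }
}
{code}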

> multibyte delimiters with LineRecordReader cause duplicate records
> --
>
> Key: MAPREDUCE-6549
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.7.2
>Reporter: Dustin Cote
>Assignee: Wilfred Spiegelenburg
> Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch
>
>
> LineRecordReader currently produces duplicate records under certain 
> scenarios such as:
> 1) input string: "abc+++def++ghi++" 
> delimiter string: "+++" 
> test passes with all sizes of the split 
> 2) input string: "abc++def+++ghi++" 
> delimiter string: "+++" 
> test fails with a split size of 4 
> 3) input string: "abc+++def++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 5 
> 4) input string: "abc+++defg++hij++" 
> delimiter string: "++" 
> test fails with a split size of 4 
> 5) input string: "abc++def+++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 9 





[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records

2015-11-15 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated MAPREDUCE-6549:
-
Status: Patch Available  (was: Open)

> multibyte delimiters with LineRecordReader cause duplicate records
> --
>
> Key: MAPREDUCE-6549
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.7.2
>Reporter: Dustin Cote
>Assignee: Wilfred Spiegelenburg
> Attachments: MAPREDUCE-6549-1.patch, MAPREDUCE-6549-2.patch
>
>
> LineRecordReader currently produces duplicate records under certain 
> scenarios such as:
> 1) input string: "abc+++def++ghi++" 
> delimiter string: "+++" 
> test passes with all sizes of the split 
> 2) input string: "abc++def+++ghi++" 
> delimiter string: "+++" 
> test fails with a split size of 4 
> 3) input string: "abc+++def++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 5 
> 4) input string: "abc+++defg++hij++" 
> delimiter string: "++" 
> test fails with a split size of 4 
> 5) input string: "abc++def+++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 9 





[jira] [Commented] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records

2015-11-15 Thread Wilfred Spiegelenburg (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15006080#comment-15006080
 ] 

Wilfred Spiegelenburg commented on MAPREDUCE-6549:
--

[~cotedm] I have picked up the jira and have a fully tested and working patch 
for the issue.

> multibyte delimiters with LineRecordReader cause duplicate records
> --
>
> Key: MAPREDUCE-6549
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.7.2
>Reporter: Dustin Cote
>Assignee: Wilfred Spiegelenburg
> Attachments: MAPREDUCE-6549-1.patch
>
>
> LineRecordReader currently produces duplicate records under certain 
> scenarios such as:
> 1) input string: "abc+++def++ghi++" 
> delimiter string: "+++" 
> test passes with all sizes of the split 
> 2) input string: "abc++def+++ghi++" 
> delimiter string: "+++" 
> test fails with a split size of 4 
> 3) input string: "abc+++def++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 5 
> 4) input string: "abc+++defg++hij++" 
> delimiter string: "++" 
> test fails with a split size of 4 
> 5) input string: "abc++def+++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 9 





[jira] [Assigned] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records

2015-11-15 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg reassigned MAPREDUCE-6549:


Assignee: Wilfred Spiegelenburg  (was: Dustin Cote)

> multibyte delimiters with LineRecordReader cause duplicate records
> --
>
> Key: MAPREDUCE-6549
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.7.2
>Reporter: Dustin Cote
>Assignee: Wilfred Spiegelenburg
> Attachments: MAPREDUCE-6549-1.patch
>
>
> LineRecordReader currently produces duplicate records under certain 
> scenarios such as:
> 1) input string: "abc+++def++ghi++" 
> delimiter string: "+++" 
> test passes with all sizes of the split 
> 2) input string: "abc++def+++ghi++" 
> delimiter string: "+++" 
> test fails with a split size of 4 
> 3) input string: "abc+++def++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 5 
> 4) input string: "abc+++defg++hij++" 
> delimiter string: "++" 
> test fails with a split size of 4 
> 5) input string: "abc++def+++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 9 





[jira] [Updated] (MAPREDUCE-6549) multibyte delimiters with LineRecordReader cause duplicate records

2015-11-15 Thread Wilfred Spiegelenburg (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wilfred Spiegelenburg updated MAPREDUCE-6549:
-
Status: Open  (was: Patch Available)

I tried the change that you made in the patch and it fails the current tests.
The patch changes one test (TestLineRecordReader.java) but we have two 
versions of that test. The mapred version is unchanged and now fails. The 
mapreduce version works, but as soon as I change the delimiter back it also 
fails. That means the change does not fix the issue.

It also brings the two tests out of sync, which is not correct.

> multibyte delimiters with LineRecordReader cause duplicate records
> --
>
> Key: MAPREDUCE-6549
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6549
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 2.7.2
>Reporter: Dustin Cote
>Assignee: Dustin Cote
> Attachments: MAPREDUCE-6549-1.patch
>
>
> LineRecordReader currently produces duplicate records under certain 
> scenarios such as:
> 1) input string: "abc+++def++ghi++" 
> delimiter string: "+++" 
> test passes with all sizes of the split 
> 2) input string: "abc++def+++ghi++" 
> delimiter string: "+++" 
> test fails with a split size of 4 
> 3) input string: "abc+++def++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 5 
> 4) input string: "abc+++defg++hij++" 
> delimiter string: "++" 
> test fails with a split size of 4 
> 5) input string: "abc++def+++ghi++" 
> delimiter string: "++" 
> test fails with a split size of 9 





[jira] [Commented] (MAPREDUCE-6548) Jobs executed can be configurated with specific users and time hours

2015-11-15 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005942#comment-15005942
 ] 

Steve Loughran commented on MAPREDUCE-6548:
---

Modifying the job client will do nothing for any other work submitted to the 
YARN scheduler: Hive+Tez, Spark on YARN, etc.

What about having a special queue for the critical job which can preempt 
others' work? That way, even if the cluster is full, your job wins, and their 
work just gets blocked until there is capacity.

> Jobs executed can be configurated with specific users and time hours
> 
>
> Key: MAPREDUCE-6548
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6548
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: job submission
>Affects Versions: 2.7.1
>Reporter: Lin Yiqun
>Assignee: Lin Yiqun
> Attachments: MAPREDUCE-6548.001.patch
>
>
> In recent Hadoop versions the system places no limit on which users can 
> execute their jobs if you don't configure ACLs. I also find that the ACLs 
> are only checked in the IPC layer and are not applied at job submission. 
> This does not cover the case where I have a very important job that I want 
> to run between 0:00 and 9:00. To let this job run quickly, I don't want 
> other users' jobs to execute during these hours, so that I can see the 
> result the next morning. So maybe we can let jobs be executed only by 
> specific users during specific hours.





[jira] [Commented] (MAPREDUCE-6542) HistoryViewer use SimpleDateFormat,But SimpleDateFormat is not threadsafe

2015-11-15 Thread Daniel Templeton (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005935#comment-15005935
 ] 

Daniel Templeton commented on MAPREDUCE-6542:
-

[~piaoyu zhang], thanks for the new patch.  We're getting closer.  Three more 
things I'd like to see:

# Mark the old {{getFormattedTimeWithDiff()}} method as {{@Deprecated}}
# Refactor both {{getFormattedTimeWithDiff()}} methods to call a new 
private method that contains all the replicated logic.  Something like 
{{private static String getFormattedTimeWithDiff(String formattedFinishTime, 
long startTime)}}
# Use {{StringBuilder}} instead of {{StringBuffer}}.  {{StringBuffer}} is MT 
safe, but the places where it's used are locally scoped, so necessarily 
single-threaded.  {{StringBuilder}} is faster because it's not MT-safe.  
(http://stackoverflow.com/questions/355089/stringbuilder-and-stringbuffer)

If you're feeling ambitious, it would also be nice to add a unit test that 
tries to force concurrent access to the function.
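
A rough sketch of the shape suggested above (hypothetical class and helper 
names; the private method also takes {{finishTime}} here because the time 
diff needs it, and {{formatDiff}} stands in for the existing 
{{StringUtils.formatTimeDiff()}}):

{code:java}
// Hypothetical sketch of the suggested refactor; not the actual
// org.apache.hadoop.util.StringUtils code or the patch.
import java.text.DateFormat;
import java.util.Date;
import org.apache.commons.lang.time.FastDateFormat;

public class FormattedTimeSketch {

  /** Old entry point, kept for compatibility but marked deprecated. */
  @Deprecated
  public static String getFormattedTimeWithDiff(DateFormat dateFormat,
      long finishTime, long startTime) {
    String formatted = finishTime > 0 ? dateFormat.format(new Date(finishTime)) : "";
    return getFormattedTimeWithDiff(formatted, finishTime, startTime);
  }

  /** New thread-safe entry point. */
  public static String getFormattedTimeWithDiff(FastDateFormat dateFormat,
      long finishTime, long startTime) {
    String formatted = finishTime > 0 ? dateFormat.format(finishTime) : "";
    return getFormattedTimeWithDiff(formatted, finishTime, startTime);
  }

  /** Shared logic that both public overloads delegate to. */
  private static String getFormattedTimeWithDiff(String formattedFinishTime,
      long finishTime, long startTime) {
    // StringBuilder, not StringBuffer: locally scoped, single-threaded, faster.
    StringBuilder buf = new StringBuilder();
    if (finishTime > 0) {
      buf.append(formattedFinishTime);
      if (startTime > 0) {
        buf.append(" (").append(formatDiff(finishTime, startTime)).append(")");
      }
    }
    return buf.toString();
  }

  /** Stand-in for StringUtils.formatTimeDiff(finishTime, startTime). */
  private static String formatDiff(long finishTime, long startTime) {
    return (finishTime - startTime) / 1000 + "sec";
  }
}
{code}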

> HistoryViewer use SimpleDateFormat,But SimpleDateFormat is not threadsafe
> -
>
> Key: MAPREDUCE-6542
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6542
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobhistoryserver
>Affects Versions: 2.2.0, 2.7.1
> Environment: CentOS6.5 Hadoop  
>Reporter: zhangyubiao
>Assignee: zhangyubiao
> Attachments: MAPREDUCE-6542-v2.patch, MAPREDUCE-6542-v3.patch, 
> MAPREDUCE-6542.patch
>
>
> I used SimpleDateFormat to format the times when parsing the JobHistory file:
> private static final SimpleDateFormat dateFormat =
> new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
> public static String getJobDetail(JobInfo job) {
> StringBuffer jobDetails = new StringBuffer("");
> SummarizedJob ts = new SummarizedJob(job);
> jobDetails.append(job.getJobId().toString().trim()).append("\t");
> jobDetails.append(job.getUsername()).append("\t");
> jobDetails.append(job.getJobname().replaceAll("\\n", "")).append("\t");
> jobDetails.append(job.getJobQueueName()).append("\t");
> jobDetails.append(job.getPriority()).append("\t");
> jobDetails.append(job.getJobConfPath()).append("\t");
> jobDetails.append(job.getUberized()).append("\t");
> jobDetails.append(dateFormat.format(job.getSubmitTime())).append("\t");
> jobDetails.append(dateFormat.format(job.getLaunchTime())).append("\t");
> jobDetails.append(dateFormat.format(job.getFinishTime())).append("\t");
> return jobDetails.toString();
> }
> But when I query the SubmitTime and LaunchTime in Hive and compare them with 
> the times from the JobHistory file, the submitTime and launchTime are wrong.
> Finally, I changed to FastDateFormat to format the times and the times 
> became correct.
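
For reference, a minimal sketch of the thread-safe replacement described 
above, assuming commons-lang's {{FastDateFormat}} (class and method names here 
are illustrative only):

{code:java}
// Minimal sketch: FastDateFormat is immutable and thread-safe, so a single
// shared instance can format timestamps from many threads, unlike the shared
// SimpleDateFormat in the snippet above.
import org.apache.commons.lang.time.FastDateFormat;

public class DateFormatSketch {
  private static final FastDateFormat DATE_FORMAT =
      FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss");

  public static String format(long timeMillis) {
    return DATE_FORMAT.format(timeMillis);
  }

  public static void main(String[] args) {
    System.out.println(format(System.currentTimeMillis()));
  }
}
{code}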





[jira] [Updated] (MAPREDUCE-6548) Jobs executed can be configurated with specific users and time hours

2015-11-15 Thread Lin Yiqun (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lin Yiqun updated MAPREDUCE-6548:
-
Affects Version/s: 2.7.1

> Jobs executed can be configurated with specific users and time hours
> 
>
> Key: MAPREDUCE-6548
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6548
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: job submission
>Affects Versions: 2.7.1
>Reporter: Lin Yiqun
>Assignee: Lin Yiqun
> Attachments: MAPREDUCE-6548.001.patch
>
>
> In recent Hadoop versions the system places no limit on which users can 
> execute their jobs if you don't configure ACLs. I also find that the ACLs 
> are only checked in the IPC layer and are not applied at job submission. 
> This does not cover the case where I have a very important job that I want 
> to run between 0:00 and 9:00. To let this job run quickly, I don't want 
> other users' jobs to execute during these hours, so that I can see the 
> result the next morning. So maybe we can let jobs be executed only by 
> specific users during specific hours.





[jira] [Commented] (MAPREDUCE-6548) Jobs executed can be configurated with specific users and time hours

2015-11-15 Thread Lin Yiqun (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15005817#comment-15005817
 ] 

Lin Yiqun commented on MAPREDUCE-6548:
--

Hi [~bikassaha], I looked through YARN-1051. I think it aims to let the RM 
admit jobs at admission-control time and guarantee those jobs enough 
resources. But the main point of my idea is to limit users/jobs and time hours 
with a simple check in the job client, and it is not only about guaranteeing 
that important jobs get enough resources; that is just one scenario this 
feature would satisfy. So I would add a users/time admission control that does 
not allow non-matching jobs to be submitted to the cluster. This is easier to 
achieve in the job client than in YARN.
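
A purely hypothetical sketch of the kind of job-client check proposed here; 
the configuration keys and the hook point are illustrative assumptions, not 
existing Hadoop properties or APIs:

{code:java}
// Hypothetical sketch only: a client-side users/time-window admission check.
import java.io.IOException;
import java.util.Arrays;
import java.util.Calendar;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class SubmissionWindowCheck {

  /** Throws if the current user is outside the allowed users/hours window. */
  public static void check(Configuration conf) throws IOException {
    // Hypothetical keys: a user list and an allowed window [startHour, endHour).
    String[] allowedUsers =
        conf.getStrings("mapreduce.client.allowed-users", new String[0]);
    int startHour = conf.getInt("mapreduce.client.allowed-start-hour", 0);
    int endHour = conf.getInt("mapreduce.client.allowed-end-hour", 24);

    String user = UserGroupInformation.getCurrentUser().getShortUserName();
    int hour = Calendar.getInstance().get(Calendar.HOUR_OF_DAY);

    boolean userAllowed = allowedUsers.length == 0
        || Arrays.asList(allowedUsers).contains(user);
    boolean hourAllowed = hour >= startHour && hour < endHour;

    if (!userAllowed || !hourAllowed) {
      throw new IOException("Job submission by '" + user + "' at hour " + hour
          + " is outside the configured users/time window");
    }
  }
}
{code}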

> Jobs executed can be configurated with specific users and time hours
> 
>
> Key: MAPREDUCE-6548
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6548
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: job submission
>Reporter: Lin Yiqun
>Assignee: Lin Yiqun
> Attachments: MAPREDUCE-6548.001.patch
>
>
> In recent Hadoop versions the system places no limit on which users can 
> execute their jobs if you don't configure ACLs. I also find that the ACLs 
> are only checked in the IPC layer and are not applied at job submission. 
> This does not cover the case where I have a very important job that I want 
> to run between 0:00 and 9:00. To let this job run quickly, I don't want 
> other users' jobs to execute during these hours, so that I can see the 
> result the next morning. So maybe we can let jobs be executed only by 
> specific users during specific hours.


