[jira] [Commented] (MAPREDUCE-5902) JobHistoryServer (HistoryFileManager) needs more debug logs, fails to pick up jobs with % characters in the name.

2014-05-29 Thread jay vyas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14012350#comment-14012350
 ] 

jay vyas commented on MAPREDUCE-5902:
-

I can work on a patch for this; is there general agreement that better 
logging for this class would be ideal?

 JobHistoryServer (HistoryFileManager) needs more debug logs, fails to pick up 
 jobs with % characters in the name.
 -

 Key: MAPREDUCE-5902
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5902
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobhistoryserver
Reporter: jay vyas
   Original Estimate: 1h
  Remaining Estimate: 1h

 1) JobHistoryServer sometimes skips over certain history files, and never 
 serves them as completed.
 2) In addition to skipping these files, the JobHistoryServer doesn't 
 effectively log which files are being skipped, or why.
 So in addition to determining why certain types of files are skipped (file 
 name length doesn't appear to be the reason; rather, it appears that % 
 characters throw the JobHistoryServer filter off), we should log completed 
 .jhist files which are available in the mr-history/tmp directory yet are 
 skipped for some reason. 
 *Regarding the actual bug: skipping completed jhist files* 
 We will need an author of the JobHistoryServer, I think, to chime in on what 
 types of paths for jobs are actually valid.  It appears that at least some 
 characters, if present in a job name, will make the JobHistoryServer skip 
 recognition of a completed jhist file.
 *Regarding logging*
 It would be extremely useful, then, to have a couple of guarded logs at this 
 level of the code, so that we can see, in the log folders, why files are 
 being filtered out, i.e. whether it is due to filtering or visibility.
 {noformat}
   private static List<FileStatus> scanDirectory(Path path, FileContext fc,
       PathFilter pathFilter) throws IOException {
     path = fc.makeQualified(path);
     List<FileStatus> jhStatusList = new ArrayList<FileStatus>();
     RemoteIterator<FileStatus> fileStatusIter = fc.listStatus(path);
     while (fileStatusIter.hasNext()) {
       FileStatus fileStatus = fileStatusIter.next();
       Path filePath = fileStatus.getPath();
       if (fileStatus.isFile() && pathFilter.accept(filePath)) {
         jhStatusList.add(fileStatus);
       }
     }
     return jhStatusList;
   }
 {noformat}
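 As a minimal sketch (assuming the class's existing commons-logging {{LOG}}; 
 the message wording is hypothetical), the guard could look like this inside 
 the loop above:
 {code}
     if (fileStatus.isFile() && pathFilter.accept(filePath)) {
       jhStatusList.add(fileStatus);
     } else if (LOG.isDebugEnabled()) {
       // Guarded so the string building only runs at DEBUG level.
       LOG.debug("Skipping " + filePath + " (isFile=" + fileStatus.isFile()
           + ", acceptedByFilter=" + pathFilter.accept(filePath) + ")");
     }
 {code}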
 *Reproducing* 
 I was able to reproduce this bug by writing a custom mapreduce job with a 
 job name which contained % characters.  I have also seen this with a version 
 of the Mahout ParallelALSFactorizationJob, which includes - characters in 
 its name that wind up getting replaced by %2D later on at some stage in the 
 job pipeline.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5902) JobHistoryServer (HistoryFileManager) needs more debug logs, fails to pick up jobs with % characters in the name.

2014-05-28 Thread jay vyas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011240#comment-14011240
 ] 

jay vyas commented on MAPREDUCE-5902:
-

Sure I can try those.  

In general, what is the contract for a Hadoop file system: should it support 
any character in a file name? Are there certain escape sequences that have a 
particular meaning?

 JobHistoryServer (HistoryFileManager) needs more debug logs, fails to pick up 
 jobs with % characters in the name.
 -

 Key: MAPREDUCE-5902
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5902
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobhistoryserver
Reporter: jay vyas
   Original Estimate: 1h
  Remaining Estimate: 1h

 1) JobHistoryServer sometimes skips over certain history files, and never 
 serves them as completed.
 2) In addition to skipping these files, the JobHistoryServer doesn't 
 effectively log which files are being skipped, or why.
 So in addition to determining why certain types of files are skipped (file 
 name length doesn't appear to be the reason; rather, it appears that % 
 characters throw the JobHistoryServer filter off), we should log completed 
 .jhist files which are available in the mr-history/tmp directory yet are 
 skipped for some reason. 
 *Regarding the actual bug: skipping completed jhist files* 
 We will need an author of the JobHistoryServer, I think, to chime in on what 
 types of paths for jobs are actually valid.  It appears that at least some 
 characters, if present in a job name, will make the JobHistoryServer skip 
 recognition of a completed jhist file.
 *Regarding logging*
 It would be extremely useful, then, to have a couple of guarded logs at this 
 level of the code, so that we can see, in the log folders, why files are 
 being filtered out, i.e. whether it is due to filtering or visibility.
 {noformat}
   private static List<FileStatus> scanDirectory(Path path, FileContext fc,
       PathFilter pathFilter) throws IOException {
     path = fc.makeQualified(path);
     List<FileStatus> jhStatusList = new ArrayList<FileStatus>();
     RemoteIterator<FileStatus> fileStatusIter = fc.listStatus(path);
     while (fileStatusIter.hasNext()) {
       FileStatus fileStatus = fileStatusIter.next();
       Path filePath = fileStatus.getPath();
       if (fileStatus.isFile() && pathFilter.accept(filePath)) {
         jhStatusList.add(fileStatus);
       }
     }
     return jhStatusList;
   }
 {noformat}
 *Reproducing* 
 I was able to reproduce this bug by writing a custom mapreduce job with a 
 job name which contained % characters.  I have also seen this with a version 
 of the Mahout ParallelALSFactorizationJob, which includes - characters in 
 its name that wind up getting replaced by %2D later on at some stage in the 
 job pipeline.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5902) JobHistoryServer (HistoryFileManager) needs more debug logs, fails to pick up jobs with % characters in the name.

2014-05-28 Thread jay vyas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14011953#comment-14011953
 ] 

jay vyas commented on MAPREDUCE-5902:
-

I've confirmed that this is a FileSystem issue: I'm using an alternative 
filesystem, and our plugin behaves differently than HDFS.  We can go back to 
the original goal for this JIRA:

*When the JobHistoryServer scans directories, it should debug-log exactly the 
files which it sees, so that users can clearly tell, just from looking at the 
logs, whether certain files aren't readable to the JHS.*

 JobHistoryServer (HistoryFileManager) needs more debug logs, fails to pick up 
 jobs with % characters in the name.
 -

 Key: MAPREDUCE-5902
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5902
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobhistoryserver
Reporter: jay vyas
   Original Estimate: 1h
  Remaining Estimate: 1h

 1) JobHistoryServer sometimes skips over certain history files, and never 
 serves them as completed.
 2) In addition to skipping these files, the JobHistoryServer doesn't 
 effectively log which files are being skipped, or why.
 So in addition to determining why certain types of files are skipped (file 
 name length doesn't appear to be the reason; rather, it appears that % 
 characters throw the JobHistoryServer filter off), we should log completed 
 .jhist files which are available in the mr-history/tmp directory yet are 
 skipped for some reason. 
 *Regarding the actual bug: skipping completed jhist files* 
 We will need an author of the JobHistoryServer, I think, to chime in on what 
 types of paths for jobs are actually valid.  It appears that at least some 
 characters, if present in a job name, will make the JobHistoryServer skip 
 recognition of a completed jhist file.
 *Regarding logging*
 It would be extremely useful, then, to have a couple of guarded logs at this 
 level of the code, so that we can see, in the log folders, why files are 
 being filtered out, i.e. whether it is due to filtering or visibility.
 {noformat}
   private static List<FileStatus> scanDirectory(Path path, FileContext fc,
       PathFilter pathFilter) throws IOException {
     path = fc.makeQualified(path);
     List<FileStatus> jhStatusList = new ArrayList<FileStatus>();
     RemoteIterator<FileStatus> fileStatusIter = fc.listStatus(path);
     while (fileStatusIter.hasNext()) {
       FileStatus fileStatus = fileStatusIter.next();
       Path filePath = fileStatus.getPath();
       if (fileStatus.isFile() && pathFilter.accept(filePath)) {
         jhStatusList.add(fileStatus);
       }
     }
     return jhStatusList;
   }
 {noformat}
 *Reproducing* 
 I was able to reproduce this bug by writing a custom mapreduce job with a 
 job name which contained % characters.  I have also seen this with a version 
 of the Mahout ParallelALSFactorizationJob, which includes - characters in 
 its name that wind up getting replaced by %2D later on at some stage in the 
 job pipeline.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5902) JobHistoryServer (HistoryFileManager) needs more debug logs, fails to pick up jobs with % characters in the name.

2014-05-27 Thread jay vyas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jay vyas updated MAPREDUCE-5902:


Description: 
1) JobHistoryServer sometimes skips over certain history files, and never 
serves them as completed.

2) In addition to skipping these files, the JobHistoryServer doesn't 
effectively log which files are being skipped, or why.

So in addition to determining why certain types of files are skipped (file name 
length doesn't appear to be the reason; rather, it appears that % 
characters throw the JobHistoryServer filter off), we should log completed 
.jhist files which are available in the mr-history/tmp directory yet are 
skipped for some reason. 

*Regarding the actual bug: skipping completed jhist files* 

We will need an author of the JobHistoryServer, I think, to chime in on what 
types of paths for jobs are actually valid.  It appears that at least some 
characters, if present in a job name, will make the JobHistoryServer skip 
recognition of a completed jhist file.

*Regarding logging*
It would be extremely useful, then, to have a couple of guarded logs at this 
level of the code, so that we can see, in the log folders, why files are being 
filtered out, i.e. whether it is due to filtering or visibility.

{noformat}

  private static List<FileStatus> scanDirectory(Path path, FileContext fc,
      PathFilter pathFilter) throws IOException {
    path = fc.makeQualified(path);
    List<FileStatus> jhStatusList = new ArrayList<FileStatus>();
    RemoteIterator<FileStatus> fileStatusIter = fc.listStatus(path);
    while (fileStatusIter.hasNext()) {
      FileStatus fileStatus = fileStatusIter.next();
      Path filePath = fileStatus.getPath();
      if (fileStatus.isFile() && pathFilter.accept(filePath)) {
        jhStatusList.add(fileStatus);
      }
    }
    return jhStatusList;
  }

{noformat}

*Reproducing* 

I was able to reproduce this bug by writing a custom mapreduce job with a job 
name which contained % characters.  I have also seen this with a version of 
the Mahout ParallelALSFactorizationJob, which includes - characters in its 
name that wind up getting replaced by %2D later on at some stage in the job 
pipeline.


  was:
1) JobHistoryServer sometimes skips over certain history files, and never 
serves them as completed.

2) In addition to skipping these files, the JobHistoryServer doesn't 
effectively log which files are being skipped, or why.

So in addition to determining why certain types of files are skipped (file name 
length doesn't appear to be the reason; rather, it appears that % 
characters throw the JobHistoryServer filter off), we should log completed 
.jhist files which are available in the mr-history/tmp directory yet are 
skipped for some reason. 

** Regarding the actual bug : Skipping completed jhist files ** 

We will need an author of the JobHistoryServer, I think, to chime in on what 
types of paths for jobs are actually valid.  It appears that at least some 
characters, if present in a job name, will make the JobHistoryServer skip 
recognition of a completed jhist file.

** Regarding logging **
It would be extremely useful, then, to have a couple of guarded logs at this 
level of the code, so that we can see, in the log folders, why files are being 
filtered out, i.e. whether it is due to filtering or visibility.

{noformat}

  private static List<FileStatus> scanDirectory(Path path, FileContext fc,
      PathFilter pathFilter) throws IOException {
    path = fc.makeQualified(path);
    List<FileStatus> jhStatusList = new ArrayList<FileStatus>();
    RemoteIterator<FileStatus> fileStatusIter = fc.listStatus(path);
    while (fileStatusIter.hasNext()) {
      FileStatus fileStatus = fileStatusIter.next();
      Path filePath = fileStatus.getPath();
      if (fileStatus.isFile() && pathFilter.accept(filePath)) {
        jhStatusList.add(fileStatus);
      }
    }
    return jhStatusList;
  }

{noformat}

** Reproducing ** 

I was able to reproduce this bug by writing a custom mapreduce job with a job 
name which contained % characters.  I have also seen this with a version of 
the Mahout ParallelALSFactorizationJob, which includes - characters in its 
name that wind up getting replaced by %2D later on at some stage in the job 
pipeline.



 JobHistoryServer (HistoryFileManager) needs more debug logs, fails to pick up 
 jobs with % characters in the name.
 -

 Key: MAPREDUCE-5902
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5902
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobhistoryserver
Reporter: jay vyas
   Original Estimate: 1h
  Remaining Estimate: 1h

 1) JobHistoryServer sometimes skips over certain history files, and never 
 serves them as 

[jira] [Commented] (MAPREDUCE-5902) JobHistoryServer (HistoryFileManager) needs more debug logs, fails to pick up jobs with % characters in the name.

2014-05-27 Thread jay vyas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14009793#comment-14009793
 ] 

jay vyas commented on MAPREDUCE-5902:
-

There is an identical JIRA for the web front end, so I think these should be 
linked, as they are pretty similar and happen in the same component, 
although at different parts of the stack.

 JobHistoryServer (HistoryFileManager) needs more debug logs, fails to pick up 
 jobs with % characters in the name.
 -

 Key: MAPREDUCE-5902
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5902
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobhistoryserver
Reporter: jay vyas
   Original Estimate: 1h
  Remaining Estimate: 1h

 1) JobHistoryServer sometimes skips over certain history files, and never 
 serves them as completed.
 2) In addition to skipping these files, the JobHistoryServer doesn't 
 effectively log which files are being skipped, or why.
 So in addition to determining why certain types of files are skipped (file 
 name length doesn't appear to be the reason; rather, it appears that % 
 characters throw the JobHistoryServer filter off), we should log completed 
 .jhist files which are available in the mr-history/tmp directory yet are 
 skipped for some reason. 
 *Regarding the actual bug: skipping completed jhist files* 
 We will need an author of the JobHistoryServer, I think, to chime in on what 
 types of paths for jobs are actually valid.  It appears that at least some 
 characters, if present in a job name, will make the JobHistoryServer skip 
 recognition of a completed jhist file.
 *Regarding logging*
 It would be extremely useful, then, to have a couple of guarded logs at this 
 level of the code, so that we can see, in the log folders, why files are 
 being filtered out, i.e. whether it is due to filtering or visibility.
 {noformat}
   private static List<FileStatus> scanDirectory(Path path, FileContext fc,
       PathFilter pathFilter) throws IOException {
     path = fc.makeQualified(path);
     List<FileStatus> jhStatusList = new ArrayList<FileStatus>();
     RemoteIterator<FileStatus> fileStatusIter = fc.listStatus(path);
     while (fileStatusIter.hasNext()) {
       FileStatus fileStatus = fileStatusIter.next();
       Path filePath = fileStatus.getPath();
       if (fileStatus.isFile() && pathFilter.accept(filePath)) {
         jhStatusList.add(fileStatus);
       }
     }
     return jhStatusList;
   }
 {noformat}
 *Reproducing* 
 I was able to reproduce this bug by writing a custom mapreduce job with a 
 job name which contained % characters.  I have also seen this with a version 
 of the Mahout ParallelALSFactorizationJob, which includes - characters in 
 its name that wind up getting replaced by %2D later on at some stage in the 
 job pipeline.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5805) Unable to parse launch time from job history file

2014-05-27 Thread jay vyas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14010737#comment-14010737
 ] 

jay vyas commented on MAPREDUCE-5805:
-

Any possible relation of this to MAPREDUCE-5902?

 Unable to parse launch time from job history file
 -

 Key: MAPREDUCE-5805
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5805
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobhistoryserver
Affects Versions: 2.3.0
Reporter: Fengdong Yu
Assignee: Akira AJISAKA
 Fix For: 2.4.0

 Attachments: MAPREDUCE-5805.patch


 when a job completes, there are WARN complaints in the log:
 {code}
 2014-03-19 13:31:10,036 WARN 
 org.apache.hadoop.mapreduce.v2.jobhistory.FileNameIndexUtils: Unable to parse 
 launch time from job history file 
 job_1395204058904_0003-1395206473646-root-test_one_word-1395206966214-4-2-SUCCEEDED-root.test-queue-1395206480070.jhist
  : java.lang.NumberFormatException: For input string: queue
 {code}
 because there is a (-) in the queue name 'test-queue'; we split the job 
 history file name by (-) and take the ninth item as the job start time.
 FileNameIndexUtils.java
 {code}
 private static final int JOB_START_TIME_INDEX = 9;
 {code}
 but there is another potential issue:
 if I also include '-' in the job name (test_one_word in this case), it is 
 misparsed entirely.
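 A hedged sketch (plain Java, using the file name from the log above) of why 
 index-based splitting breaks once the queue name itself contains a dash:
 {code}
 // The dash inside "test-queue" shifts every index after it by one.
 String name = "job_1395204058904_0003-1395206473646-root-test_one_word-"
     + "1395206966214-4-2-SUCCEEDED-root.test-queue-1395206480070.jhist";
 String[] parts = name.split("-");
 System.out.println(parts[9]); // prints "queue", so Long.parseLong fails
 {code}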



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5902) JobHistoryServer (HistoryFileManager) needs more debug logs.

2014-05-23 Thread jay vyas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14007441#comment-14007441
 ] 

jay vyas commented on MAPREDUCE-5902:
-

FYI, a concrete example: these paths, whose job names seem to have been 
truncated at some point (i.e. {{ItemRatingVectorsMappe}} is clearly missing 
an R), are not getting picked up by the JobHistoryServer.

{noformat}
└── tom
├── 
job_1400794299637_0010-1400808860349-tom-ParallelALSFactorizationJob%2DItemRatingVectorsMappe-1400808889684-1-1-SUCCEEDED-default.jhist
├── job_1400794299637_0010_conf.xml
├── job_1400794299637_0010.summary
├── 
job_1400794299637_0011-1400808893300-tom-ParallelALSFactorizationJob%2DTransposeMapper%2DReduce-1400808924396-1-1-SUCCEEDED-default.jhist
├── job_1400794299637_0011_conf.xml
├── job_1400794299637_0011.summary
├── 
job_1400794299637_0012-1400808926898-tom-ParallelALSFactorizationJob%2DAverageRatingMapper%2DRe-1400808951099-1-1-SUCCEEDED-default.jhist
├── job_1400794299637_0012_conf.xml
└── job_1400794299637_0012.summary
{noformat}

 JobHistoryServer (HistoryFileManager) needs more debug logs.
 

 Key: MAPREDUCE-5902
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5902
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobhistoryserver
Reporter: jay vyas
   Original Estimate: 1h
  Remaining Estimate: 1h

 With the JobHistory Server, it appears that it's possible sometimes to skip 
 over certain history files.  I haven't been able to determine why yet, but 
 I've found that some long-named .jhist files aren't getting collected into 
 the done/ directory.
 After tracing some of the actual source, and turning on DEBUG level logging, 
 it became clear that this snippet is an important workhorse 
 (scanDirectoryForIntermediateFiles and scanDirectoryForHistoryFiles 
 ultimately boil down to scanDirectory()).  
 It would be extremely useful, then, to have a couple of guarded logs at this 
 level of the code, so that we can see, in the log folders, why files are 
 being filtered out, i.e. whether it is due to filtering or visibility.
 {noformat}
   private static List<FileStatus> scanDirectory(Path path, FileContext fc,
       PathFilter pathFilter) throws IOException {
     path = fc.makeQualified(path);
     List<FileStatus> jhStatusList = new ArrayList<FileStatus>();
     RemoteIterator<FileStatus> fileStatusIter = fc.listStatus(path);
     while (fileStatusIter.hasNext()) {
       FileStatus fileStatus = fileStatusIter.next();
       Path filePath = fileStatus.getPath();
       if (fileStatus.isFile() && pathFilter.accept(filePath)) {
         jhStatusList.add(fileStatus);
       }
     }
     return jhStatusList;
   }
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5902) JobHistoryServer (HistoryFileManager) needs more debug logs, fails to pick up jobs with % characters in the name.

2014-05-23 Thread jay vyas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jay vyas updated MAPREDUCE-5902:


Summary: JobHistoryServer (HistoryFileManager) needs more debug logs, fails 
to pick up jobs with % characters in the name.  (was: JobHistoryServer 
(HistoryFileManager) needs more debug logs.)

 JobHistoryServer (HistoryFileManager) needs more debug logs, fails to pick up 
 jobs with % characters in the name.
 -

 Key: MAPREDUCE-5902
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5902
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobhistoryserver
Reporter: jay vyas
   Original Estimate: 1h
  Remaining Estimate: 1h

 With the JobHistory Server, it appears that it's possible sometimes to skip 
 over certain history files.  I haven't been able to determine why yet, but 
 I've found that some long-named .jhist files aren't getting collected into 
 the done/ directory.
 After tracing some of the actual source, and turning on DEBUG level logging, 
 it became clear that this snippet is an important workhorse 
 (scanDirectoryForIntermediateFiles and scanDirectoryForHistoryFiles 
 ultimately boil down to scanDirectory()).  
 It would be extremely useful, then, to have a couple of guarded logs at this 
 level of the code, so that we can see, in the log folders, why files are 
 being filtered out, i.e. whether it is due to filtering or visibility.
 {noformat}
   private static List<FileStatus> scanDirectory(Path path, FileContext fc,
       PathFilter pathFilter) throws IOException {
     path = fc.makeQualified(path);
     List<FileStatus> jhStatusList = new ArrayList<FileStatus>();
     RemoteIterator<FileStatus> fileStatusIter = fc.listStatus(path);
     while (fileStatusIter.hasNext()) {
       FileStatus fileStatus = fileStatusIter.next();
       Path filePath = fileStatus.getPath();
       if (fileStatus.isFile() && pathFilter.accept(filePath)) {
         jhStatusList.add(fileStatus);
       }
     }
     return jhStatusList;
   }
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5902) JobHistoryServer (HistoryFileManager) needs more debug logs.

2014-05-23 Thread jay vyas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14007959#comment-14007959
 ] 

jay vyas commented on MAPREDUCE-5902:
-

After further investigation, it appears that files with {{%}} escape 
characters in them aren't picked up by the JobHistoryServer.  I'd like the 
opinion of one of the JobHistoryServer authors to confirm/deny whether job 
names are indeed allowed to include {{%}} signs in them, i.e. {{name%-myName}}.

Has anyone else seen this before?  I'd be somewhat surprised if I were the 
only person who has run into it... I can't imagine it's a configuration error 
of any sort?

The below files appear to be stuck in mr-history purgatory: neither are 
they detectable as completed jobs from a REST request {{ curl 
http://10.1.4.138:19888/ws/v1/history/mapreduce/jobs | python -mjson.tool }} to 
the JobHistoryServer API, *nor* are they ever moved to {{/mr-history/done/}}

{noformat}
/mr-history/tmp/tom/job_1400794299637_0010-1400808860349-tom-ParallelALSFactorizationJob%2DItemRatingVectorsMappe-1400808889684-1-1-SUCCEEDED-default.jhist
/mr-history/tmp/tom/job_1400794299637_0011-1400808893300-tom-ParallelALSFactorizationJob%2DTransposeMapper%2DReduce-1400808924396-1-1-SUCCEEDED-default.jhist
/mr-history/tmp/tom/job_1400794299637_0012-1400808926898-tom-ParallelALSFactorizationJob%2DAverageRatingMapper%2DRe-1400808951099-1-1-SUCCEEDED-default.jhist
/mr-history/tmp/tom/job_1400794299637_0017-1400814057680-tom-ParallelALSFactorizationJob%2DItemRatingVectorsMappe-1400814090466-1-1-SUCCEEDED-default.jhist
/mr-history/tmp/tom/job_1400873461827_0016-140087454-tom-select+count%28*%29+from+bps_cleaned%28Stage%2D1%29-1400874621636-1-1-SUCCEEDED-default.jhist
/mr-history/tmp/tom/job_1400873461827_0023-1400894507822-tom-name%252dname-1400894528285-1-1-SUCCEEDED-default.jhist
{noformat}
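For what it's worth, a hedged sketch of how these names decode (plain Java; 
URLDecoder is only used to illustrate the %2D-to-dash round trip, not to 
claim that's what the JHS itself does):

{code}
import java.net.URLDecoder;

public class DecodeJhistName {
  public static void main(String[] args) throws Exception {
    // %2D is the percent-encoding of '-': one decode pass restores it.
    System.out.println(URLDecoder.decode(
        "ParallelALSFactorizationJob%2DItemRatingVectorsMappe", "UTF-8"));
    // A doubly-encoded name like name%252dname needs two passes.
    System.out.println(URLDecoder.decode(
        URLDecoder.decode("name%252dname", "UTF-8"), "UTF-8"));
  }
}
{code}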

 JobHistoryServer (HistoryFileManager) needs more debug logs.
 

 Key: MAPREDUCE-5902
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5902
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobhistoryserver
Reporter: jay vyas
   Original Estimate: 1h
  Remaining Estimate: 1h

 With the JobHistory Server, it appears that it's possible sometimes to skip 
 over certain history files.  I haven't been able to determine why yet, but 
 I've found that some long-named .jhist files aren't getting collected into 
 the done/ directory.
 After tracing some of the actual source, and turning on DEBUG level logging, 
 it became clear that this snippet is an important workhorse 
 (scanDirectoryForIntermediateFiles and scanDirectoryForHistoryFiles 
 ultimately boil down to scanDirectory()).  
 It would be extremely useful, then, to have a couple of guarded logs at this 
 level of the code, so that we can see, in the log folders, why files are 
 being filtered out, i.e. whether it is due to filtering or visibility.
 {noformat}
   private static List<FileStatus> scanDirectory(Path path, FileContext fc,
       PathFilter pathFilter) throws IOException {
     path = fc.makeQualified(path);
     List<FileStatus> jhStatusList = new ArrayList<FileStatus>();
     RemoteIterator<FileStatus> fileStatusIter = fc.listStatus(path);
     while (fileStatusIter.hasNext()) {
       FileStatus fileStatus = fileStatusIter.next();
       Path filePath = fileStatus.getPath();
       if (fileStatus.isFile() && pathFilter.accept(filePath)) {
         jhStatusList.add(fileStatus);
       }
     }
     return jhStatusList;
   }
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5902) JobHistoryServer (HistoryFileManager) needs more debug logs, fails to pick up jobs with % characters in the name.

2014-05-23 Thread jay vyas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jay vyas updated MAPREDUCE-5902:


Description: 
1) JobHistoryServer sometimes skips over certain history files, and never 
serves them as completed.

2) In addition to skipping these files, the JobHistoryServer doesn't 
effectively log which files are being skipped, or why.

So in addition to determining why certain types of files are skipped (file name 
length doesn't appear to be the reason; rather, it appears that % 
characters throw the JobHistoryServer filter off), we should log completed 
.jhist files which are available in the mr-history/tmp directory yet are 
skipped for some reason. 

** Regarding the actual bug : Skipping completed jhist files ** 

We will need an author of the JobHistoryServer, I think, to chime in on what 
types of paths for jobs are actually valid.  It appears that at least some 
characters, if present in a job name, will make the JobHistoryServer skip 
recognition of a completed jhist file.

** Regarding logging **
It would be extremely useful, then, to have a couple of guarded logs at this 
level of the code, so that we can see, in the log folders, why files are being 
filtered out, i.e. whether it is due to filtering or visibility.

{noformat}

  private static List<FileStatus> scanDirectory(Path path, FileContext fc,
      PathFilter pathFilter) throws IOException {
    path = fc.makeQualified(path);
    List<FileStatus> jhStatusList = new ArrayList<FileStatus>();
    RemoteIterator<FileStatus> fileStatusIter = fc.listStatus(path);
    while (fileStatusIter.hasNext()) {
      FileStatus fileStatus = fileStatusIter.next();
      Path filePath = fileStatus.getPath();
      if (fileStatus.isFile() && pathFilter.accept(filePath)) {
        jhStatusList.add(fileStatus);
      }
    }
    return jhStatusList;
  }

{noformat}

** Reproducing ** 

I was able to reproduce this bug by writing a custom mapreduce job with a job 
name which contained % characters.  I have also seen this with a version of 
the Mahout ParallelALSFactorizationJob, which includes - characters in its 
name that wind up getting replaced by %2D later on at some stage in the job 
pipeline.


  was:
With the JobHistory Server, it appears that it's possible sometimes to skip 
over certain history files.  I haven't been able to determine why yet, but I've 
found that some long-named .jhist files aren't getting collected into the done/ 
directory.

After tracing some of the actual source, and turning on DEBUG level logging, it 
became clear that this snippet is an important workhorse 
(scanDirectoryForIntermediateFiles and scanDirectoryForHistoryFiles ultimately 
boil down to scanDirectory()).  

It would be extremely useful, then, to have a couple of guarded logs at this 
level of the code, so that we can see, in the log folders, why files are being 
filtered out, i.e. whether it is due to filtering or visibility.

{noformat}

  private static List<FileStatus> scanDirectory(Path path, FileContext fc,
      PathFilter pathFilter) throws IOException {
    path = fc.makeQualified(path);
    List<FileStatus> jhStatusList = new ArrayList<FileStatus>();
    RemoteIterator<FileStatus> fileStatusIter = fc.listStatus(path);
    while (fileStatusIter.hasNext()) {
      FileStatus fileStatus = fileStatusIter.next();
      Path filePath = fileStatus.getPath();
      if (fileStatus.isFile() && pathFilter.accept(filePath)) {
        jhStatusList.add(fileStatus);
      }
    }
    return jhStatusList;
  }

{noformat}




 JobHistoryServer (HistoryFileManager) needs more debug logs, fails to pick up 
 jobs with % characters in the name.
 -

 Key: MAPREDUCE-5902
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5902
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobhistoryserver
Reporter: jay vyas
   Original Estimate: 1h
  Remaining Estimate: 1h

 1) JobHistoryServer sometimes skips over certain history files, and never 
 serves them as completed.
 2) In addition to skipping these files, the JobHistoryServer doesn't 
 effectively log which files are being skipped, or why.
 So in addition to determining why certain types of files are skipped (file 
 name length doesn't appear to be the reason; rather, it appears that % 
 characters throw the JobHistoryServer filter off), we should log completed 
 .jhist files which are available in the mr-history/tmp directory yet are 
 skipped for some reason. 
 ** Regarding the actual bug : Skipping completed jhist files ** 
 We will need an author of the JobHistoryServer, I think, to chime in on what 
 types of paths for jobs are actually valid.  It appears that at least some 
 characters, if present in a job name, will make the 

[jira] [Updated] (MAPREDUCE-5902) JobHistoryServer (HistoryFileManager) needs more debug logs.

2014-05-22 Thread jay vyas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jay vyas updated MAPREDUCE-5902:


Summary: JobHistoryServer (HistoryFileManager) needs more debug logs.  
(was: JobHistoryServer needs more debug logs.)

 JobHistoryServer (HistoryFileManager) needs more debug logs.
 

 Key: MAPREDUCE-5902
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5902
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobhistoryserver
Reporter: jay vyas
   Original Estimate: 1h
  Remaining Estimate: 1h

 With the JobHistory Server, it appears that it's possible sometimes to skip 
 over certain history files.  I haven't been able to determine why yet, but 
 I've found that some long-named .jhist files aren't getting collected into 
 the done/ directory.
 After tracing some of the actual source, and turning on DEBUG level logging, 
 it became clear that this snippet is an important workhorse 
 (scanDirectoryForIntermediateFiles and scanDirectoryForHistoryFiles 
 ultimately boil down to scanDirectory()).  
 It would be extremely useful, then, to have a couple of guarded logs at this 
 level of the code, so that we can see, in the log folders, why files are 
 being filtered out, i.e. whether it is due to filtering or visibility.
 {noformat}
   private static List<FileStatus> scanDirectory(Path path, FileContext fc,
       PathFilter pathFilter) throws IOException {
     path = fc.makeQualified(path);
     List<FileStatus> jhStatusList = new ArrayList<FileStatus>();
     RemoteIterator<FileStatus> fileStatusIter = fc.listStatus(path);
     while (fileStatusIter.hasNext()) {
       FileStatus fileStatus = fileStatusIter.next();
       Path filePath = fileStatus.getPath();
       if (fileStatus.isFile() && pathFilter.accept(filePath)) {
         jhStatusList.add(fileStatus);
       }
     }
     return jhStatusList;
   }
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MAPREDUCE-5902) JobHistoryServer needs more debug logs.

2014-05-22 Thread jay vyas (JIRA)
jay vyas created MAPREDUCE-5902:
---

 Summary: JobHistoryServer needs more debug logs.
 Key: MAPREDUCE-5902
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5902
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobhistoryserver
Reporter: jay vyas


With the JobHistory Server, it appears that it's possible sometimes to skip 
over certain history files.  I haven't been able to determine why yet, but I've 
found that some long-named .jhist files aren't getting collected into the done/ 
directory.

After tracing some of the actual source, and turning on DEBUG level logging, it 
became clear that this snippet is an important workhorse 
(scanDirectoryForIntermediateFiles and scanDirectoryForHistoryFiles ultimately 
boil down to scanDirectory()).  

It would be extremely useful, then, to have a couple of guarded logs at this 
level of the code, so that we can see, in the log folders, why files are being 
filtered out, i.e. whether it is due to filtering or visibility.

{noformat}

  private static List<FileStatus> scanDirectory(Path path, FileContext fc,
      PathFilter pathFilter) throws IOException {
    path = fc.makeQualified(path);
    List<FileStatus> jhStatusList = new ArrayList<FileStatus>();
    RemoteIterator<FileStatus> fileStatusIter = fc.listStatus(path);
    while (fileStatusIter.hasNext()) {
      FileStatus fileStatus = fileStatusIter.next();
      Path filePath = fileStatus.getPath();
      if (fileStatus.isFile() && pathFilter.accept(filePath)) {
        jhStatusList.add(fileStatus);
      }
    }
    return jhStatusList;
  }

{noformat}





--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (MAPREDUCE-5902) JobHistoryServer (HistoryFileManager) needs more debug logs.

2014-05-22 Thread jay vyas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14006775#comment-14006775
 ] 

jay vyas commented on MAPREDUCE-5902:
-

FYI, I categorized this as a bug because, without debug logs, it is impossible 
to trace certain issues which occur during file collection into the done/ 
directory, and it is probably an implicit requirement that we be able to 
know why certain files are excluded from collection.

 JobHistoryServer (HistoryFileManager) needs more debug logs.
 

 Key: MAPREDUCE-5902
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5902
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: jobhistoryserver
Reporter: jay vyas
   Original Estimate: 1h
  Remaining Estimate: 1h

 With the JobHistory Server, it appears that it's possible sometimes to skip 
 over certain history files.  I haven't been able to determine why yet, but 
 I've found that some long-named .jhist files aren't getting collected into 
 the done/ directory.
 After tracing some of the actual source, and turning on DEBUG level logging, 
 it became clear that this snippet is an important workhorse 
 (scanDirectoryForIntermediateFiles and scanDirectoryForHistoryFiles 
 ultimately boil down to scanDirectory()).  
 It would be extremely useful, then, to have a couple of guarded logs at this 
 level of the code, so that we can see, in the log folders, why files are 
 being filtered out, i.e. whether it is due to filtering or visibility.
 {noformat}
   private static List<FileStatus> scanDirectory(Path path, FileContext fc,
       PathFilter pathFilter) throws IOException {
     path = fc.makeQualified(path);
     List<FileStatus> jhStatusList = new ArrayList<FileStatus>();
     RemoteIterator<FileStatus> fileStatusIter = fc.listStatus(path);
     while (fileStatusIter.hasNext()) {
       FileStatus fileStatus = fileStatusIter.next();
       Path filePath = fileStatus.getPath();
       if (fileStatus.isFile() && pathFilter.accept(filePath)) {
         jhStatusList.add(fileStatus);
       }
     }
     return jhStatusList;
   }
 {noformat}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (MAPREDUCE-5894) Make critical YARN properties first class citizens in the build.

2014-05-17 Thread jay vyas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jay vyas updated MAPREDUCE-5894:


Description: 
We recently found, when deploying hadoop 2.2 with hadoop 2.0 values, that 
{noformat} mapreduce_shuffle {noformat} changed to {noformat} 
mapreduce.shuffle {noformat}.

There are likewise many similar examples of parameters which become deprecated 
over time.   See 
http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/DeprecatedProperties.html

I suggest we:

1) Have a list of all mandatory *current* parameters stored in the code, and 
also, 

2) a list of deprecated ones. 

3) Then, have the build *automatically fail* if a parameter in the mandatory 
list is NOT accessed.  This would (a) ensure that unit testing of 
parameters does not regress, and (b) force all updates to the code which change 
a parameter name to also include an update to the deprecated parameter list 
before the build passes.
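
A hedged sketch of the kind of check item 3 could build on, using the 
deprecation table that org.apache.hadoop.conf.Configuration already maintains 
(the property names below are only examples):

{code}
import org.apache.hadoop.conf.Configuration;

// Hedged sketch: a build/startup-time sanity check over a hand-maintained
// list of mandatory property names (the names below are only examples).
public class MandatoryPropertyCheck {
  static final String[] MANDATORY = {
    "yarn.nodemanager.aux-services",
    "mapreduce.framework.name",
  };

  public static void main(String[] args) {
    for (String key : MANDATORY) {
      // Configuration keeps a static table of deprecated keys, so a renamed
      // parameter that was never added to the deprecation table shows up here.
      if (Configuration.isDeprecated(key)) {
        throw new IllegalStateException("Mandatory key is deprecated: " + key);
      }
    }
    System.out.println("All mandatory keys are current.");
  }
}
{code}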

  was:
We recently found, when deploying hadoop 2.2 with hadoop 2.0 values, that 
{noformat} mapreduce_shuffle {noformat} changed to {noformat} 
mapreduce.shuffle {noformat}.

There are likewise many similar examples of parameters which become deprecated 
over time.   See 
http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/DeprecatedProperties.html

I suggest we:

1) Move the *set of parameters which are deprecated* over time into a java 
class which ships directly with the code, maybe even as a static list inside of 
Configuration() itself, with *optional extended parameters read from a 
configurable parameter*, so that ecosystem users (i.e. Hbase, or 
alternative file systems) can add their own deprecation info.

2) Have this list *checked on yarn daemon startup*, so that unused parameters 
which are *obviously artifacts* are flagged immediately by the daemon failing 
fast.

3) Have a list of all mandatory *current* parameters stored in the code, and 
also a list of deprecated ones. Then, have the build *automatically fail* if a 
parameter in the mandatory list is NOT accessed.  This would (a) ensure 
that unit testing of parameters does not regress, and (b) force all updates to 
the code which change a parameter name to also include an update to the 
deprecated parameter list before the build passes.


 Make critical YARN properties first class citizens in the build.
 

 Key: MAPREDUCE-5894
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5894
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: jay vyas

 We recently found, when deploying hadoop 2.2 with hadoop 2.0 values, that 
 {noformat} mapreduce_shuffle {noformat} changed to {noformat} 
 mapreduce.shuffle {noformat}.
 There are likewise many similar examples of parameters which become 
 deprecated over time.   See 
 http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/DeprecatedProperties.html
 I suggest we:
 1) Have a list of all mandatory *current* parameters stored in the code, and 
 also, 
 2) a list of deprecated ones. 
 3) Then, have the build *automatically fail* if a parameter in the mandatory 
 list is NOT accessed.  This would (a) ensure that unit testing of 
 parameters does not regress, and (b) force all updates to the code which 
 change a parameter name to also include an update to the deprecated parameter 
 list before the build passes.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (MAPREDUCE-5572) Provide alternative logic for getPos() implementation in custom RecordReader

2013-10-07 Thread jay vyas (JIRA)
jay vyas created MAPREDUCE-5572:
---

 Summary: Provide alternative logic for getPos() implementation in 
custom RecordReader
 Key: MAPREDUCE-5572
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5572
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: examples
Affects Versions: 1.2.1, 1.2.0, 1.1.1, 1.1.0, 1.1.3, 1.2.2
Reporter: jay vyas
Priority: Minor


The custom RecordReader class defines its getPos() as follows:

long currentOffset = currentStream == null ? 0 : currentStream.getPos();
...

This is meant to prevent errors when the underlying stream is null, but it 
isn't guaranteed to work: the RawLocalFileSystem, for example, currently will 
close the underlying file stream once it is consumed, and the currentStream 
will thus throw a NullPointerException when trying to access the null stream.

This is only seen when running in the context where the MapTask class, 
which is only relevant in the mapred.* API, calls getPos() twice in tandem, 
before and after reading a record.

This custom record reader should be guarded, or else eliminated, since it 
assumes something which is not in the FileSystem contract: that getPos() will 
always return an integral value.
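
A hedged sketch of the kind of guard argued for above (field names mirror the 
snippet, but the class itself is made up for illustration):

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

// Illustrative fragment: cache the last offset so getPos() stays valid
// even after the inner stream has been consumed and nulled out.
class GuardedPosReader {
  private FSDataInputStream currentStream; // may become null mid-job
  private long lastKnownOffset = 0;

  public long getPos() throws IOException {
    if (currentStream != null) {
      lastKnownOffset = currentStream.getPos();
    }
    return lastKnownOffset;
  }
}
{code}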



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Updated] (MAPREDUCE-5572) Provide alternative logic for getPos() implementation in custom RecordReader of mapred implementation of MultiFileWordCount

2013-10-07 Thread jay vyas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jay vyas updated MAPREDUCE-5572:


Description: 
The custom RecordReader class in MultiFileWordCount (MultiFileLineRecordReader) 
has been replaced in newer examples with a better implementation which uses the 
CombineFileInputFormat, which doesn't feature this bug.  However, this bug 
nevertheless still exists in 1.x versions of the MultiFileWordCount which rely 
on the mapred API.


The older MultiFileWordCount implementation defines the getPos() as follows:

long currentOffset = currentStream == null ? 0 : currentStream.getPos();
...

This is meant to prevent errors when underlying stream is null. But it doesn't 
gaurantee to work: The RawLocalFileSystem, for example, currectly will close 
the underlying file stream once it is consumed, and the currentStream will thus 
throw a NullPointerException when trying to access the null stream.

This is only seen when running this in the context where the MapTask class, 
which is only relevant in mapred.* API, calls getPos() twice in tandem, before 
and after reading a record.

This custom record reader should be gaurded, or else eliminated, since it 
assumes something which is not in the FileSystem contract:  That a getPos will 
always return a integral value.



  was:
The custom RecordReader class defines its getPos() as follows:

long currentOffset = currentStream == null ? 0 : currentStream.getPos();
...

This is meant to prevent errors when the underlying stream is null, but it 
isn't guaranteed to work: the RawLocalFileSystem, for example, currently will 
close the underlying file stream once it is consumed, and the currentStream 
will thus throw a NullPointerException when trying to access the null stream.

This is only seen when running in the context where the MapTask class, 
which is only relevant in the mapred.* API, calls getPos() twice in tandem, 
before and after reading a record.

This custom record reader should be guarded, or else eliminated, since it 
assumes something which is not in the FileSystem contract: that getPos() will 
always return an integral value.

Summary: Provide alternative logic for getPos() implementation in 
custom RecordReader of mapred implementation of MultiFileWordCount  (was: 
Provide alternative logic for getPos() implementation in custom RecordReader)

 Provide alternative logic for getPos() implementation in custom RecordReader 
 of mapred implementation of MultiFileWordCount
 ---

 Key: MAPREDUCE-5572
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5572
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: examples
Affects Versions: 1.1.0, 1.1.1, 1.2.0, 1.1.3, 1.2.1, 1.2.2
Reporter: jay vyas
Priority: Minor

 The custom RecordReader class in MultiFileWordCount 
 (MultiFileLineRecordReader) has been replaced in newer examples with a better 
 implementation which uses CombineFileInputFormat and doesn't feature 
 this bug.  However, this bug nevertheless still exists in 1.x versions of 
 MultiFileWordCount, which rely on the mapred API.
 The older MultiFileWordCount implementation defines its getPos() as follows:
 long currentOffset = currentStream == null ? 0 : currentStream.getPos();
 ...
 This is meant to prevent errors when the underlying stream is null, but it 
 isn't guaranteed to work: the RawLocalFileSystem, for example, currently 
 will close the underlying file stream once it is consumed, and the 
 currentStream will thus throw a NullPointerException when trying to access 
 the null stream.
 This is only seen when running in the context where the MapTask class, 
 which is only relevant in the mapred.* API, calls getPos() twice in tandem, 
 before and after reading a record.
 This custom record reader should be guarded, or else eliminated, since it 
 assumes something which is not in the FileSystem contract: that getPos() 
 will always return an integral value.



--
This message was sent by Atlassian JIRA
(v6.1#6144)


[jira] [Created] (MAPREDUCE-5511) Multifilewc and the mapred.* API: Is the use of getPos() valid?

2013-09-16 Thread jay vyas (JIRA)
jay vyas created MAPREDUCE-5511:
---

 Summary: Multifilewc and the mapred.* API:  Is the use of getPos() 
valid?
 Key: MAPREDUCE-5511
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5511
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: examples
Reporter: jay vyas
Priority: Minor


The MultiFileWordCount class in the hadoop examples libraries uses a record 
reader which switches between files.  This behaviour can cause the 
RawLocalFileSystem to break in a concurrent environment because of the way 
buffering works (in RawLocalFileSystem, switching between streams results in a 
temporarily null inner stream, and that inner stream is called by the 
getPos() implementation in the custom RecordReader for MultiFileWordCount). 

There are basically 2 ways to handle this:

1) Wrap the getPos() implementation in the object returned by open() in the 
RawLocalFileSystem to cache the value of getPos() every time it is called, so 
that calls to getPos() can return a valid long even if the underlying stream 
is null, OR

2) Update the RecordReader in multifilewc to not rely on the inner input 
stream, and cache the position / return 0 if the stream cannot return a valid 
value. 

The final question here is: is the RecordReader for MultiFileWordCount doing 
the right thing?  Or is it breaking the contract of getPos()... and really... 
what SHOULD getPos() return if the underlying stream has already been consumed? 
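
A hedged sketch of option 1, a caching wrapper around the stream returned by 
open() (the class name is made up; only FSDataInputStream.getPos() is real):

{code}
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

// Illustrative wrapper: remember the last good offset so callers still
// get a valid long after the inner stream has been consumed.
class PosCachingStream {
  private final FSDataInputStream inner;
  private long cachedPos = 0;

  PosCachingStream(FSDataInputStream inner) {
    this.inner = inner;
  }

  public long getPos() {
    try {
      cachedPos = inner.getPos();
    } catch (IOException e) {
      // Stream error: fall back to the cached position.
    } catch (NullPointerException e) {
      // Inner stream already closed/consumed: fall back likewise.
    }
    return cachedPos;
  }
}
{code}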



--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (MAPREDUCE-5511) Multifilewc and the mapred.* API: Is the use of getPos() valid?

2013-09-16 Thread jay vyas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jay vyas updated MAPREDUCE-5511:


Affects Version/s: 1.0.0
   1.2.0

 Multifilewc and the mapred.* API:  Is the use of getPos() valid?
 

 Key: MAPREDUCE-5511
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5511
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: examples
Affects Versions: 1.0.0, 1.2.0
Reporter: jay vyas
Priority: Minor

 The MultiFileWordCount class in the hadoop examples libraries uses a record 
 reader which switches between files.  This behaviour can cause the 
 RawLocalFileSystem to break in a concurrent environment because of the way 
 buffering works (in RawLocalFileSystem, switching between streams results in 
 a temporarily null inner stream, and that inner stream is called by the 
 getPos() implementation in the custom RecordReader for MultiFileWordCount). 
 There are basically 2 ways to handle this:
 1) Wrap the getPos() implementation in the object returned by open() in the 
 RawLocalFileSystem to cache the value of getPos() every time it is called, so 
 that calls to getPos() can return a valid long even if the underlying stream 
 is null, OR
 2) Update the RecordReader in multifilewc to not rely on the inner input 
 stream, and cache the position / return 0 if the stream cannot return a valid 
 value. 
 The final question here is: is the RecordReader for MultiFileWordCount doing 
 the right thing?  Or is it breaking the contract of getPos()... and 
 really... what SHOULD getPos() return if the underlying stream has already 
 been consumed? 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAPREDUCE-5511) Multifilewc and the mapred.* API: Is the use of getPos() valid?

2013-09-16 Thread jay vyas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13768433#comment-13768433
 ] 

jay vyas commented on MAPREDUCE-5511:
-

Another note: the newer implementations of multifilewordcount in mapreduce.* 
that don't provide a RecordReader.getPos() implementation don't have this 
problem.   

So this really is also related to support for the multifilewordcount class.  

With new filesystem implementations which mapreduce can work on top of, it is 
important to define the expected semantics of getPos() for FSInputStreams.



 Multifilewc and the mapred.* API:  Is the use of getPos() valid?
 

 Key: MAPREDUCE-5511
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5511
 Project: Hadoop Map/Reduce
  Issue Type: Bug
  Components: examples
Reporter: jay vyas
Priority: Minor

 The MultiFileWordCount class in the hadoop examples libraries uses a record 
 reader which switches between files.  This behaviour can cause the 
 RawLocalFileSystem to break in a concurrent environment because of the way 
 buffering works (in RawLocalFileSystem, switching between streams results in 
 a temporarily null inner stream, and that inner stream is called by the 
 getPos() implementation in the custom RecordReader for MultiFileWordCount). 
 There are basically 2 ways to handle this:
 1) Wrap the getPos() implementation in the object returned by open() in the 
 RawLocalFileSystem to cache the value of getPos() every time it is called, so 
 that calls to getPos() can return a valid long even if the underlying stream 
 is null, OR
 2) Update the RecordReader in multifilewc to not rely on the inner input 
 stream, and cache the position / return 0 if the stream cannot return a valid 
 value. 
 The final question here is: is the RecordReader for MultiFileWordCount doing 
 the right thing?  Or is it breaking the contract of getPos()... and 
 really... what SHOULD getPos() return if the underlying stream has already 
 been consumed? 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Created] (MAPREDUCE-5165) Create MiniMRCluster version which uses the mapreduce package.

2013-04-18 Thread jay vyas (JIRA)
jay vyas created MAPREDUCE-5165:
---

 Summary: Create MiniMRCluster version which uses the mapreduce 
package.
 Key: MAPREDUCE-5165
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5165
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: jay vyas
Priority: Minor


The MiniMapRedCluster class references some older mapred.* classes.  

It could be recreated in the mapreduce package to use the Configuration class 
instead of JobConf, which would make it simpler to use and integrate with new 
FS implementations and test harnesses that use new Configuration (not JobConf) 
objects to drive tests.

This could be done many ways:

1) using inheritance, or else 
2) by copying the code directly

The appropriate implementation depends on the answers to:

1) Is it okay for mapreduce.* classes to depend on mapred.* classes?
2) Is the mapred MiniMRCluster implementation going to be deprecated or 
eliminated at some point? 
3) What is the future of the JobConf class, which has been deprecated and then 
undeprecated?
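
For reference, a hedged sketch of driving the mapreduce-package 
MiniMRYarnCluster from a plain Configuration (the test name and printed 
property key are illustrative):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.v2.MiniMRYarnCluster;

public class MiniClusterSketch {
  public static void main(String[] args) throws Exception {
    // Drive the mini cluster with a plain Configuration; no JobConf needed.
    MiniMRYarnCluster cluster = new MiniMRYarnCluster("mini-cluster-sketch");
    cluster.init(new Configuration());
    cluster.start();
    // The cluster's own config carries the addresses test jobs should use.
    Configuration conf = cluster.getConfig();
    System.out.println(conf.get("yarn.resourcemanager.address"));
    cluster.stop();
  }
}
{code}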

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (MAPREDUCE-5165) Create MiniMRCluster version which uses the mapreduce package.

2013-04-18 Thread jay vyas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635378#comment-13635378
 ] 

jay vyas commented on MAPREDUCE-5165:
-

I have just found that this rehashes content from another JIRA, 
https://issues.apache.org/jira/browse/MAPREDUCE-3169... and there is, in fact, 
a MiniMRYarnCluster in the mapreduce package.

So.. then... Is MiniMRCluster an artifact of the MR1 days that will be less 
used once MR2 takes over? 

Is MiniMRYarnCluster a generic version of MiniMRCluster which will one day 
obviate the implementation-specific MiniMRCluster altogether for the 
Hadoop-MapReduce jobs that are implemented in the MR2 YARN framework? 

 Create MiniMRCluster version which uses the mapreduce package.
 --

 Key: MAPREDUCE-5165
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5165
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Reporter: jay vyas
Priority: Minor

 The MiniMapRedCluster class references some older mapred.* classes.  
 It could be recreated in the mapreduce package to use the Configuration class 
 instead of JobConf, which would make it simpler to use and integrate with new 
 FS implementations and test harnesses that use new Configuration (not 
 JobConf) objects to drive tests.
 This could be done many ways:
 1) using inheritance, or else 
 2) by copying the code directly
 The appropriate implementation depends on the answers to:
 1) Is it okay for mapreduce.* classes to depend on mapred.* classes?
 2) Is the mapred MiniMRCluster implementation going to be deprecated or 
 eliminated at some point? 
 3) What is the future of the JobConf class, which has been deprecated and 
 then undeprecated?



[jira] [Updated] (MAPREDUCE-5165) Create MiniMRCluster version which uses the mapreduce package.

2013-04-18 Thread jay vyas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jay vyas updated MAPREDUCE-5165:


Description: 
The MiniMRCluster class references some older mapred.* classes (as per the 
comments below, however, there is the MiniMRYarnCluster, which may aim to 
replace it). 

It could be recreated in the mapreduce package to use the Configuration class 
instead of JobConf, which would make it simpler to use and to integrate with 
new FS implementations and test harnesses that drive tests with Configuration 
(not JobConf) objects.

This could be done in one of two ways:

1) by inheritance, or 
2) by copying the code directly.

The appropriate choice depends on the answers to a few questions:

1) Is it okay for mapreduce.* classes to depend on mapred.* classes?
2) Is the mapred MiniMRCluster implementation going to be deprecated or 
eliminated anytime soon?
3) What is the future of the JobConf class, which has been deprecated and then 
undeprecated?

Note that this is all intimately linked to the role that MiniMRYarnCluster 
will play.  Relevant classes:

.//hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/MiniMRClientCluster.java

.//hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/MiniMRClientClusterFactory.java

.//hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/MiniMRCluster.java

.//hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapred/MiniMRYarnClusterAdapter.java

.//hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-jobclient/src/test/java/org/apache/hadoop/mapreduce/v2/MiniMRYarnCluster.java


  was:
The MiniMRCluster class references some older mapred.* classes.  

It could be recreated in the mapreduce package to use the Configuration class 
instead of JobConf, which would make it simpler to use and to integrate with 
new FS implementations and test harnesses that drive tests with Configuration 
(not JobConf) objects.

This could be done in one of two ways:

1) by inheritance, or 
2) by copying the code directly.

The appropriate choice depends on the answers to a few questions:

1) Is it okay for mapreduce.* classes to depend on mapred.* classes?
2) Is the mapred MiniMRCluster implementation going to be deprecated or 
eliminated anytime soon?
3) What is the future of the JobConf class, which has been deprecated and then 
undeprecated?




[jira] [Commented] (MAPREDUCE-5165) Create MiniMRCluster version which uses the mapreduce package.

2013-04-18 Thread jay vyas (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-5165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13635448#comment-13635448
 ] 

jay vyas commented on MAPREDUCE-5165:
-

+1 to close, but the deprecation story is somewhat tricky, and I'm not sure 
how to improve it.  Maybe a wiki update to 
http://wiki.apache.org/hadoop/HowToContribute explaining the changes to 
MiniMRCluster would be in order here. 

