[jira] [Updated] (MAPREDUCE-6840) Distcp to support cutoff time

2017-02-02 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-6840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-6840:
--
Attachment: MAPREDUCE-6840.1.patch

> Distcp to support cutoff time
> -
>
> Key: MAPREDUCE-6840
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6840
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: distcp
>Affects Versions: 2.6.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
>Priority: Minor
> Attachments: MAPREDUCE-6840.1.patch
>
>
> To ensure consistency in the datasets on HDFS, some projects, such as file 
> formats used by Hive, perform HDFS operations in a particular order. For 
> example, if a file format uses an index file, a new version of the index file 
> is only written to HDFS after all files mentioned by the index have been 
> written to HDFS.
> When we do distcp, it's important to preserve that consistency, so that we 
> don't break those file formats.
> A typical solution is to create an HDFS Snapshot beforehand and distcp only 
> the Snapshot. That works well if the user has the superuser privilege needed 
> to make the directory snapshottable.
> If not, it would be beneficial to have a cutoff time for distcp, so that 
> distcp copies only files modified on or before that cutoff time.






[jira] [Created] (MAPREDUCE-6840) Distcp to support cutoff time

2017-02-02 Thread Zheng Shao (JIRA)
Zheng Shao created MAPREDUCE-6840:
-

 Summary: Distcp to support cutoff time
 Key: MAPREDUCE-6840
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6840
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: distcp
Affects Versions: 2.6.0
Reporter: Zheng Shao
Assignee: Zheng Shao
Priority: Minor


To ensure consistency in the datasets on HDFS, some projects, such as file 
formats used by Hive, perform HDFS operations in a particular order. For 
example, if a file format uses an index file, a new version of the index file 
is only written to HDFS after all files mentioned by the index have been 
written to HDFS.

When we do distcp, it's important to preserve that consistency, so that we 
don't break those file formats.

A typical solution is to create an HDFS Snapshot beforehand and distcp only the 
Snapshot. That works well if the user has the superuser privilege needed to 
make the directory snapshottable.

If not, it would be beneficial to have a cutoff time for distcp, so that distcp 
copies only files modified on or before that cutoff time.
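
For illustration, here is a minimal Java sketch of the filtering idea against the public Hadoop FileSystem API. The class and method names are hypothetical and are not taken from the attached patch, and no assumption is made about the patch's actual option name.

{code}
// Illustrative sketch only: keep candidate files whose modification time is
// on or before a caller-supplied cutoff. Names here are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CutoffTimeFilterSketch {
  /** Returns the files directly under dir modified on or before cutoffMillis. */
  public static List<FileStatus> filesOnOrBefore(Configuration conf, Path dir,
                                                 long cutoffMillis) throws IOException {
    FileSystem fs = dir.getFileSystem(conf);
    List<FileStatus> result = new ArrayList<FileStatus>();
    for (FileStatus stat : fs.listStatus(dir)) {
      // Files written after the cutoff (for example, an index newer than the
      // files it references) are skipped, preserving dataset consistency.
      if (!stat.isDir() && stat.getModificationTime() <= cutoffMillis) {
        result.add(stat);
      }
    }
    return result;
  }
}
{code}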







[jira] [Commented] (MAPREDUCE-6009) Map-only job with new-api runs wrong OutputCommitter when cleanup scheduled in a reduce slot

2015-07-23 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-6009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14638392#comment-14638392
 ] 

Zheng Shao commented on MAPREDUCE-6009:
---

We just hit this bug in an unpatched version of MR1.

The situation is that HCatalog submits a map-only job and expects 
OutputCommitter.commitJob to create a Hive partition. Because of this bug, the 
Hive partition was never created.

Our sanity check on the Hive table, together with the workflow retry mechanism, 
let this bug run in production for a long time (wasting compute resources). 
It's great that this is fixed.


> Map-only job with new-api runs wrong OutputCommitter when cleanup scheduled 
> in a reduce slot
> 
>
> Key: MAPREDUCE-6009
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-6009
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: client, job submission
>Affects Versions: 1.2.1
>Reporter: Gera Shegalov
>Assignee: Gera Shegalov
>Priority: Blocker
> Fix For: 1.3.0, 1.2.2
>
> Attachments: MAPREDUCE-6009.v01-branch-1.2.patch, 
> MAPREDUCE-6009.v02-branch-1.2.patch
>
>
> In branch-1, job commit is executed in a JOB_CLEANUP task that may run in 
> either a map or a reduce slot.
> In org.apache.hadoop.mapreduce.Job#setUseNewAPI there is logic that sets the 
> new-api flag only for jobs with reducers.
> {code}
> if (numReduces != 0) {
>   conf.setBooleanIfUnset("mapred.reducer.new-api",
>  conf.get(oldReduceClass) == null);
>   ...
> {code}
> Therefore, when cleanup runs in a reduce slot, ReduceTask initializes using 
> the old API and runs the incorrect default OutputCommitter instead of 
> consulting the OutputFormat.
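
As a hedged illustration of the description above, the sketch below shows one possible shape of a fix: decide the reducer new-api flag even for map-only jobs. It is not necessarily what the attached patches do, and the class name is invented for the example.

{code}
// Sketch of one possible fix shape (not necessarily the committed patch):
// decide "mapred.reducer.new-api" even when the job has zero reducers, so a
// JOB_CLEANUP attempt scheduled in a reduce slot still resolves the
// OutputCommitter through the new-api OutputFormat.
import org.apache.hadoop.mapred.JobConf;

public class NewApiFlagSketch {
  public static void setUseNewAPI(JobConf conf) {
    String oldMapperClass = "mapred.mapper.class";
    String oldReduceClass = "mapred.reducer.class";
    conf.setBooleanIfUnset("mapred.mapper.new-api",
        conf.get(oldMapperClass) == null);
    // No "if (numReduces != 0)" guard: the reducer flag is set the same way
    // for map-only jobs, keeping the cleanup task on the same API as the job.
    conf.setBooleanIfUnset("mapred.reducer.new-api",
        conf.get(oldReduceClass) == null);
  }
}
{code}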





[jira] [Assigned] (MAPREDUCE-1144) JT should not hold lock while writing user history logs to DFS

2015-03-19 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao reassigned MAPREDUCE-1144:
-

Assignee: (was: Zheng Shao)

> JT should not hold lock while writing user history logs to DFS
> --
>
> Key: MAPREDUCE-1144
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1144
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobtracker
>Affects Versions: 0.20.1
>Reporter: Todd Lipcon
> Attachments: MAPREDUCE-1144-branch-1.2.patch
>
>
> I've seen behavior a few times now where the DFS is being slow for one reason 
> or another, and the JT essentially locks up waiting on it while one thread 
> tries for a long time to write history files out. The stack trace blocking 
> everything is:
> Thread 210 (IPC Server handler 10 on 7277):
>   State: WAITING
>   Blocked count: 171424
>   Waited count: 1209604
>   Waiting on java.util.LinkedList@407dd154
>   Stack:
> java.lang.Object.wait(Native Method)
> java.lang.Object.wait(Object.java:485)
> 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.flushInternal(DFSClient.java:3122)
> 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3202)
> 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3151)
> 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:67)
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
> sun.nio.cs.StreamEncoder.implClose(StreamEncoder.java:301)
> sun.nio.cs.StreamEncoder.close(StreamEncoder.java:130)
> java.io.OutputStreamWriter.close(OutputStreamWriter.java:216)
> java.io.BufferedWriter.close(BufferedWriter.java:248)
> java.io.PrintWriter.close(PrintWriter.java:295)
> 
> org.apache.hadoop.mapred.JobHistory$JobInfo.logFinished(JobHistory.java:1349)
> 
> org.apache.hadoop.mapred.JobInProgress.jobComplete(JobInProgress.java:2167)
> 
> org.apache.hadoop.mapred.JobInProgress.completedTask(JobInProgress.java:2111)
> 
> org.apache.hadoop.mapred.JobInProgress.updateTaskStatus(JobInProgress.java:873)
> 
> org.apache.hadoop.mapred.JobTracker.updateTaskStatuses(JobTracker.java:3598)
> org.apache.hadoop.mapred.JobTracker.processHeartbeat(JobTracker.java:2792)
> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2581)
> sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
> We should try not to do external IO while holding the JT lock, and instead 
> write the data to an in-memory buffer, drop the lock, and then write.
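
A generic Java sketch of the suggested pattern follows; the class and field names are illustrative, not the JobTracker's actual code.

{code}
// Generic sketch of the pattern proposed above: serialize the log record
// under the lock, then do the slow DFS write after releasing it.
public class BufferedHistoryLogger {
  private final Object jtLock = new Object();
  private final java.io.Writer historyWriter;  // e.g. wraps an FSDataOutputStream

  public BufferedHistoryLogger(java.io.Writer historyWriter) {
    this.historyWriter = historyWriter;
  }

  public void logFinished(String jobId, long finishTime) throws java.io.IOException {
    final String record;
    synchronized (jtLock) {
      // Only cheap, in-memory work happens while the lock is held.
      record = "Job=" + jobId + " FINISH_TIME=" + finishTime + "\n";
    }
    // Slow external I/O (HDFS flush/close) happens with the lock released,
    // so heartbeats and other handlers are not blocked behind it.
    historyWriter.write(record);
    historyWriter.flush();
  }
}
{code}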





[jira] [Assigned] (MAPREDUCE-1144) JT should not hold lock while writing user history logs to DFS

2015-03-19 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao reassigned MAPREDUCE-1144:
-

Assignee: Zheng Shao

> JT should not hold lock while writing user history logs to DFS
> --
>
> Key: MAPREDUCE-1144
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1144
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobtracker
>Affects Versions: 0.20.1
>Reporter: Todd Lipcon
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1144-branch-1.2.patch
>
>
> I've seen behavior a few times now where the DFS is being slow for one reason 
> or another, and the JT essentially locks up waiting on it while one thread 
> tries for a long time to write history files out. The stack trace blocking 
> everything is:
> Thread 210 (IPC Server handler 10 on 7277):
>   State: WAITING
>   Blocked count: 171424
>   Waited count: 1209604
>   Waiting on java.util.LinkedList@407dd154
>   Stack:
> java.lang.Object.wait(Native Method)
> java.lang.Object.wait(Object.java:485)
> 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.flushInternal(DFSClient.java:3122)
> 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3202)
> 
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3151)
> 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:67)
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:106)
> sun.nio.cs.StreamEncoder.implClose(StreamEncoder.java:301)
> sun.nio.cs.StreamEncoder.close(StreamEncoder.java:130)
> java.io.OutputStreamWriter.close(OutputStreamWriter.java:216)
> java.io.BufferedWriter.close(BufferedWriter.java:248)
> java.io.PrintWriter.close(PrintWriter.java:295)
> 
> org.apache.hadoop.mapred.JobHistory$JobInfo.logFinished(JobHistory.java:1349)
> 
> org.apache.hadoop.mapred.JobInProgress.jobComplete(JobInProgress.java:2167)
> 
> org.apache.hadoop.mapred.JobInProgress.completedTask(JobInProgress.java:2111)
> 
> org.apache.hadoop.mapred.JobInProgress.updateTaskStatus(JobInProgress.java:873)
> 
> org.apache.hadoop.mapred.JobTracker.updateTaskStatuses(JobTracker.java:3598)
> org.apache.hadoop.mapred.JobTracker.processHeartbeat(JobTracker.java:2792)
> org.apache.hadoop.mapred.JobTracker.heartbeat(JobTracker.java:2581)
> sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
> We should try not to do external IO while holding the JT lock, and instead 
> write the data to an in-memory buffer, drop the lock, and then write.





[jira] Commented: (MAPREDUCE-1382) MRAsyncDiscService should tolerate missing local.dir

2010-09-08 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12907441#action_12907441
 ] 

Zheng Shao commented on MAPREDUCE-1382:
---

I believe other logic in the TaskTracker/JobTracker will fail and report an 
error in that case.


> MRAsyncDiscService should tolerate missing local.dir
> 
>
> Key: MAPREDUCE-1382
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1382
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Scott Chen
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1382.1.patch, MAPREDUCE-1382.2.patch, 
> MAPREDUCE-1382.3.patch, 
> MAPREDUCE-1382.branch-0.20.on.top.of.MAPREDUCE-1302.1.patch
>
>
> Currently, when some of the local.dir directories do not exist, 
> MRAsyncDiscService will fail. It should fail only when none of the directories work.




[jira] Updated: (MAPREDUCE-1382) MRAsyncDiscService should tolerate missing local.dir

2010-09-08 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1382:
--

Attachment: MAPREDUCE-1382.3.patch

Fixed unit test.

> MRAsyncDiscService should tolerate missing local.dir
> 
>
> Key: MAPREDUCE-1382
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1382
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Scott Chen
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1382.1.patch, MAPREDUCE-1382.2.patch, 
> MAPREDUCE-1382.3.patch, 
> MAPREDUCE-1382.branch-0.20.on.top.of.MAPREDUCE-1302.1.patch
>
>
> Currently, when some of the local.dir directories do not exist, 
> MRAsyncDiscService will fail. It should fail only when none of the directories work.




[jira] Updated: (MAPREDUCE-1382) MRAsyncDiscService should tolerate missing local.dir

2010-09-08 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1382:
--

Status: Patch Available  (was: Open)

> MRAsyncDiscService should tolerate missing local.dir
> 
>
> Key: MAPREDUCE-1382
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1382
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Scott Chen
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1382.1.patch, MAPREDUCE-1382.2.patch, 
> MAPREDUCE-1382.3.patch, 
> MAPREDUCE-1382.branch-0.20.on.top.of.MAPREDUCE-1302.1.patch
>
>
> Currently, when some of the local.dir directories do not exist, 
> MRAsyncDiscService will fail. It should fail only when none of the directories work.




[jira] Updated: (MAPREDUCE-1382) MRAsyncDiscService should tolerate missing local.dir

2010-09-08 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1382:
--

Attachment: MAPREDUCE-1382.2.patch

Todd, you are right. LocalFileSystem will throw IOE in that case.
This patch addresses Todd's concern.


> MRAsyncDiscService should tolerate missing local.dir
> 
>
> Key: MAPREDUCE-1382
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1382
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Scott Chen
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1382.1.patch, MAPREDUCE-1382.2.patch, 
> MAPREDUCE-1382.branch-0.20.on.top.of.MAPREDUCE-1302.1.patch
>
>
> Currently, when some of the local.dir directories do not exist, 
> MRAsyncDiscService will fail. It should fail only when none of the directories work.




[jira] Updated: (MAPREDUCE-1887) MRAsyncDiskService does not properly absolutize volume root paths

2010-06-24 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1887:
--

   Status: Resolved  (was: Patch Available)
 Hadoop Flags: [Reviewed]
 Release Note: MAPREDUCE-1887. MRAsyncDiskService now properly absolutizes 
volume root paths. (Aaron Kimball via zshao)
Fix Version/s: 0.22.0
   Resolution: Fixed

Committed revision 957772. Thanks Aaron!

> MRAsyncDiskService does not properly absolutize volume root paths
> -
>
> Key: MAPREDUCE-1887
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1887
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Aaron Kimball
>Assignee: Aaron Kimball
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-1887.2.patch, MAPREDUCE-1887.3.patch, 
> MAPREDUCE-1887.patch
>
>
> In MRAsyncDiskService, volume names are sometimes specified as relative 
> paths, which are not converted to absolute paths. This can cause errors of 
> the form "cannot delete <path> since it is outside of <volume root>" even 
> though the actual path is inside the root.




[jira] Commented: (MAPREDUCE-1887) MRAsyncDiskService does not properly absolutize volume root paths

2010-06-24 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882372#action_12882372
 ] 

Zheng Shao commented on MAPREDUCE-1887:
---

Can you take a look at the failed contrib tests?

> MRAsyncDiskService does not properly absolutize volume root paths
> -
>
> Key: MAPREDUCE-1887
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1887
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Aaron Kimball
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1887.2.patch, MAPREDUCE-1887.3.patch, 
> MAPREDUCE-1887.patch
>
>
> In MRAsyncDiskService, volume names are sometimes specified as relative 
> paths, which are not converted to absolute paths. This can cause errors of 
> the form "cannot delete <path> since it is outside of <volume root>" even 
> though the actual path is inside the root.




[jira] Commented: (MAPREDUCE-1887) MRAsyncDiskService does not properly absolutize volume root paths

2010-06-23 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881875#action_12881875
 ] 

Zheng Shao commented on MAPREDUCE-1887:
---

Aaron, can you take a look at the unit test failures?

> MRAsyncDiskService does not properly absolutize volume root paths
> -
>
> Key: MAPREDUCE-1887
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1887
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Aaron Kimball
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1887.2.patch, MAPREDUCE-1887.3.patch, 
> MAPREDUCE-1887.patch
>
>
> In MRAsyncDiskService, volume names are sometimes specified as relative 
> paths, which are not converted to absolute paths. This can cause errors of 
> the form "cannot delete <path> since it is outside of <volume root>" even 
> though the actual path is inside the root.




[jira] Updated: (MAPREDUCE-1887) MRAsyncDiskService does not properly absolutize volume root paths

2010-06-22 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1887:
--

Status: Open  (was: Patch Available)

> MRAsyncDiskService does not properly absolutize volume root paths
> -
>
> Key: MAPREDUCE-1887
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1887
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Aaron Kimball
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1887.2.patch, MAPREDUCE-1887.patch
>
>
> In MRAsyncDiskService, volume names are sometimes specified as relative 
> paths, which are not converted to absolute paths. This can cause errors of 
> the form "cannot delete <path> since it is outside of <volume root>" even 
> though the actual path is inside the root.




[jira] Commented: (MAPREDUCE-1887) MRAsyncDiskService does not properly absolutize volume root paths

2010-06-22 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12881397#action_12881397
 ] 

Zheng Shao commented on MAPREDUCE-1887:
---

Code looks good.

Can we change

{code}
+   * @param nonCanonicalVols The roots of the file system volumes, which may not
+   * be canonical paths.
{code}

to 

{code}
+   * @param nonCanonicalVols The roots of the file system volumes, which can be
+   * absolute paths from root or relative path from cwd.
{code}

?

I think the second one is easier to understand.


> MRAsyncDiskService does not properly absolutize volume root paths
> -
>
> Key: MAPREDUCE-1887
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1887
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Reporter: Aaron Kimball
>Assignee: Aaron Kimball
> Attachments: MAPREDUCE-1887.2.patch, MAPREDUCE-1887.patch
>
>
> In MRAsyncDiskService, volume names are sometimes specified as relative 
> paths, which are not converted to absolute paths. This can cause errors of 
> the form "cannot delete <path> since it is outside of <volume root>" even 
> though the actual path is inside the root.




[jira] Updated: (MAPREDUCE-1568) TrackerDistributedCacheManager should clean up cache in a background thread

2010-04-29 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1568:
--

  Status: Resolved  (was: Patch Available)
Hadoop Flags: [Reviewed]
Release Note: MAPREDUCE-1568. TrackerDistributedCacheManager should clean 
up cache in a background thread. (Scott Chen via zshao)
  Resolution: Fixed

Committed. Thanks Scott!

> TrackerDistributedCacheManager should clean up cache in a background thread
> ---
>
> Key: MAPREDUCE-1568
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1568
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 0.22.0
>Reporter: Scott Chen
>Assignee: Scott Chen
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-1568-v2.1.txt, MAPREDUCE-1568-v2.txt, 
> MAPREDUCE-1568-v3.1.txt, MAPREDUCE-1568-v3.txt, MAPREDUCE-1568.txt
>
>
> Right now the TrackerDistributedCacheManager does the cleanup with the 
> following code path:
> {code}
> TaskRunner.run() -> 
> TrackerDistributedCacheManager.setup() ->
> TrackerDistributedCacheManager.getLocalCache() -> 
> TrackerDistributedCacheManager.deleteCache()
> {code}
> The deletion of the cache files can take a long time and should not be done 
> by a task. We suggest that a separate thread check for and clean up the 
> cache files.
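
A minimal sketch of the proposal, with illustrative names only (the patch's actual thread and configuration names are not assumed):

{code}
// Sketch of the proposal: move deleteCache() off the task's code path onto a
// single background cleanup thread that runs periodically.
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class CacheCleanupSketch {
  private final ScheduledExecutorService cleaner =
      Executors.newSingleThreadScheduledExecutor();

  public void start(final Runnable deleteCache, long periodSeconds) {
    // TaskRunner.run() no longer pays for the deletion; the periodic thread
    // checks the cache size and deletes stale entries in the background.
    cleaner.scheduleWithFixedDelay(deleteCache, periodSeconds, periodSeconds,
                                   TimeUnit.SECONDS);
  }

  public void stop() {
    cleaner.shutdownNow();
  }
}
{code}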




[jira] Commented: (MAPREDUCE-1568) TrackerDistributedCacheManager should do deleteLocalPath asynchronously

2010-04-22 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860091#action_12860091
 ] 

Zheng Shao commented on MAPREDUCE-1568:
---

+1


> TrackerDistributedCacheManager should do deleteLocalPath asynchronously
> ---
>
> Key: MAPREDUCE-1568
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1568
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 0.22.0
>Reporter: Scott Chen
>Assignee: Scott Chen
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-1568.txt
>
>
> TrackerDistributedCacheManager.deleteCache() has been improved:
> MAPREDUCE-1302 makes TrackerDistributedCacheManager rename the caches in the 
> main thread and then delete them in the background.
> MAPREDUCE-1098 avoids global locking while doing the renaming (renaming lots 
> of directories can also take a long time).
> But deleteLocalCache is still on the main thread of TaskRunner.run(), so it 
> still slows down the task that triggers the deletion (originally this blocked 
> all tasks, but that was fixed by MAPREDUCE-1098). Other tasks do not wait for 
> the deletion, and the task that triggers the deletion should not wait for it 
> either. TrackerDistributedCacheManager should do deleteLocalPath() 
> asynchronously.




[jira] Updated: (MAPREDUCE-1649) Compressed files with TextInputFormat does not work with CombineFileInputFormat

2010-03-30 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1649:
--

Attachment: MAPREDUCE-1649.1.branch-0.20.patch

A simple fix (MAPREDUCE-1649.1.branch-0.20.patch) is to ignore the splits that 
start at a non-zero offset in {{TextInputFormat}} when the file is non-splittable.
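
A simplified sketch of that idea is below; it is not the exact diff in the patch, and the helper class is hypothetical.

{code}
// Sketch of the idea (simplified): for a compressed, non-splittable file,
// only the split that starts at offset 0 should yield records; later splits
// for the same file would otherwise re-read the whole file and duplicate data.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;

public class NonSplittableGuardSketch {
  /** Returns true if this split should produce records. */
  public static boolean shouldRead(JobConf conf, FileSplit split) {
    Path file = split.getPath();
    CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(file);
    boolean splittable = (codec == null);  // uncompressed text is splittable
    // A non-splittable file is always read in full from offset 0, so any
    // split with a non-zero start offset is ignored.
    return splittable || split.getStart() == 0;
  }
}
{code}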


> Compressed files with TextInputFormat does not work with 
> CombineFileInputFormat
> ---
>
> Key: MAPREDUCE-1649
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1649
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.2
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1649.1.branch-0.20.patch
>
>
> {{CombineFileInputFormat}} creates splits based on blocks, regardless of 
> whether the underlying {{FileInputFormat}} is splittable or not.
> This means that we can have 2 or more splits for a compressed text file with 
> {{TextInputFormat}}. For each of these splits, 
> {{TextInputFormat.getRecordReader}} will return a {{RecordReader}} for the 
> whole compressed file, thus causing duplicate input data.




[jira] Created: (MAPREDUCE-1649) Compressed files with TextInputFormat does not work with CombineFileInputFormat

2010-03-30 Thread Zheng Shao (JIRA)
Compressed files with TextInputFormat does not work with CombineFileInputFormat
---

 Key: MAPREDUCE-1649
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1649
 Project: Hadoop Map/Reduce
  Issue Type: Bug
Affects Versions: 0.20.2
Reporter: Zheng Shao
Assignee: Zheng Shao


{{CombineFileInputFormat}} creates splits based on blocks, regardless of whether 
the underlying {{FileInputFormat}} is splittable or not.

This means that we can have 2 or more splits for a compressed text file with 
{{TextInputFormat}}. For each of these splits, 
{{TextInputFormat.getRecordReader}} will return a {{RecordReader}} for the 
whole compressed file, thus causing duplicate input data.




[jira] Resolved: (MAPREDUCE-1501) FileInputFormat to support multi-level/recursive directory listing

2010-03-08 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao resolved MAPREDUCE-1501.
---

Resolution: Fixed

Will open a new one to address this issue.

> FileInputFormat to support multi-level/recursive directory listing
> --
>
> Key: MAPREDUCE-1501
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1501
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1501.1.branch-0.20.patch, 
> MAPREDUCE-1501.1.trunk.patch
>
>
> As we have seen multiple times in the mailing list, users want to have the 
> capability of getting all files out of a multi-level directory structure.
> 4/1/2008: 
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200804.mbox/%3ce75c02ef0804011433x144813e6x2450da7883de3...@mail.gmail.com%3e
> 2/3/2009: 
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200902.mbox/%3c7f80089c-3e7f-4330-90ba-6f1c5b0b0...@nist.gov%3e
> 6/2/2009: 
> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200906.mbox/%3c4a258a16.8050...@darose.net%3e
> One solution that our users had is to write a new FileInputFormat, but that 
> means all existing FileInputFormat subclasses need to be changed in order to 
> support this feature.
> We can easily provide a JobConf option (which defaults to false) to 
> {{FileInputFormat.listStatus(...)}} to recursively go into directory 
> structure.




[jira] Created: (MAPREDUCE-1577) FileInputFormat in the new mapreduce package to support multi-level/recursive directory listing

2010-03-08 Thread Zheng Shao (JIRA)
FileInputFormat in the new mapreduce package to support multi-level/recursive 
directory listing 


 Key: MAPREDUCE-1577
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1577
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Zheng Shao
Assignee: Zheng Shao


See MAPREDUCE-1501 for details.




[jira] Reopened: (MAPREDUCE-1501) FileInputFormat to support multi-level/recursive directory listing

2010-03-08 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao reopened MAPREDUCE-1501:
---


Reopened for Chris's comments.


> FileInputFormat to support multi-level/recursive directory listing
> --
>
> Key: MAPREDUCE-1501
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1501
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1501.1.branch-0.20.patch, 
> MAPREDUCE-1501.1.trunk.patch
>
>
> As we have seen multiple times in the mailing list, users want to have the 
> capability of getting all files out of a multi-level directory structure.
> 4/1/2008: 
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200804.mbox/%3ce75c02ef0804011433x144813e6x2450da7883de3...@mail.gmail.com%3e
> 2/3/2009: 
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200902.mbox/%3c7f80089c-3e7f-4330-90ba-6f1c5b0b0...@nist.gov%3e
> 6/2/2009: 
> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200906.mbox/%3c4a258a16.8050...@darose.net%3e
> One solution that our users had is to write a new FileInputFormat, but that 
> means all existing FileInputFormat subclasses need to be changed in order to 
> support this feature.
> We can easily provide a JobConf option (which defaults to false) to 
> {{FileInputFormat.listStatus(...)}} to recursively go into directory 
> structure.




[jira] Updated: (MAPREDUCE-1423) Improve performance of CombineFileInputFormat when multiple pools are configured

2010-03-03 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1423:
--

 Tags: combinefileinputformat
   Resolution: Fixed
Fix Version/s: 0.22.0
 Release Note: MAPREDUCE-1423. Improve performance of 
CombineFileInputFormat when multiple pools are configured. (Dhruba Borthakur 
via zshao)
 Hadoop Flags: [Reviewed]
   Status: Resolved  (was: Patch Available)

Committed. Thanks Dhruba!

> Improve performance of CombineFileInputFormat when multiple pools are 
> configured
> 
>
> Key: MAPREDUCE-1423
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1423
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: client
>Reporter: dhruba borthakur
>Assignee: dhruba borthakur
> Fix For: 0.22.0
>
> Attachments: CombineFileInputFormatPerformance.txt, 
> CombineFileInputFormatPerformance.txt
>
>
> I have a map-reduce job that is using CombineFileInputFormat. It has 
> configured 1 pools and 3 files. The time to create the splits takes 
> more than an hour. The reason is that CombineFileInputFormat.getSplits() 
> converts the same path from a String to a Path object multiple times, once 
> for each instance of a pool. Similarly, it calls Path.toUri() multiple times. 
> This code can be optimized.
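
As an illustration of the optimization being described, a hypothetical cache of String-to-URI conversions might look like the sketch below (not the attached patch):

{code}
// Sketch of the optimization: convert each path string to a Path/URI once and
// reuse it across pools, instead of re-parsing it for every pool in getSplits().
import org.apache.hadoop.fs.Path;

import java.net.URI;
import java.util.HashMap;
import java.util.Map;

public class PathConversionCache {
  private final Map<String, URI> cache = new HashMap<String, URI>();

  public URI toUri(String pathString) {
    URI uri = cache.get(pathString);
    if (uri == null) {
      // Parse once; every subsequent pool lookup hits the map instead.
      uri = new Path(pathString).toUri();
      cache.put(pathString, uri);
    }
    return uri;
  }
}
{code}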




[jira] Commented: (MAPREDUCE-1538) TrackerDistributedCacheManager can fail because the number of subdirectories reaches system limit

2010-02-26 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12839131#action_12839131
 ] 

Zheng Shao commented on MAPREDUCE-1538:
---

+1

> TrackerDistributedCacheManager can fail because the number of subdirectories 
> reaches system limit
> -
>
> Key: MAPREDUCE-1538
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1538
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: Scott Chen
>Assignee: Scott Chen
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-1538.patch
>
>
> TrackerDistributedCacheManager deletes the cached files when their total size 
> reaches a configured limit.
> But there is no such limit on the number of subdirectories, so the number of 
> subdirectories may grow large and exceed the system limit.
> When that happens, the TT cannot create a directory in getLocalCache and the 
> tasks fail.
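
A minimal sketch of the kind of guard being described, with illustrative names and no assumption about the patch's actual configuration keys:

{code}
// Sketch: bound the number of cached sub-directories the same way the byte
// size is bounded, and trigger cleanup before the filesystem limit is hit.
public class SubdirLimitSketch {
  private long cachedBytes = 0;
  private long cachedSubdirs = 0;

  public boolean needsCleanup(long maxBytes, long maxSubdirs) {
    // Previously only the byte size was checked; the directory-count check
    // keeps getLocalCache() from failing when it cannot create a directory.
    return cachedBytes > maxBytes || cachedSubdirs > maxSubdirs;
  }

  public void onCacheAdded(long bytes, long subdirs) {
    cachedBytes += bytes;
    cachedSubdirs += subdirs;
  }
}
{code}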




[jira] Commented: (MAPREDUCE-1221) Kill tasks on a node if the free physical memory on that machine falls below a configured threshold

2010-02-24 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838023#action_12838023
 ] 

Zheng Shao commented on MAPREDUCE-1221:
---

Arun, do the explanations from Scott and Matei make sense to you?
If it looks good to you, I would like to commit it.


> Kill tasks on a node if the free physical memory on that machine falls below 
> a configured threshold
> ---
>
> Key: MAPREDUCE-1221
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1221
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: dhruba borthakur
>Assignee: Scott Chen
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-1221-v1.patch, MAPREDUCE-1221-v2.patch, 
> MAPREDUCE-1221-v3.patch
>
>
> The TaskTracker currently supports killing tasks if the virtual memory of a 
> task exceeds a set of configured thresholds. I would like to extend this 
> feature to enable killing tasks if the physical memory used by that task 
> exceeds a certain threshold.
> On a certain operating system (guess?), if user space processes start using 
> lots of memory, the machine hangs and dies quickly. This means that we would 
> like to prevent map-reduce jobs from triggering this condition. From my 
> understanding, the killing-based-on-virtual-memory-limits (HADOOP-5883) was 
> designed to address this problem. This works well when most map-reduce jobs 
> are Java jobs and have well-defined -Xmx parameters that specify the max 
> virtual memory for each task. On the other hand, if each task forks off 
> mappers/reducers written in other languages (python/php, etc), the total 
> virtual memory usage of the process-subtree varies greatly. In these cases, 
> it is better to use kill-tasks-using-physical-memory-limits.
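
For illustration only, a Linux-specific sketch of how free physical memory could be sampled and compared against a threshold; it is not the attached patch and the names are hypothetical:

{code}
// Illustrative sketch: read free physical memory from /proc/meminfo and
// compare it to a configured threshold (Linux-specific).
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class FreeMemoryCheckSketch {
  /** Returns free physical memory in kB, or -1 if it cannot be read. */
  public static long readMemFreeKb() throws IOException {
    BufferedReader in = new BufferedReader(new FileReader("/proc/meminfo"));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        if (line.startsWith("MemFree:")) {
          // Line format: "MemFree:  123456 kB"
          return Long.parseLong(line.replaceAll("[^0-9]", ""));
        }
      }
    } finally {
      in.close();
    }
    return -1;
  }

  /** True when free memory is known and has fallen below the threshold. */
  public static boolean belowThreshold(long memFreeKb, long thresholdKb) {
    return memFreeKb >= 0 && memFreeKb < thresholdKb;
  }
}
{code}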




[jira] Commented: (MAPREDUCE-1501) FileInputFormat to support multi-level/recursive directory listing

2010-02-22 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836963#action_12836963
 ] 

Zheng Shao commented on MAPREDUCE-1501:
---

Thanks, Dhruba. I missed the part "and other hidden directories". We do call 
the PathFilter on the subdirectories as well (see addInputPathRecursively(...)). 
Is that good enough, or do we want to split the PathFilters for files from the 
PathFilters for directories?


> FileInputFormat to support multi-level/recursive directory listing
> --
>
> Key: MAPREDUCE-1501
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1501
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1501.1.branch-0.20.patch, 
> MAPREDUCE-1501.1.trunk.patch
>
>
> As we have seen multiple times in the mailing list, users want to have the 
> capability of getting all files out of a multi-level directory structure.
> 4/1/2008: 
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200804.mbox/%3ce75c02ef0804011433x144813e6x2450da7883de3...@mail.gmail.com%3e
> 2/3/2009: 
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200902.mbox/%3c7f80089c-3e7f-4330-90ba-6f1c5b0b0...@nist.gov%3e
> 6/2/2009: 
> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200906.mbox/%3c4a258a16.8050...@darose.net%3e
> One solution that our users had is to write a new FileInputFormat, but that 
> means all existing FileInputFormat subclasses need to be changed in order to 
> support this feature.
> We can easily provide a JobConf option (which defaults to false) to 
> {{FileInputFormat.listStatus(...)}} to recursively go into directory 
> structure.




[jira] Commented: (MAPREDUCE-1501) FileInputFormat to support multi-level/recursive directory listing

2010-02-22 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12836936#action_12836936
 ] 

Zheng Shao commented on MAPREDUCE-1501:
---

Thanks for the feedback, Ian.
I don't think FileSystem.listPath() returns "." or "..". If it does, I believe 
the current code in trunk will also break, and the new unit test will also fail 
in that case.


> FileInputFormat to support multi-level/recursive directory listing
> --
>
> Key: MAPREDUCE-1501
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1501
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1501.1.branch-0.20.patch, 
> MAPREDUCE-1501.1.trunk.patch
>
>
> As we have seen multiple times in the mailing list, users want to have the 
> capability of getting all files out of a multi-level directory structure.
> 4/1/2008: 
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200804.mbox/%3ce75c02ef0804011433x144813e6x2450da7883de3...@mail.gmail.com%3e
> 2/3/2009: 
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200902.mbox/%3c7f80089c-3e7f-4330-90ba-6f1c5b0b0...@nist.gov%3e
> 6/2/2009: 
> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200906.mbox/%3c4a258a16.8050...@darose.net%3e
> One solution that our users had is to write a new FileInputFormat, but that 
> means all existing FileInputFormat subclasses need to be changed in order to 
> support this feature.
> We can easily provide a JobConf option (which defaults to false) to 
> {{FileInputFormat.listStatus(...)}} to recursively go into directory 
> structure.




[jira] Updated: (MAPREDUCE-1501) FileInputFormat to support multi-level/recursive directory listing

2010-02-22 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1501:
--

Status: Patch Available  (was: Open)

There are 2 test failures, but I don't think they are related. Resubmitting the 
patch to get it tested again.

> FileInputFormat to support multi-level/recursive directory listing
> --
>
> Key: MAPREDUCE-1501
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1501
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1501.1.branch-0.20.patch, 
> MAPREDUCE-1501.1.trunk.patch
>
>
> As we have seen multiple times in the mailing list, users want to have the 
> capability of getting all files out of a multi-level directory structure.
> 4/1/2008: 
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200804.mbox/%3ce75c02ef0804011433x144813e6x2450da7883de3...@mail.gmail.com%3e
> 2/3/2009: 
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200902.mbox/%3c7f80089c-3e7f-4330-90ba-6f1c5b0b0...@nist.gov%3e
> 6/2/2009: 
> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200906.mbox/%3c4a258a16.8050...@darose.net%3e
> One solution that our users had is to write a new FileInputFormat, but that 
> means all existing FileInputFormat subclasses need to be changed in order to 
> support this feature.
> We can easily provide a JobConf option (which defaults to false) to 
> {{FileInputFormat.listStatus(...)}} to recursively go into directory 
> structure.




[jira] Updated: (MAPREDUCE-1501) FileInputFormat to support multi-level/recursive directory listing

2010-02-22 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1501:
--

Status: Open  (was: Patch Available)

> FileInputFormat to support multi-level/recursive directory listing
> --
>
> Key: MAPREDUCE-1501
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1501
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1501.1.branch-0.20.patch, 
> MAPREDUCE-1501.1.trunk.patch
>
>
> As we have seen multiple times in the mailing list, users want to have the 
> capability of getting all files out of a multi-level directory structure.
> 4/1/2008: 
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200804.mbox/%3ce75c02ef0804011433x144813e6x2450da7883de3...@mail.gmail.com%3e
> 2/3/2009: 
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200902.mbox/%3c7f80089c-3e7f-4330-90ba-6f1c5b0b0...@nist.gov%3e
> 6/2/2009: 
> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200906.mbox/%3c4a258a16.8050...@darose.net%3e
> One solution that our users had is to write a new FileInputFormat, but that 
> means all existing FileInputFormat subclasses need to be changed in order to 
> support this feature.
> We can easily provide a JobConf option (which defaults to false) to 
> {{FileInputFormat.listStatus(...)}} to recursively go into directory 
> structure.




[jira] Resolved: (MAPREDUCE-1504) SequenceFile.Reader constructor leaking resources

2010-02-21 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao resolved MAPREDUCE-1504.
---

Resolution: Fixed

Fixed in HADOOP-5476

> SequenceFile.Reader constructor leaking resources
> -
>
> Key: MAPREDUCE-1504
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1504
> Project: Hadoop Map/Reduce
>  Issue Type: New Feature
>Reporter: Zheng Shao
>
> When the {{SequenceFile.Reader}} constructor throws an {{IOException}} 
> (because the file does not conform to the {{SequenceFile}} format), we leak 
> resources: the caller never gets a reference to the reader because of the 
> {{IOException}} thrown, so it cannot close it.
> We should call {{in.close()}} inside the constructor to make sure that we 
> don't leak resources (the file descriptor, the connection to the data node, etc.).




[jira] Updated: (MAPREDUCE-1501) FileInputFormat to support multi-level/recursive directory listing

2010-02-20 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1501:
--

Attachment: MAPREDUCE-1501.1.trunk.patch

> FileInputFormat to support multi-level/recursive directory listing
> --
>
> Key: MAPREDUCE-1501
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1501
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1501.1.branch-0.20.patch, 
> MAPREDUCE-1501.1.trunk.patch
>
>
> As we have seen multiple times in the mailing list, users want to have the 
> capability of getting all files out of a multi-level directory structure.
> 4/1/2008: 
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200804.mbox/%3ce75c02ef0804011433x144813e6x2450da7883de3...@mail.gmail.com%3e
> 2/3/2009: 
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200902.mbox/%3c7f80089c-3e7f-4330-90ba-6f1c5b0b0...@nist.gov%3e
> 6/2/2009: 
> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200906.mbox/%3c4a258a16.8050...@darose.net%3e
> One solution that our users had is to write a new FileInputFormat, but that 
> means all existing FileInputFormat subclasses need to be changed in order to 
> support this feature.
> We can easily provide a JobConf option (which defaults to false) to 
> {{FileInputFormat.listStatus(...)}} to recursively go into directory 
> structure.




[jira] Updated: (MAPREDUCE-1501) FileInputFormat to support multi-level/recursive directory listing

2010-02-20 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1501:
--

Status: Patch Available  (was: Open)

> FileInputFormat to support multi-level/recursive directory listing
> --
>
> Key: MAPREDUCE-1501
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1501
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1501.1.branch-0.20.patch, 
> MAPREDUCE-1501.1.trunk.patch
>
>
> As we have seen multiple times in the mailing list, users want to have the 
> capability of getting all files out of a multi-level directory structure.
> 4/1/2008: 
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200804.mbox/%3ce75c02ef0804011433x144813e6x2450da7883de3...@mail.gmail.com%3e
> 2/3/2009: 
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200902.mbox/%3c7f80089c-3e7f-4330-90ba-6f1c5b0b0...@nist.gov%3e
> 6/2/2009: 
> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200906.mbox/%3c4a258a16.8050...@darose.net%3e
> One solution that our users had is to write a new FileInputFormat, but that 
> means all existing FileInputFormat subclasses need to be changed in order to 
> support this feature.
> We can easily provide a JobConf option (which defaults to false) to 
> {{FileInputFormat.listStatus(...)}} to recursively go into directory 
> structure.




[jira] Updated: (MAPREDUCE-1501) FileInputFormat to support multi-level/recursive directory listing

2010-02-20 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1501:
--

Attachment: MAPREDUCE-1501.1.branch-0.20.patch

> FileInputFormat to support multi-level/recursive directory listing
> --
>
> Key: MAPREDUCE-1501
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1501
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1501.1.branch-0.20.patch
>
>
> As we have seen multiple times in the mailing list, users want to have the 
> capability of getting all files out of a multi-level directory structure.
> 4/1/2008: 
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200804.mbox/%3ce75c02ef0804011433x144813e6x2450da7883de3...@mail.gmail.com%3e
> 2/3/2009: 
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200902.mbox/%3c7f80089c-3e7f-4330-90ba-6f1c5b0b0...@nist.gov%3e
> 6/2/2009: 
> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200906.mbox/%3c4a258a16.8050...@darose.net%3e
> One solution that our users had is to write a new FileInputFormat, but that 
> means all existing FileInputFormat subclasses need to be changed in order to 
> support this feature.
> We can easily provide a JobConf option (which defaults to false) to 
> {{FileInputFormat.listStatus(...)}} to recursively go into directory 
> structure.




[jira] Assigned: (MAPREDUCE-1501) FileInputFormat to support multi-level/recursive directory listing

2010-02-20 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao reassigned MAPREDUCE-1501:
-

Assignee: Zheng Shao

> FileInputFormat to support multi-level/recursive directory listing
> --
>
> Key: MAPREDUCE-1501
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1501
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Zheng Shao
>Assignee: Zheng Shao
>
> As we have seen multiple times in the mailing list, users want to have the 
> capability of getting all files out of a multi-level directory structure.
> 4/1/2008: 
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200804.mbox/%3ce75c02ef0804011433x144813e6x2450da7883de3...@mail.gmail.com%3e
> 2/3/2009: 
> http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200902.mbox/%3c7f80089c-3e7f-4330-90ba-6f1c5b0b0...@nist.gov%3e
> 6/2/2009: 
> http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200906.mbox/%3c4a258a16.8050...@darose.net%3e
> One solution that our users had is to write a new FileInputFormat, but that 
> means all existing FileInputFormat subclasses need to be changed in order to 
> support this feature.
> We can easily provide a JobConf option (which defaults to false) to 
> {{FileInputFormat.listStatus(...)}} to recursively go into directory 
> structure.




[jira] Updated: (MAPREDUCE-1221) Kill tasks on a node if the free physical memory on that machine falls below a configured threshold

2010-02-20 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1221:
--

Status: Patch Available  (was: Open)

> Kill tasks on a node if the free physical memory on that machine falls below 
> a configured threshold
> ---
>
> Key: MAPREDUCE-1221
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1221
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: dhruba borthakur
>Assignee: Scott Chen
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-1221-v1.patch, MAPREDUCE-1221-v2.patch, 
> MAPREDUCE-1221-v3.patch
>
>
> The TaskTracker currently supports killing tasks if the virtual memory of a 
> task exceeds a set of configured thresholds. I would like to extend this 
> feature to enable killing tasks if the physical memory used by that task 
> exceeds a certain threshold.
> On a certain operating system (guess?), if user space processes start using 
> lots of memory, the machine hangs and dies quickly. This means that we would 
> like to prevent map-reduce jobs from triggering this condition. From my 
> understanding, the killing-based-on-virtual-memory-limits (HADOOP-5883) was 
> designed to address this problem. This works well when most map-reduce jobs 
> are Java jobs and have well-defined -Xmx parameters that specify the max 
> virtual memory for each task. On the other hand, if each task forks off 
> mappers/reducers written in other languages (python/php, etc), the total 
> virtual memory usage of the process-subtree varies greatly. In these cases, 
> it is better to use kill-tasks-using-physical-memory-limits.




[jira] Commented: (MAPREDUCE-1221) Kill tasks on a node if the free physical memory on that machine falls below a configured threshold

2010-02-19 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12835966#action_12835966
 ] 

Zheng Shao commented on MAPREDUCE-1221:
---

Scott, can you replace TAB with 2 spaces in your code?


> Kill tasks on a node if the free physical memory on that machine falls below 
> a configured threshold
> ---
>
> Key: MAPREDUCE-1221
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1221
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: dhruba borthakur
>Assignee: Scott Chen
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-1221-v1.patch, MAPREDUCE-1221-v2.patch
>
>
> The TaskTracker currently supports killing tasks if the virtual memory of a 
> task exceeds a set of configured thresholds. I would like to extend this 
> feature to enable killing tasks if the physical memory used by that task 
> exceeds a certain threshold.
> On a certain operating system (guess?), if user space processes start using 
> lots of memory, the machine hangs and dies quickly. This means that we would 
> like to prevent map-reduce jobs from triggering this condition. From my 
> understanding, the killing-based-on-virtual-memory-limits (HADOOP-5883) was 
> designed to address this problem. This works well when most map-reduce jobs 
> are Java jobs and have well-defined -Xmx parameters that specify the max 
> virtual memory for each task. On the other hand, if each task forks off 
> mappers/reducers written in other languages (python/php, etc), the total 
> virtual memory usage of the process-subtree varies greatly. In these cases, 
> it is better to use kill-tasks-using-physical-memory-limits.




[jira] Updated: (MAPREDUCE-1221) Kill tasks on a node if the free physical memory on that machine falls below a configured threshold

2010-02-19 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1221:
--

Status: Open  (was: Patch Available)

> Kill tasks on a node if the free physical memory on that machine falls below 
> a configured threshold
> ---
>
> Key: MAPREDUCE-1221
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1221
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.22.0
>Reporter: dhruba borthakur
>Assignee: Scott Chen
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-1221-v1.patch, MAPREDUCE-1221-v2.patch
>
>
> The TaskTracker currently supports killing tasks if the virtual memory of a 
> task exceeds a set of configured thresholds. I would like to extend this 
> feature to enable killing tasks if the physical memory used by that task 
> exceeds a certain threshold.
> On a certain operating system (guess?), if user space processes start using 
> lots of memory, the machine hangs and dies quickly. This means that we would 
> like to prevent map-reduce jobs from triggering this condition. From my 
> understanding, the killing-based-on-virtual-memory-limits (HADOOP-5883) was 
> designed to address this problem. This works well when most map-reduce jobs 
> are Java jobs and have well-defined -Xmx parameters that specify the max 
> virtual memory for each task. On the other hand, if each task forks off 
> mappers/reducers written in other languages (python/php, etc), the total 
> virtual memory usage of the process-subtree varies greatly. In these cases, 
> it is better to use kill-tasks-using-physical-memory-limits.




[jira] Created: (MAPREDUCE-1504) SequenceFile.Reader constructor leaking resources

2010-02-18 Thread Zheng Shao (JIRA)
SequenceFile.Reader constructor leaking resources
-

 Key: MAPREDUCE-1504
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1504
 Project: Hadoop Map/Reduce
  Issue Type: New Feature
Reporter: Zheng Shao


When the {{SequenceFile.Reader}} constructor throws an {{IOException}} (because 
the file does not conform to the {{SequenceFile}} format), the caller never gets 
a reference to the reader and therefore has no way to close it.

We should call {{in.close()}} inside the constructor before rethrowing, to make 
sure that we don't leak resources (the file descriptor, the connection to the 
DataNode, etc).
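
For illustration, here is a minimal sketch of the close-on-failure pattern 
suggested above (this is not the actual {{SequenceFile.Reader}} code; the class 
and method names are made up):

{code}
import java.io.IOException;
import java.io.InputStream;

public class ResourceReader {

  private final InputStream in;

  public ResourceReader(InputStream in) throws IOException {
    this.in = in;
    boolean succeeded = false;
    try {
      readHeader();        // may throw IOException if the stream is not in the expected format
      succeeded = true;
    } finally {
      if (!succeeded) {
        // The caller never gets a reference to this object, so close the stream
        // here to avoid leaking the file descriptor.
        in.close();
      }
    }
  }

  private void readHeader() throws IOException {
    if (in.read() != 'S') {
      throw new IOException("not a valid header");
    }
  }

  public void close() throws IOException {
    in.close();
  }
}
{code}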


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-1501) FileInputFormat to support multi-level/recursive directory listing

2010-02-17 Thread Zheng Shao (JIRA)
FileInputFormat to support multi-level/recursive directory listing
--

 Key: MAPREDUCE-1501
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1501
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Zheng Shao


As we have seen multiple times in the mailing list, users want to have the 
capability of getting all files out of a multi-level directory structure.

4/1/2008: 
http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200804.mbox/%3ce75c02ef0804011433x144813e6x2450da7883de3...@mail.gmail.com%3e

2/3/2009: 
http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200902.mbox/%3c7f80089c-3e7f-4330-90ba-6f1c5b0b0...@nist.gov%3e

6/2/2009: 
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200906.mbox/%3c4a258a16.8050...@darose.net%3e


One solution that our users tried is to write a new FileInputFormat, but that 
would require changing all existing FileInputFormat subclasses to support this 
feature.

We can easily provide a JobConf option (defaulting to false) that makes 
{{FileInputFormat.listStatus(...)}} recurse into the directory structure, as in 
the sketch below.
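
For illustration, here is a minimal sketch of the recursive listing that such an 
option would enable (not a patch; the class and method names are illustrative):

{code}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RecursiveListing {

  /** Returns all files under dir, descending into sub-directories. */
  public static List<FileStatus> listStatusRecursively(FileSystem fs, Path dir)
      throws IOException {
    List<FileStatus> result = new ArrayList<FileStatus>();
    FileStatus[] children = fs.listStatus(dir);
    if (children == null) {
      return result;                     // dir does not exist or is not readable
    }
    for (FileStatus stat : children) {
      if (stat.isDir()) {
        // Descend instead of returning the directory itself as an input.
        result.addAll(listStatusRecursively(fs, stat.getPath()));
      } else {
        result.add(stat);
      }
    }
    return result;
  }
}
{code}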


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1374) Reduce memory footprint of FileSplit

2010-01-19 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12802500#action_12802500
 ] 

Zheng Shao commented on MAPREDUCE-1374:
---

Right, I think the JT uses RawSplits.

This issue is trying to reduce the memory footprint of the JobClient.

We call InputFormat.getSplits(job), which returns all splits in an array. This 
costs a lot of memory. I verified that this is still true for trunk.


> Reduce memory footprint of FileSplit
> 
>
> Key: MAPREDUCE-1374
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1374
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 0.20.1, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Fix For: 0.21.0, 0.22.0
>
> Attachments: MAPREDUCE-1374.1.patch, MAPREDUCE-1374.2.patch, 
> MAPREDUCE-1374.3.patch
>
>
> We can have many FileSplit objects in memory, depending on the number of 
> mappers.
> It will save tons of memory on JobTracker and JobClient if we intern those 
> Strings for host names.
> {code}
> FileInputFormat.java:
>   for (NodeInfo host: hostList) {
> // Strip out the port number from the host name
> -retVal[index++] = host.node.getName().split(":")[0];
> +retVal[index++] = host.node.getName().split(":")[0].intern();
> if (index == replicationFactor) {
>   done = true;
>   break;
> }
>   }
> {code}
> More on String.intern(): 
> http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html
> It will also save a lot of memory by changing the class of {{file}} from 
> {{Path}} to {{String}}. {{Path}} contains a {{java.net.URI}} which internally 
> contains ~10 String fields. This will also be a huge saving.
> {code}
>   private Path file;
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1382) MRAsyncDiscService should tolerate missing local.dir

2010-01-19 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1382:
--

Attachment: MAPREDUCE-1382.branch-0.20.on.top.of.MAPREDUCE-1302.1.patch

patch for 0.20

> MRAsyncDiscService should tolerate missing local.dir
> 
>
> Key: MAPREDUCE-1382
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1382
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Scott Chen
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1382.1.patch, 
> MAPREDUCE-1382.branch-0.20.on.top.of.MAPREDUCE-1302.1.patch
>
>
> Currently, when some of the local.dir directories do not exist, 
> MRAsyncDiscService will fail. It should fail only when none of the 
> directories work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1382) MRAsyncDiscService should tolerate missing local.dir

2010-01-16 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1382:
--

Attachment: MAPREDUCE-1382.1.patch

Added a test that tests deletion of non-existing files.

> MRAsyncDiscService should tolerate missing local.dir
> 
>
> Key: MAPREDUCE-1382
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1382
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Scott Chen
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1382.1.patch
>
>
> Currently, when some of the local.dir directories do not exist, 
> MRAsyncDiscService will fail. It should fail only when none of the 
> directories work.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1374) Reduce memory footprint of FileSplit

2010-01-15 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12800969#action_12800969
 ] 

Zheng Shao commented on MAPREDUCE-1374:
---

Verified that the test error is not related to this patch:
{code}
java.lang.ClassNotFoundException: org.apache.hadoop.mapred.TestTTMemoryReporting
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:169)
{code}


> Reduce memory footprint of FileSplit
> 
>
> Key: MAPREDUCE-1374
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1374
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 0.20.1, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Fix For: 0.21.0, 0.22.0
>
> Attachments: MAPREDUCE-1374.1.patch, MAPREDUCE-1374.2.patch, 
> MAPREDUCE-1374.3.patch
>
>
> We can have many FileSplit objects in memory, depending on the number of 
> mappers.
> It will save tons of memory on JobTracker and JobClient if we intern those 
> Strings for host names.
> {code}
> FileInputFormat.java:
>   for (NodeInfo host: hostList) {
> // Strip out the port number from the host name
> -retVal[index++] = host.node.getName().split(":")[0];
> +retVal[index++] = host.node.getName().split(":")[0].intern();
> if (index == replicationFactor) {
>   done = true;
>   break;
> }
>   }
> {code}
> More on String.intern(): 
> http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html
> It will also save a lot of memory by changing the class of {{file}} from 
> {{Path}} to {{String}}. {{Path}} contains a {{java.net.URI}} which internally 
> contains ~10 String fields. This will also be a huge saving.
> {code}
>   private Path file;
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2010-01-14 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1302:
--

Attachment: MAPREDUCE-1302.branch-0.20.on.top.of.MAPREDUCE-1213.3.patch

Fixed some unit test failures for hadoop-0.20. Note that this patch can only be 
applied to hadoop-0.20 after MAPREDUCE-1213 is applied.


> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch, 
> MAPREDUCE-1302.2.patch, MAPREDUCE-1302.3.patch, MAPREDUCE-1302.4.patch, 
> MAPREDUCE-1302.5.patch, 
> MAPREDUCE-1302.branch-0.20.on.top.of.MAPREDUCE-1213.2.patch, 
> MAPREDUCE-1302.branch-0.20.on.top.of.MAPREDUCE-1213.3.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1213) TaskTrackers restart is very slow because it deletes distributed cache directory synchronously

2010-01-14 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1213:
--

Attachment: MAPREDUCE-1213.branch-0.20.2.patch

Removed unnecessary changes for hadoop 0.20.

> TaskTrackers restart is very slow because it deletes distributed cache 
> directory synchronously
> --
>
> Key: MAPREDUCE-1213
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1213
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.1
>Reporter: dhruba borthakur
>Assignee: Zheng Shao
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-1213.1.patch, MAPREDUCE-1213.2.patch, 
> MAPREDUCE-1213.3.patch, MAPREDUCE-1213.4.patch, 
> MAPREDUCE-1213.branch-0.20.2.patch, MAPREDUCE-1213.branch-0.20.patch
>
>
> We are seeing that when we restart a tasktracker, it tries to recursively 
> delete all the files in the distributed cache. It invokes 
> FileUtil.fullyDelete(), which is very slow. This means that the TaskTracker 
> cannot join the cluster for an extended period of time (up to 2 hours for 
> us). The problem is acute if the number of files in the distributed cache is 
> a few thousand.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1374) Reduce memory footprint of FileSplit

2010-01-13 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1374:
--

Status: Patch Available  (was: Open)

> Reduce memory footprint of FileSplit
> 
>
> Key: MAPREDUCE-1374
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1374
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 0.20.1, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Fix For: 0.21.0, 0.22.0
>
> Attachments: MAPREDUCE-1374.1.patch, MAPREDUCE-1374.2.patch, 
> MAPREDUCE-1374.3.patch
>
>
> We can have many FileSplit objects in memory, depending on the number of 
> mappers.
> It will save tons of memory on JobTracker and JobClient if we intern those 
> Strings for host names.
> {code}
> FileInputFormat.java:
>   for (NodeInfo host: hostList) {
> // Strip out the port number from the host name
> -retVal[index++] = host.node.getName().split(":")[0];
> +retVal[index++] = host.node.getName().split(":")[0].intern();
> if (index == replicationFactor) {
>   done = true;
>   break;
> }
>   }
> {code}
> More on String.intern(): 
> http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html
> It will also save a lot of memory by changing the class of {{file}} from 
> {{Path}} to {{String}}. {{Path}} contains a {{java.net.URI}} which internally 
> contains ~10 String fields. This will also be a huge saving.
> {code}
>   private Path file;
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1374) Reduce memory footprint of FileSplit

2010-01-13 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1374:
--

Status: Open  (was: Patch Available)

> Reduce memory footprint of FileSplit
> 
>
> Key: MAPREDUCE-1374
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1374
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 0.20.1, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Fix For: 0.21.0, 0.22.0
>
> Attachments: MAPREDUCE-1374.1.patch, MAPREDUCE-1374.2.patch, 
> MAPREDUCE-1374.3.patch
>
>
> We can have many FileSplit objects in memory, depending on the number of 
> mappers.
> It will save tons of memory on JobTracker and JobClient if we intern those 
> Strings for host names.
> {code}
> FileInputFormat.java:
>   for (NodeInfo host: hostList) {
> // Strip out the port number from the host name
> -retVal[index++] = host.node.getName().split(":")[0];
> +retVal[index++] = host.node.getName().split(":")[0].intern();
> if (index == replicationFactor) {
>   done = true;
>   break;
> }
>   }
> {code}
> More on String.intern(): 
> http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html
> It will also save a lot of memory by changing the class of {{file}} from 
> {{Path}} to {{String}}. {{Path}} contains a {{java.net.URI}} which internally 
> contains ~10 String fields. This will also be a huge saving.
> {code}
>   private Path file;
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1374) Reduce memory footprint of FileSplit

2010-01-13 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1374:
--

Status: Open  (was: Patch Available)

> Reduce memory footprint of FileSplit
> 
>
> Key: MAPREDUCE-1374
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1374
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 0.20.1, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Fix For: 0.21.0, 0.22.0
>
> Attachments: MAPREDUCE-1374.1.patch, MAPREDUCE-1374.2.patch, 
> MAPREDUCE-1374.3.patch
>
>
> We can have many FileSplit objects in memory, depending on the number of 
> mappers.
> It will save tons of memory on JobTracker and JobClient if we intern those 
> Strings for host names.
> {code}
> FileInputFormat.java:
>   for (NodeInfo host: hostList) {
> // Strip out the port number from the host name
> -retVal[index++] = host.node.getName().split(":")[0];
> +retVal[index++] = host.node.getName().split(":")[0].intern();
> if (index == replicationFactor) {
>   done = true;
>   break;
> }
>   }
> {code}
> More on String.intern(): 
> http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html
> It will also save a lot of memory by changing the class of {{file}} from 
> {{Path}} to {{String}}. {{Path}} contains a {{java.net.URI}} which internally 
> contains ~10 String fields. This will also be a huge saving.
> {code}
>   private Path file;
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1374) Reduce memory footprint of FileSplit

2010-01-13 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1374:
--

Status: Patch Available  (was: Open)

> Reduce memory footprint of FileSplit
> 
>
> Key: MAPREDUCE-1374
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1374
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 0.20.1, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Fix For: 0.21.0, 0.22.0
>
> Attachments: MAPREDUCE-1374.1.patch, MAPREDUCE-1374.2.patch, 
> MAPREDUCE-1374.3.patch
>
>
> We can have many FileSplit objects in memory, depending on the number of 
> mappers.
> It will save tons of memory on JobTracker and JobClient if we intern those 
> Strings for host names.
> {code}
> FileInputFormat.java:
>   for (NodeInfo host: hostList) {
> // Strip out the port number from the host name
> -retVal[index++] = host.node.getName().split(":")[0];
> +retVal[index++] = host.node.getName().split(":")[0].intern();
> if (index == replicationFactor) {
>   done = true;
>   break;
> }
>   }
> {code}
> More on String.intern(): 
> http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html
> It will also save a lot of memory by changing the class of {{file}} from 
> {{Path}} to {{String}}. {{Path}} contains a {{java.net.URI}} which internally 
> contains ~10 String fields. This will also be a huge saving.
> {code}
>   private Path file;
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1333) Parallel running tasks on one single node may slow down the performance

2010-01-13 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799895#action_12799895
 ] 

Zheng Shao commented on MAPREDUCE-1333:
---

How many CPU cores does each node have?

> Parallel running tasks on one single node may slow down the performance
> ---
>
> Key: MAPREDUCE-1333
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1333
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: jobtracker, task, tasktracker
>Affects Versions: 0.20.1
>Reporter: Zhaoning Zhang
>
> When I analyzed running-task performance, I found that tasks running in 
> parallel on a single node do not perform better than serialized ones.
> We can set mapred.tasktracker.{map|reduce}.tasks.maximum = 1 individually, 
> but there will still be parallel map AND reduce tasks.
> And I wonder whether this is true in real commercial clusters?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1374) Reduce memory footprint of FileSplit

2010-01-13 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1374:
--

Attachment: MAPREDUCE-1374.3.patch

Added comment before "Path getPath()" to address Todd's comment.


> Reduce memory footprint of FileSplit
> 
>
> Key: MAPREDUCE-1374
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1374
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 0.20.1, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Fix For: 0.21.0, 0.22.0
>
> Attachments: MAPREDUCE-1374.1.patch, MAPREDUCE-1374.2.patch, 
> MAPREDUCE-1374.3.patch
>
>
> We can have many FileSplit objects in memory, depending on the number of 
> mappers.
> It will save tons of memory on JobTracker and JobClient if we intern those 
> Strings for host names.
> {code}
> FileInputFormat.java:
>   for (NodeInfo host: hostList) {
> // Strip out the port number from the host name
> -retVal[index++] = host.node.getName().split(":")[0];
> +retVal[index++] = host.node.getName().split(":")[0].intern();
> if (index == replicationFactor) {
>   done = true;
>   break;
> }
>   }
> {code}
> More on String.intern(): 
> http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html
> It will also save a lot of memory by changing the class of {{file}} from 
> {{Path}} to {{String}}. {{Path}} contains a {{java.net.URI}} which internally 
> contains ~10 String fields. This will also be a huge saving.
> {code}
>   private Path file;
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1374) Reduce memory footprint of FileSplit

2010-01-13 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799890#action_12799890
 ] 

Zheng Shao commented on MAPREDUCE-1374:
---

Thanks Todd.
Yes, I see the merit of adding a weak-reference map in the Path class. That will 
still consume several times more memory than a String, but it will help remove 
the potential duplicate Path objects.


> Reduce memory footprint of FileSplit
> 
>
> Key: MAPREDUCE-1374
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1374
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 0.20.1, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Fix For: 0.21.0, 0.22.0
>
> Attachments: MAPREDUCE-1374.1.patch, MAPREDUCE-1374.2.patch
>
>
> We can have many FileSplit objects in memory, depending on the number of 
> mappers.
> It will save tons of memory on JobTracker and JobClient if we intern those 
> Strings for host names.
> {code}
> FileInputFormat.java:
>   for (NodeInfo host: hostList) {
> // Strip out the port number from the host name
> -retVal[index++] = host.node.getName().split(":")[0];
> +retVal[index++] = host.node.getName().split(":")[0].intern();
> if (index == replicationFactor) {
>   done = true;
>   break;
> }
>   }
> {code}
> More on String.intern(): 
> http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html
> It will also save a lot of memory by changing the class of {{file}} from 
> {{Path}} to {{String}}. {{Path}} contains a {{java.net.URI}} which internally 
> contains ~10 String fields. This will also be a huge saving.
> {code}
>   private Path file;
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1374) Reduce memory footprint of FileSplit

2010-01-12 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1374:
--

Attachment: MAPREDUCE-1374.2.patch

Added test case.

> Reduce memory footprint of FileSplit
> 
>
> Key: MAPREDUCE-1374
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1374
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 0.20.1, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Fix For: 0.21.0, 0.22.0
>
> Attachments: MAPREDUCE-1374.1.patch, MAPREDUCE-1374.2.patch
>
>
> We can have many FileSplit objects in memory, depending on the number of 
> mappers.
> It will save tons of memory on JobTracker and JobClient if we intern those 
> Strings for host names.
> {code}
> FileInputFormat.java:
>   for (NodeInfo host: hostList) {
> // Strip out the port number from the host name
> -retVal[index++] = host.node.getName().split(":")[0];
> +retVal[index++] = host.node.getName().split(":")[0].intern();
> if (index == replicationFactor) {
>   done = true;
>   break;
> }
>   }
> {code}
> More on String.intern(): 
> http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html
> It will also save a lot of memory by changing the class of {{file}} from 
> {{Path}} to {{String}}. {{Path}} contains a {{java.net.URI}} which internally 
> contains ~10 String fields. This will also be a huge saving.
> {code}
>   private Path file;
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1374) Reduce memory footprint of FileSplit

2010-01-12 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1374:
--

Fix Version/s: 0.22.0
   0.21.0
   Status: Patch Available  (was: Open)

> Reduce memory footprint of FileSplit
> 
>
> Key: MAPREDUCE-1374
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1374
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 0.20.1, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Fix For: 0.21.0, 0.22.0
>
> Attachments: MAPREDUCE-1374.1.patch, MAPREDUCE-1374.2.patch
>
>
> We can have many FileSplit objects in memory, depending on the number of 
> mappers.
> It will save tons of memory on JobTracker and JobClient if we intern those 
> Strings for host names.
> {code}
> FileInputFormat.java:
>   for (NodeInfo host: hostList) {
> // Strip out the port number from the host name
> -retVal[index++] = host.node.getName().split(":")[0];
> +retVal[index++] = host.node.getName().split(":")[0].intern();
> if (index == replicationFactor) {
>   done = true;
>   break;
> }
>   }
> {code}
> More on String.intern(): 
> http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html
> It will also save a lot of memory by changing the class of {{file}} from 
> {{Path}} to {{String}}. {{Path}} contains a {{java.net.URI}} which internally 
> contains ~10 String fields. This will also be a huge saving.
> {code}
>   private Path file;
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1374) Reduce memory footprint of FileSplit

2010-01-12 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1374:
--

Attachment: MAPREDUCE-1374.1.patch

I am not sure whether I should create a new String[] in the constructor and 
then change the elements.

Since file is private, this should be compatible with any derived classes.


> Reduce memory footprint of FileSplit
> 
>
> Key: MAPREDUCE-1374
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1374
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 0.20.1, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1374.1.patch
>
>
> We can have many FileSplit objects in memory, depending on the number of 
> mappers.
> It will save tons of memory on JobTracker and JobClient if we intern those 
> Strings for host names.
> {code}
> FileInputFormat.java:
>   for (NodeInfo host: hostList) {
> // Strip out the port number from the host name
> -retVal[index++] = host.node.getName().split(":")[0];
> +retVal[index++] = host.node.getName().split(":")[0].intern();
> if (index == replicationFactor) {
>   done = true;
>   break;
> }
>   }
> {code}
> More on String.intern(): 
> http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html
> It will also save a lot of memory by changing the class of {{file}} from 
> {{Path}} to {{String}}. {{Path}} contains a {{java.net.URI}} which internally 
> contains ~10 String fields. This will also be a huge saving.
> {code}
>   private Path file;
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1374) Reduce memory footprint of FileSplit

2010-01-12 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12799632#action_12799632
 ] 

Zheng Shao commented on MAPREDUCE-1374:
---

This experiment was done on hadoop-0.20. It shows the JobClient memory usage 
when submitting a map-reduce job with around 200K mappers:

jmap before this patch (the client hit OOM before reaching the same stage as 
the second example):
{code}
 num   #instances     #bytes  class name
-----------------------------------------
   1:      188870   18107344  [C
   2:      242616    9704640  java.lang.String
   3:       42850    6543408
   4:       73218    5271696  org.apache.hadoop.hive.ql.io.HiveInputFormat$HiveInputSplit
   5:       42850    5151504
   6:        3570    4693192
   7:       72077    3647360
   8:       73307    3518736  org.apache.hadoop.mapred.FileSplit
   9:       75424    3075008  [Ljava.lang.String;
  10:        3570    2818968
  11:        2741    2524096
...
  14:       10069    1449936  java.net.URI
...
  23:       10065     241560  org.apache.hadoop.fs.Path
{code}


jmap after this patch:
{code}
 num   #instances     #bytes  class name
-----------------------------------------
   1:      199014   14329008  org.apache.hadoop.hive.ql.io.HiveInputFormat$HiveInputSplit
   2:      201801    9818856  [Ljava.lang.String;
   3:      199684    9584832  org.apache.hadoop.mapred.FileSplit
   4:       56594    8211632  [C
   5:       42851    6543872
   6:       42851    5151624
   7:        3570    4693616
   8:       72091    3648368
   9:        3570    2818968
  10:        2517    2675256  [Ljava.lang.Object;
  11:        4763    2531104  [I
  12:        2741    2524320
  13:       62275    2491000  java.lang.String
...
  31:         456      65664  java.net.URI
...
  69:         452      10848  org.apache.hadoop.fs.Path
{code}


String:FileSplit ratio:
before this patch: 3.3 : 1
after this patch: 0.3 : 1

We reduced the number of String objects by a factor of 10!
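
For illustration, here is a hypothetical sketch of the two optimizations 
measured above: interning the host-name strings and storing the file as a 
String instead of a Path (this is not the attached patch; the class name is 
made up):

{code}
import org.apache.hadoop.fs.Path;

public class CompactSplit {

  private final String file;       // the path as a String instead of a Path (a Path carries a URI with ~10 String fields)
  private final String[] hosts;

  public CompactSplit(Path path, String[] hosts) {
    this.file = path.toString();
    // Intern the host names so identical strings are shared across all splits.
    this.hosts = new String[hosts.length];
    for (int i = 0; i < hosts.length; i++) {
      this.hosts[i] = hosts[i].intern();
    }
  }

  public Path getPath() {
    // Rebuild the Path on demand: a small CPU cost per call instead of a per-split memory cost.
    return new Path(file);
  }

  public String[] getLocations() {
    return hosts;
  }
}
{code}

The trade-off is that getPath() rebuilds the Path object on each call, trading a 
little CPU for not keeping a URI alive per split on the JobClient and JobTracker.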


> Reduce memory footprint of FileSplit
> 
>
> Key: MAPREDUCE-1374
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1374
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 0.20.1, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
>
> We can have many FileSplit objects in memory, depending on the number of 
> mappers.
> It will save tons of memory on JobTracker and JobClient if we intern those 
> Strings for host names.
> {code}
> FileInputFormat.java:
>   for (NodeInfo host: hostList) {
> // Strip out the port number from the host name
> -retVal[index++] = host.node.getName().split(":")[0];
> +retVal[index++] = host.node.getName().split(":")[0].intern();
> if (index == replicationFactor) {
>   done = true;
>   break;
> }
>   }
> {code}
> More on String.intern(): 
> http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html
> It will also save a lot of memory by changing the class of {{file}} from 
> {{Path}} to {{String}}. {{Path}} contains a {{java.net.URI}} which internally 
> contains ~10 String fields. This will also be a huge saving.
> {code}
>   private Path file;
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (MAPREDUCE-1374) Reduce memory footprint of FileSplit

2010-01-12 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao reassigned MAPREDUCE-1374:
-

Assignee: Zheng Shao

> Reduce memory footprint of FileSplit
> 
>
> Key: MAPREDUCE-1374
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1374
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 0.20.1, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
>
> We can have many FileSplit objects in memory, depending on the number of 
> mappers.
> It will save tons of memory on JobTracker and JobClient if we intern those 
> Strings for host names.
> {code}
> FileInputFormat.java:
>   for (NodeInfo host: hostList) {
> // Strip out the port number from the host name
> -retVal[index++] = host.node.getName().split(":")[0];
> +retVal[index++] = host.node.getName().split(":")[0].intern();
> if (index == replicationFactor) {
>   done = true;
>   break;
> }
>   }
> {code}
> More on String.intern(): 
> http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html
> It will also save a lot of memory by changing the class of {{file}} from 
> {{Path}} to {{String}}. {{Path}} contains a {{java.net.URI}} which internally 
> contains ~10 String fields. This will also be a huge saving.
> {code}
>   private Path file;
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1374) Reduce memory footprint of FileSplit

2010-01-12 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1374:
--

Description: 
We can have many FileSplit objects in memory, depending on the number of 
mappers.

It will save tons of memory on JobTracker and JobClient if we intern those 
Strings for host names.

{code}
FileInputFormat.java:

  for (NodeInfo host: hostList) {
// Strip out the port number from the host name
-retVal[index++] = host.node.getName().split(":")[0];
+retVal[index++] = host.node.getName().split(":")[0].intern();
if (index == replicationFactor) {
  done = true;
  break;
}
  }
{code}

More on String.intern(): 
http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html


It will also save a lot of memory by changing the class of {{file}} from 
{{Path}} to {{String}}. {{Path}} contains a {{java.net.URI}} which internally 
contains ~10 String fields. This will also be a huge saving.

{code}
  private Path file;
{code}



  was:
We can have many FileSplit objects in memory, depending on the number of 
mappers.
It will save tons of memory on JobTracker and JobClient if we intern those 
Strings for host names.

{code}
FileInputFormat.java:

  for (NodeInfo host: hostList) {
// Strip out the port number from the host name
-retVal[index++] = host.node.getName().split(":")[0];
+retVal[index++] = host.node.getName().split(":")[0].intern();
if (index == replicationFactor) {
  done = true;
  break;
}
  }
{code}

More on String.intern(): 
http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html


Summary: Reduce memory footprint of FileSplit  (was: FileSplit.hosts 
should have the host names "intern"ed)

> Reduce memory footprint of FileSplit
> 
>
> Key: MAPREDUCE-1374
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1374
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Affects Versions: 0.20.1, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>
> We can have many FileSplit objects in memory, depending on the number of 
> mappers.
> It will save tons of memory on JobTracker and JobClient if we intern those 
> Strings for host names.
> {code}
> FileInputFormat.java:
>   for (NodeInfo host: hostList) {
> // Strip out the port number from the host name
> -retVal[index++] = host.node.getName().split(":")[0];
> +retVal[index++] = host.node.getName().split(":")[0].intern();
> if (index == replicationFactor) {
>   done = true;
>   break;
> }
>   }
> {code}
> More on String.intern(): 
> http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html
> It will also save a lot of memory by changing the class of {{file}} from 
> {{Path}} to {{String}}. {{Path}} contains a {{java.net.URI}} which internally 
> contains ~10 String fields. This will also be a huge saving.
> {code}
>   private Path file;
> {code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-1374) FileSplit.hosts should have the host names "intern"ed

2010-01-12 Thread Zheng Shao (JIRA)
FileSplit.hosts should have the host names "intern"ed
-

 Key: MAPREDUCE-1374
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1374
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Affects Versions: 0.20.1, 0.21.0, 0.22.0
Reporter: Zheng Shao


We can have many FileSplit objects in memory, depending on the number of 
mappers.
It will save tons of memory on JobTracker and JobClient if we intern those 
Strings for host names.

{code}
FileInputFormat.java:

  for (NodeInfo host: hostList) {
// Strip out the port number from the host name
-retVal[index++] = host.node.getName().split(":")[0];
+retVal[index++] = host.node.getName().split(":")[0].intern();
if (index == replicationFactor) {
  done = true;
  break;
}
  }
{code}

More on String.intern(): 
http://www.javaworld.com/javaworld/javaqa/2003-12/01-qa-1212-intern.html


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2010-01-12 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1302:
--

Attachment: (was: 
MAPREDUCE-1302.branch-0.20.on.top.of.MAPREDUCE-1213.patch)

> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch, 
> MAPREDUCE-1302.2.patch, MAPREDUCE-1302.3.patch, MAPREDUCE-1302.4.patch, 
> MAPREDUCE-1302.5.patch, 
> MAPREDUCE-1302.branch-0.20.on.top.of.MAPREDUCE-1213.2.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2010-01-12 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1302:
--

Attachment: MAPREDUCE-1302.branch-0.20.on.top.of.MAPREDUCE-1213.2.patch

> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch, 
> MAPREDUCE-1302.2.patch, MAPREDUCE-1302.3.patch, MAPREDUCE-1302.4.patch, 
> MAPREDUCE-1302.5.patch, 
> MAPREDUCE-1302.branch-0.20.on.top.of.MAPREDUCE-1213.2.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2010-01-12 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1302:
--

Attachment: MAPREDUCE-1302.branch-0.20.on.top.of.MAPREDUCE-1213.patch

This patch is for branch-0.20 (on top of MAPREDUCE-1213.branch-0.20.patch)

> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch, 
> MAPREDUCE-1302.2.patch, MAPREDUCE-1302.3.patch, MAPREDUCE-1302.4.patch, 
> MAPREDUCE-1302.5.patch, 
> MAPREDUCE-1302.branch-0.20.on.top.of.MAPREDUCE-1213.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1213) TaskTrackers restart is very slow because it deletes distributed cache directory synchronously

2010-01-12 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1213:
--

Attachment: MAPREDUCE-1213.branch-0.20.patch

Patch for 0.20.

> TaskTrackers restart is very slow because it deletes distributed cache 
> directory synchronously
> --
>
> Key: MAPREDUCE-1213
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1213
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.1
>Reporter: dhruba borthakur
>Assignee: Zheng Shao
> Fix For: 0.22.0
>
> Attachments: MAPREDUCE-1213.1.patch, MAPREDUCE-1213.2.patch, 
> MAPREDUCE-1213.3.patch, MAPREDUCE-1213.4.patch, 
> MAPREDUCE-1213.branch-0.20.patch
>
>
> We are seeing that when we restart a tasktracker, it tries to recursively 
> delete all the files in the distributed cache. It invokes 
> FileUtil.fullyDelete(), which is very slow. This means that the TaskTracker 
> cannot join the cluster for an extended period of time (up to 2 hours for 
> us). The problem is acute if the number of files in the distributed cache is 
> a few thousand.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2010-01-11 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1302:
--

Status: Patch Available  (was: Open)

> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch, 
> MAPREDUCE-1302.2.patch, MAPREDUCE-1302.3.patch, MAPREDUCE-1302.4.patch, 
> MAPREDUCE-1302.5.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2010-01-11 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1302:
--

Status: Open  (was: Patch Available)

> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch, 
> MAPREDUCE-1302.2.patch, MAPREDUCE-1302.3.patch, MAPREDUCE-1302.4.patch, 
> MAPREDUCE-1302.5.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2010-01-11 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1302:
--

Attachment: MAPREDUCE-1302.5.patch

Merged with latest trunk.
Vinod, can you take a look?


> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch, 
> MAPREDUCE-1302.2.patch, MAPREDUCE-1302.3.patch, MAPREDUCE-1302.4.patch, 
> MAPREDUCE-1302.5.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1186) While localizing a DistributedCache file, TT sets permissions recursively on the whole base-dir

2010-01-11 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12798781#action_12798781
 ] 

Zheng Shao commented on MAPREDUCE-1186:
---

Does this change mean that we cannot package a bunch of Python scripts into a 
zip/jar file and let Hadoop unpack and run them?


> While localizing a DistributedCache file, TT sets permissions recursively on 
> the whole base-dir
> ---
>
> Key: MAPREDUCE-1186
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1186
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>  Components: tasktracker
>Affects Versions: 0.21.0
>Reporter: Vinod K V
>Assignee: Amareshwari Sriramadasu
> Fix For: 0.22.0
>
> Attachments: patch-1186-1.txt, patch-1186-2.txt, 
> patch-1186-3-ydist.txt, patch-1186-3-ydist.txt, patch-1186-3.txt, 
> patch-1186-4.txt, patch-1186-5.txt, patch-1186-ydist.txt, 
> patch-1186-ydist.txt, patch-1186.txt
>
>
> This is a performance problem.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2010-01-06 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12797404#action_12797404
 ] 

Zheng Shao commented on MAPREDUCE-1302:
---

Scott, I do use the return value of #getRelativePathName() after comparing it 
with null.
So the other option is to have 2 functions: #isInVolume() and 
#getRelativePathName().
I prefer having only 1 function, for simplicity; see the sketch below.
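
For illustration, a minimal sketch of that single-method approach (names are 
illustrative, not taken from the patch):

{code}
// The return value doubles as the "is it in this volume?" check.
public class VolumePaths {

  /** Returns the path relative to volumeRoot, or null if absolutePath is not inside it. */
  public static String getRelativePathName(String absolutePath, String volumeRoot) {
    String root = volumeRoot.endsWith("/") ? volumeRoot : volumeRoot + "/";
    if (!absolutePath.startsWith(root)) {
      return null;                 // not in this volume; the caller checks for null
    }
    return absolutePath.substring(root.length());
  }
}
{code}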


> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch, 
> MAPREDUCE-1302.2.patch, MAPREDUCE-1302.3.patch, MAPREDUCE-1302.4.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2010-01-06 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1302:
--

Status: Open  (was: Patch Available)

> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch, 
> MAPREDUCE-1302.2.patch, MAPREDUCE-1302.3.patch, MAPREDUCE-1302.4.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2010-01-06 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1302:
--

Status: Patch Available  (was: Open)

> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch, 
> MAPREDUCE-1302.2.patch, MAPREDUCE-1302.3.patch, MAPREDUCE-1302.4.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2010-01-06 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1302:
--

Attachment: MAPREDUCE-1302.4.patch

Modified according to Vinod's comments.

I didn't change the test. I did verify the deletion in the makeSureCleanedUp() 
method.

I will deprecate JobConf.deleteLocalFiles(subdir) in a follow-up jira.


> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch, 
> MAPREDUCE-1302.2.patch, MAPREDUCE-1302.3.patch, MAPREDUCE-1302.4.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2010-01-05 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1302:
--

Attachment: MAPREDUCE-1302.3.patch

Renamed SUBDIR to TOBEDELETED to avoid confusion.
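
For context, here is a rough sketch of the move-then-delete-asynchronously idea 
(this is not the attached patch; the class, method, and directory names are 
made up):

{code}
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncDelete {

  private static final String TOBEDELETED = "toBeDeleted";
  private final ExecutorService deleter = Executors.newSingleThreadExecutor();

  /** Cheaply rename the path into a per-volume trash directory, then delete it in the background. */
  public void moveAndDelete(File volumeRoot, File path) {
    File trashDir = new File(volumeRoot, TOBEDELETED);
    trashDir.mkdirs();
    final File target = new File(trashDir, path.getName() + "_" + System.nanoTime());
    if (path.renameTo(target)) {          // a rename within the same volume is fast
      deleter.submit(new Runnable() {
        public void run() {
          fullyDelete(target);            // the slow recursive delete happens off the critical path
        }
      });
    }
  }

  private static void fullyDelete(File f) {
    File[] children = f.listFiles();
    if (children != null) {
      for (File c : children) {
        fullyDelete(c);
      }
    }
    f.delete();
  }
}
{code}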

> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch, 
> MAPREDUCE-1302.2.patch, MAPREDUCE-1302.3.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2010-01-05 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1302:
--

Status: Open  (was: Patch Available)

> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch, 
> MAPREDUCE-1302.2.patch, MAPREDUCE-1302.3.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2010-01-05 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1302:
--

Status: Patch Available  (was: Open)

> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch, 
> MAPREDUCE-1302.2.patch, MAPREDUCE-1302.3.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2009-12-29 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795205#action_12795205
 ] 

Zheng Shao commented on MAPREDUCE-1302:
---

MAPREDUCE-1141 is fixed by this patch.

> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch, 
> MAPREDUCE-1302.2.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2009-12-29 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1302:
--

Status: Open  (was: Patch Available)

> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch, 
> MAPREDUCE-1302.2.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2009-12-29 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1302:
--

Status: Patch Available  (was: Open)

> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch, 
> MAPREDUCE-1302.2.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2009-12-29 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12795143#action_12795143
 ] 

Zheng Shao commented on MAPREDUCE-1302:
---

The code didn't create a single task for toBeDeleted, but went through the 
toBeDeleted directory and created one task per entry (see the sketch below).
The reasons for that are:
1. This allows parallel deletion of the contents inside toBeDeleted.
2. A single list call per volume shouldn't take too long.
3. If we wanted to create a single task for toBeDeleted, we would need to rename 
it to something else, recreate toBeDeleted, and then move the old one to be a 
subdirectory inside the new toBeDeleted. This would introduce additional 
intermediate states that may be hard to recover from.
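
For readers skimming the thread, here is a minimal, self-contained sketch of the 
approach described in the comment above: scan the per-volume toBeDeleted 
directory once and submit one deletion task per entry to a thread pool. It is an 
illustration only, not the MAPREDUCE-1302 patch itself; the class name, pool 
size, and method names are assumptions made for the example.

{code:java}
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch only: scan <volume>/toBeDeleted once and submit one deletion task per
// entry, so entries are removed in parallel and nothing blocks on a single
// huge recursive delete.
public class ToBeDeletedCleanerSketch {

  private static final String TOBEDELETED = "toBeDeleted";
  private final ExecutorService pool = Executors.newFixedThreadPool(4);

  /** Called once per local volume, e.g. at TaskTracker startup. */
  public void scheduleCleanup(File volumeRoot) {
    File[] entries = new File(volumeRoot, TOBEDELETED).listFiles();
    if (entries == null) {
      return;  // directory does not exist or cannot be listed
    }
    for (final File entry : entries) {
      pool.submit(new Runnable() {   // one task per entry
        public void run() {
          deleteRecursively(entry);
        }
      });
    }
  }

  private static void deleteRecursively(File f) {
    File[] children = f.listFiles();
    if (children != null) {
      for (File c : children) {
        deleteRecursively(c);
      }
    }
    f.delete();
  }
}
{code}

With this shape, one slow or very large entry only ties up a single worker 
thread; the rest of the directory keeps draining in parallel.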


> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch, 
> MAPREDUCE-1302.2.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2009-12-29 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1302:
--

Attachment: MAPREDUCE-1302.2.patch

Added logic to remove the files inside toBeDeleted upon restart.


> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch, 
> MAPREDUCE-1302.2.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2009-12-28 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794865#action_12794865
 ] 

Zheng Shao commented on MAPREDUCE-1302:
---

Good question. There is no special handling right now.
I will list the directory and create one task per item returned.


> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1270) Hadoop C++ Extention

2009-12-23 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794299#action_12794299
 ] 

Zheng Shao commented on MAPREDUCE-1270:
---

Any progress on this?

> Hadoop C++ Extention
> 
>
> Key: MAPREDUCE-1270
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1270
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: task
>Affects Versions: 0.20.1
> Environment:  hadoop linux
>Reporter: Wang Shouyan
>
>   Hadoop C++ extension is an internal project at Baidu. We started it for these 
> reasons:
>1  To provide a C++ API. We mostly used Streaming before, and we also tried to 
> use PIPES, but we did not find PIPES to be more efficient than Streaming. So we 
> think a new C++ extension is needed for us.
>2  Even when using PIPES or Streaming, it is hard to control the memory of the 
> Hadoop map/reduce child JVM.
>3  It costs a lot to read/write/sort TB/PB-scale data in Java. When using 
> PIPES or Streaming, a pipe or socket is not efficient for carrying such huge data.
>What we want to do: 
>1 We do not use the map/reduce child JVM to do any data processing; it just 
> prepares the environment, starts the C++ mapper, tells the mapper which split it 
> should deal with, and reads reports from the mapper until it has finished. The 
> mapper will read records, invoke the user-defined map, do the partitioning, 
> write spills, combine, and merge into file.out. We think these operations can 
> be done by C++ code.
>2 The reducer is similar to the mapper; it is started after the sort has 
> finished, reads from the sorted files, invokes the user-defined reduce, and 
> writes to the user-defined record writer.
>3 We also intend to rewrite shuffle and sort in C++, for efficiency and 
> memory control.
>At first, 1 and 2, then 3.
>What's the difference from PIPES:
>1 Yes, we will reuse most of the PIPES code.
>2 And we should do it more completely: nothing changes in scheduling and 
> management, but everything changes in execution.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2009-12-21 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793314#action_12793314
 ] 

Zheng Shao commented on MAPREDUCE-1302:
---

The test errors are not related.

The first one seems like a random error - it didn't appear on the last Hudson 
test of the same patch.
The next 3 are common to all patches' test results.

org.apache.hadoop.security.authorize.TestServiceLevelAuthorization.testServiceLevelAuthorization
org.apache.hadoop.streaming.TestStreamingExitStatus.testMapFailOk
org.apache.hadoop.streaming.TestStreamingExitStatus.testReduceFailOk
org.apache.hadoop.streaming.TestStreamingKeyValue.testCommandLine


> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2009-12-18 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1302:
--

Status: Open  (was: Patch Available)

> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2009-12-18 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1302:
--

Status: Patch Available  (was: Open)

Transient errors in Hudson (user1 not found).
Submitting again.

> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2009-12-17 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1302:
--

Status: Open  (was: Patch Available)

> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2009-12-17 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1302:
--

Status: Patch Available  (was: Open)

> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2009-12-17 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1302:
--

Attachment: MAPREDUCE-1302.1.patch

This patch is on top of MAPREDUCE-1213, which is already committed.


> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>  Components: tasktracker
>Affects Versions: 0.20.2, 0.21.0, 0.22.0
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch, MAPREDUCE-1302.1.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-1303) Merge org.apache.hadoop.mapred.CleanupQueue with MRAsyncDiskService

2009-12-16 Thread Zheng Shao (JIRA)
Merge org.apache.hadoop.mapred.CleanupQueue with MRAsyncDiskService
---

 Key: MAPREDUCE-1303
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1303
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Zheng Shao
Assignee: Zheng Shao


org.apache.hadoop.mapred.CleanupQueue is very similar to MRAsyncDiskService.
We should be able to simplify the codebase by merging it into 
MRAsyncDiskService.



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2009-12-15 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1302:
--

Status: Patch Available  (was: Open)

> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2009-12-15 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1302:
--

Attachment: MAPREDUCE-1302.0.patch

This patch includes MAPREDUCE-1213. It's just for demo purposes.

> TrackerDistributedCacheManager can delete file asynchronously
> -
>
> Key: MAPREDUCE-1302
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
> Project: Hadoop Map/Reduce
>  Issue Type: Improvement
>Reporter: Zheng Shao
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1302.0.patch
>
>
> With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
> delete files from distributed cache asynchronously.
> That will help make task initialization faster, because task initialization 
> calls the code that localizes files into the cache and may delete some other 
> files.
> The deletion can slow down the task initialization speed.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (MAPREDUCE-1302) TrackerDistributedCacheManager can delete file asynchronously

2009-12-15 Thread Zheng Shao (JIRA)
TrackerDistributedCacheManager can delete file asynchronously
-

 Key: MAPREDUCE-1302
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1302
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Zheng Shao
Assignee: Zheng Shao


With the help of AsyncDiskService from MAPREDUCE-1213, we should be able to 
delete files from distributed cache asynchronously.

That will help make task initialization faster, because task initialization 
calls the code that localizes files into the cache and may delete some other 
files.
The deletion can slow down the task initialization speed.


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-1213) TaskTrackers restart is very slow because it deletes distributed cache directory synchronously

2009-12-15 Thread Zheng Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791236#action_12791236
 ] 

Zheng Shao commented on MAPREDUCE-1213:
---

The contrib test failures do not seem to be related to this patch. I saw the 
same errors on Hudson in the results of other patches.


> TaskTrackers restart is very slow because it deletes distributed cache 
> directory synchronously
> --
>
> Key: MAPREDUCE-1213
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1213
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.1
>Reporter: dhruba borthakur
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1213.1.patch, MAPREDUCE-1213.2.patch, 
> MAPREDUCE-1213.3.patch, MAPREDUCE-1213.4.patch
>
>
> We are seeing that when we restart a TaskTracker, it tries to recursively 
> delete all the files in the distributed cache. It invokes 
> FileUtil.fullyDelete(), which is very slow. This means that the 
> TaskTracker cannot join the cluster for an extended period of time (up to 2 
> hours for us). The problem is acute if the number of files in the distributed 
> cache is in the thousands.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1213) TaskTrackers restart is very slow because it deletes distributed cache directory synchronously

2009-12-15 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1213:
--

Status: Open  (was: Patch Available)

> TaskTrackers restart is very slow because it deletes distributed cache 
> directory synchronously
> --
>
> Key: MAPREDUCE-1213
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1213
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.1
>Reporter: dhruba borthakur
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1213.1.patch, MAPREDUCE-1213.2.patch, 
> MAPREDUCE-1213.3.patch, MAPREDUCE-1213.4.patch
>
>
> We are seeing that when we restart a TaskTracker, it tries to recursively 
> delete all the files in the distributed cache. It invokes 
> FileUtil.fullyDelete(), which is very slow. This means that the 
> TaskTracker cannot join the cluster for an extended period of time (up to 2 
> hours for us). The problem is acute if the number of files in the distributed 
> cache is in the thousands.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1213) TaskTrackers restart is very slow because it deletes distributed cache directory synchronously

2009-12-15 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1213:
--

Status: Patch Available  (was: Open)

> TaskTrackers restart is very slow because it deletes distributed cache 
> directory synchronously
> --
>
> Key: MAPREDUCE-1213
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1213
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.1
>Reporter: dhruba borthakur
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1213.1.patch, MAPREDUCE-1213.2.patch, 
> MAPREDUCE-1213.3.patch, MAPREDUCE-1213.4.patch
>
>
> We are seeing that when we restart a TaskTracker, it tries to recursively 
> delete all the files in the distributed cache. It invokes 
> FileUtil.fullyDelete(), which is very slow. This means that the 
> TaskTracker cannot join the cluster for an extended period of time (up to 2 
> hours for us). The problem is acute if the number of files in the distributed 
> cache is in the thousands.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1213) TaskTrackers restart is very slow because it deletes distributed cache directory synchronously

2009-12-15 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1213:
--

Attachment: MAPREDUCE-1213.4.patch

Changed the function name to moveAndDeleteFromEachVolume.

"AsyncDelete" may have a different meaning - users might still see the files when 
the function returns. This code actually moves the files first.
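
To illustrate the move-then-delete semantics described above (this is not the 
actual moveAndDeleteFromEachVolume implementation from the patch, and the class 
and method names are assumptions for the example), the sketch below first 
renames the target into the volume's toBeDeleted area under a unique name, so 
the caller no longer sees it when the call returns, and only then hands the 
recursive delete to a background thread.

{code:java}
import java.io.File;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

// Sketch: rename synchronously so the path disappears from the caller's view
// immediately, then perform the slow recursive delete on a background thread.
public class MoveThenDeleteSketch {

  private static final String TOBEDELETED = "toBeDeleted";
  private final ExecutorService deleter = Executors.newSingleThreadExecutor();
  private final AtomicLong uniqueId = new AtomicLong();

  /**
   * Moves <volume>/<relativePath> into <volume>/toBeDeleted under a unique
   * name and schedules its physical deletion. When this method returns, the
   * original path is already gone and can be reused by the caller.
   */
  public void moveAndDelete(File volume, String relativePath) throws IOException {
    File target = new File(volume, relativePath);
    File trashDir = new File(volume, TOBEDELETED);
    if (!trashDir.isDirectory() && !trashDir.mkdirs()) {
      throw new IOException("Cannot create " + trashDir);
    }
    final File moved = new File(trashDir,
        uniqueId.incrementAndGet() + "_" + target.getName());
    if (!target.renameTo(moved)) {
      throw new IOException("Cannot rename " + target + " to " + moved);
    }
    deleter.submit(new Runnable() {
      public void run() {
        deleteRecursively(moved);  // slow part runs off the caller's thread
      }
    });
  }

  private static void deleteRecursively(File f) {
    File[] children = f.listFiles();
    if (children != null) {
      for (File c : children) {
        deleteRecursively(c);
      }
    }
    f.delete();
  }
}
{code}

The design point is that only the cheap rename happens on the caller's thread; 
the expensive recursive delete is deferred, which is why "asyncDelete" would 
have been a misleading name.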


> TaskTrackers restart is very slow because it deletes distributed cache 
> directory synchronously
> --
>
> Key: MAPREDUCE-1213
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1213
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.1
>Reporter: dhruba borthakur
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1213.1.patch, MAPREDUCE-1213.2.patch, 
> MAPREDUCE-1213.3.patch, MAPREDUCE-1213.4.patch
>
>
> We are seeing that when we restart a TaskTracker, it tries to recursively 
> delete all the files in the distributed cache. It invokes 
> FileUtil.fullyDelete(), which is very slow. This means that the 
> TaskTracker cannot join the cluster for an extended period of time (up to 2 
> hours for us). The problem is acute if the number of files in the distributed 
> cache is in the thousands.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1213) TaskTrackers restart is very slow because it deletes distributed cache directory synchronously

2009-12-14 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1213:
--

Attachment: MAPREDUCE-1213.3.patch

> TaskTrackers restart is very slow because it deletes distributed cache 
> directory synchronously
> --
>
> Key: MAPREDUCE-1213
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1213
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.1
>Reporter: dhruba borthakur
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1213.1.patch, MAPREDUCE-1213.2.patch, 
> MAPREDUCE-1213.3.patch
>
>
> We are seeing that when we restart a TaskTracker, it tries to recursively 
> delete all the files in the distributed cache. It invokes 
> FileUtil.fullyDelete(), which is very slow. This means that the 
> TaskTracker cannot join the cluster for an extended period of time (up to 2 
> hours for us). The problem is acute if the number of files in the distributed 
> cache is in the thousands.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1213) TaskTrackers restart is very slow because it deletes distributed cache directory synchronously

2009-12-14 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1213:
--

Attachment: (was: MAPREDUCE-1213.3.patch)

> TaskTrackers restart is very slow because it deletes distributed cache 
> directory synchronously
> --
>
> Key: MAPREDUCE-1213
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1213
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.1
>Reporter: dhruba borthakur
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1213.1.patch, MAPREDUCE-1213.2.patch
>
>
> We are seeing that when we restart a TaskTracker, it tries to recursively 
> delete all the files in the distributed cache. It invokes 
> FileUtil.fullyDelete(), which is very slow. This means that the 
> TaskTracker cannot join the cluster for an extended period of time (up to 2 
> hours for us). The problem is acute if the number of files in the distributed 
> cache is in the thousands.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1213) TaskTrackers restart is very slow because it deletes distributed cache directory synchronously

2009-12-14 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1213:
--

Attachment: MAPREDUCE-1213.3.patch

> TaskTrackers restart is very slow because it deletes distributed cache 
> directory synchronously
> --
>
> Key: MAPREDUCE-1213
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1213
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.1
>Reporter: dhruba borthakur
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1213.1.patch, MAPREDUCE-1213.2.patch
>
>
> We are seeing that when we restart a TaskTracker, it tries to recursively 
> delete all the files in the distributed cache. It invokes 
> FileUtil.fullyDelete(), which is very slow. This means that the 
> TaskTracker cannot join the cluster for an extended period of time (up to 2 
> hours for us). The problem is acute if the number of files in the distributed 
> cache is in the thousands.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1213) TaskTrackers restart is very slow because it deletes distributed cache directory synchronously

2009-12-14 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1213:
--

Attachment: (was: MAPREDUCE-1213.3.patch)

> TaskTrackers restart is very slow because it deletes distributed cache 
> directory synchronously
> --
>
> Key: MAPREDUCE-1213
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1213
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.1
>Reporter: dhruba borthakur
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1213.1.patch, MAPREDUCE-1213.2.patch
>
>
> We are seeing that when we restart a TaskTracker, it tries to recursively 
> delete all the files in the distributed cache. It invokes 
> FileUtil.fullyDelete(), which is very slow. This means that the 
> TaskTracker cannot join the cluster for an extended period of time (up to 2 
> hours for us). The problem is acute if the number of files in the distributed 
> cache is in the thousands.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1213) TaskTrackers restart is very slow because it deletes distributed cache directory synchronously

2009-12-14 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1213:
--

Status: Patch Available  (was: Open)

> TaskTrackers restart is very slow because it deletes distributed cache 
> directory synchronously
> --
>
> Key: MAPREDUCE-1213
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1213
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.1
>Reporter: dhruba borthakur
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1213.1.patch, MAPREDUCE-1213.2.patch, 
> MAPREDUCE-1213.3.patch
>
>
> We are seeing that when we restart a TaskTracker, it tries to recursively 
> delete all the files in the distributed cache. It invokes 
> FileUtil.fullyDelete(), which is very slow. This means that the 
> TaskTracker cannot join the cluster for an extended period of time (up to 2 
> hours for us). The problem is acute if the number of files in the distributed 
> cache is in the thousands.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (MAPREDUCE-1213) TaskTrackers restart is very slow because it deletes distributed cache directory synchronously

2009-12-14 Thread Zheng Shao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MAPREDUCE-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zheng Shao updated MAPREDUCE-1213:
--

Attachment: MAPREDUCE-1213.3.patch

This one uses the AsyncDiskService from common.
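
As a rough, hypothetical rendering of the per-volume design behind an async disk 
service - not the AsyncDiskService class from Hadoop common, whose exact API is 
not quoted in this thread - the sketch below keeps one thread pool per volume so 
that a slow disk only delays its own queued deletions.

{code:java}
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: one small thread pool per volume, so slow deletions on one disk do
// not hold up deletions queued for the other disks.
public class PerVolumeAsyncServiceSketch {

  private final Map<String, ExecutorService> executors =
      new HashMap<String, ExecutorService>();

  public PerVolumeAsyncServiceSketch(String[] volumes, int threadsPerVolume) {
    for (String volume : volumes) {
      executors.put(volume, Executors.newFixedThreadPool(threadsPerVolume));
    }
  }

  /** Runs the task on the thread pool dedicated to the given volume. */
  public synchronized void execute(String volume, Runnable task) {
    ExecutorService exec = executors.get(volume);
    if (exec == null) {
      throw new IllegalArgumentException("Unknown volume: " + volume);
    }
    exec.execute(task);
  }

  /** Stops accepting new tasks; already queued deletions still run. */
  public synchronized void shutdown() {
    for (ExecutorService exec : executors.values()) {
      exec.shutdown();
    }
  }
}
{code}

A caller would submit its deletion Runnable for the volume it is cleaning, at 
the point where it previously deleted synchronously.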

> TaskTrackers restart is very slow because it deletes distributed cache 
> directory synchronously
> --
>
> Key: MAPREDUCE-1213
> URL: https://issues.apache.org/jira/browse/MAPREDUCE-1213
> Project: Hadoop Map/Reduce
>  Issue Type: Bug
>Affects Versions: 0.20.1
>Reporter: dhruba borthakur
>Assignee: Zheng Shao
> Attachments: MAPREDUCE-1213.1.patch, MAPREDUCE-1213.2.patch, 
> MAPREDUCE-1213.3.patch
>
>
> We are seeing that when we restart a TaskTracker, it tries to recursively 
> delete all the files in the distributed cache. It invokes 
> FileUtil.fullyDelete(), which is very slow. This means that the 
> TaskTracker cannot join the cluster for an extended period of time (up to 2 
> hours for us). The problem is acute if the number of files in the distributed 
> cache is in the thousands.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


