[jira] Commented: (MAPREDUCE-469) Support concatenated gzip and bzip2 files

2010-06-15 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879022#action_12879022
 ] 

David Ciemiewicz commented on MAPREDUCE-469:


Greg, I cannot think of any good reason to keep the current behavior of 
failing to read concatenated gzip files. So requiring end users to actively 
set a flag, io.compression.gzip.concat, just to get the behavior every user 
expects seems ass-backwards. Reading concatenated files should be the default 
behavior.
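
For readers unfamiliar with the behavior under discussion, here is a minimal Python sketch. RFC 1952 allows a gzip file to consist of several members, and a compliant reader returns the concatenation of all of them. The sketch contrasts a reader that stops after the first member (the behavior this issue describes) with one that keeps going. It uses Python's zlib module, not Hadoop's codec, and the function names are hypothetical.

```python
import gzip
import zlib

GZIP_WBITS = 16 + zlib.MAX_WBITS  # adding 16 tells zlib to expect gzip framing


def read_first_member_only(data: bytes) -> bytes:
    """Decode the first gzip member and ignore everything after it."""
    d = zlib.decompressobj(GZIP_WBITS)
    return d.decompress(data)


def read_all_members(data: bytes) -> bytes:
    """Decode every gzip member in turn and join the results."""
    out = []
    while data:
        d = zlib.decompressobj(GZIP_WBITS)
        out.append(d.decompress(data))
        out.append(d.flush())
        data = d.unused_data  # bytes left over after this member ended
    return b"".join(out)


part1 = gzip.compress(b"contents of part 1\n")
part2 = gzip.compress(b"contents of part 2\n")
concatenated = part1 + part2  # same bytes as `cat part1.gz part2.gz`

print(read_first_member_only(concatenated))
print(read_all_members(concatenated))
```

Python's own gzip.decompress reads all members of such input, which matches what the GNU tools do.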

 Support concatenated gzip and bzip2 files
 -

 Key: MAPREDUCE-469
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-469
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Tom White
Assignee: Greg Roelofs
 Attachments: grr-hadoop-common.dif.20100614c, 
 grr-hadoop-mapreduce.dif.20100614c


 When running MapReduce with concatenated gzip files as input only the first 
 part is read, which is confusing, to say the least. Concatenated gzip is 
 described in http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage 
 and in http://www.ietf.org/rfc/rfc1952.txt. (See original report at 
 http://www.nabble.com/Problem-with-Hadoop-and-concatenated-gzip-files-to21383097.html)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (MAPREDUCE-469) Support concatenated gzip and bzip2 files

2010-06-15 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12879160#action_12879160
 ] 

David Ciemiewicz commented on MAPREDUCE-469:


Greg, I have yet to encounter ANYONE who doesn't consider this a bug, because 
all cited references EXPECT concatenated files to work, and they do work in ALL 
OTHER cited instances, including GNU tools, web browsers, etc. Can you think of 
a single instance where it would be the right thing to stop reading a 
concatenated file after the first part is read, ignoring all other concatenated 
parts? Forgive me, but suggesting that we keep the existing behavior seems 
absurd, because I cannot think of a single case where this would be the right 
thing to do.





[jira] Commented: (MAPREDUCE-469) Support concatenated gzip and bzip2 files

2010-06-14 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12878848#action_12878848
 ] 

David Ciemiewicz commented on MAPREDUCE-469:


On vacation Mon-Wed Feb 15-17. Offsite Thu-Fri, Feb 18-19.





[jira] Commented: (MAPREDUCE-477) Support for reading bzip2 compressed file created using concatenation of multiple .bz2 files

2010-05-26 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12871853#action_12871853
 ] 

David Ciemiewicz commented on MAPREDUCE-477:


This does not appear to be solved in the version of hadoop that I am using: 
Hadoop 0.20.10.0.1004192217

I cannot speak as to whether or not this is fixed in the trunk.

I created two files file1.bz2 and file2.bz2 and concatenated them into 
file12.bz2

-bash-3.1$ bzcat file12.bz2
contents of file1.bz2
contents of file2.bz2

I then ran a simple pig script to dump the contents of this file:

-bash-3.1$ cat concat.pig
A = load 'file12.bz2' using PigStorage();
dump A;


The output below shows that only the first file in the concatenation is read. 
The subsequent file is not read.

-bash-3.1$ pig -Dmapred.job.queue.name=... concat.pig
USING: /grid/0/gs/pig/current
2010-05-26 17:54:06,501 [main] INFO  org.apache.pig.Main - Logging error messages to: /homes/ciemo/.../pig_1274896446499.log
2010-05-26 17:54:06,750 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://...:8020
2010-05-26 17:54:07,001 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: ...:50300
2010-05-26 17:54:07,830 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2010-05-26 17:54:07,830 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2010-05-26 17:54:08,804 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2010-05-26 17:54:08,835 [Thread-9] WARN  org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2010-05-26 17:54:09,834 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Cannot get jobid for this job
2010-05-26 17:54:32,745 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2010-05-26 17:55:09,412 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2010-05-26 17:55:09,412 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Successfully stored result in: hdfs://...
2010-05-26 17:55:11,158 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Records written : 1
2010-05-26 17:55:11,159 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Bytes written : 34
2010-05-26 17:55:11,159 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
(contents of file1.bz2)

The dump should have shown the contents of both file1.bz2 and file2.bz2.
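
The symptom above can be reproduced outside Hadoop. This is a rough Python sketch, assuming only the standard bz2 module: a decoder that handles a single bzip2 stream returns just the first file's contents, while a loop that restarts on the leftover bytes returns both. The function names are hypothetical and this is not the Bzip2Codec implementation.

```python
import bz2

file1 = bz2.compress(b"contents of file1.bz2\n")
file2 = bz2.compress(b"contents of file2.bz2\n")
file12 = file1 + file2  # same bytes as `cat file1.bz2 file2.bz2 > file12.bz2`


def first_stream_only(data: bytes) -> bytes:
    """Decode one bzip2 stream; anything after it is left unread."""
    return bz2.BZ2Decompressor().decompress(data)


def all_streams(data: bytes) -> bytes:
    """Start a fresh decoder for each stream until the input is used up."""
    out = []
    while data:
        d = bz2.BZ2Decompressor()
        out.append(d.decompress(data))
        data = d.unused_data  # bytes that followed the end of this stream
    return b"".join(out)


print(first_stream_only(file12))  # what the pig dump showed
print(all_streams(file12))        # what bzcat shows
```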

 Support for reading bzip2 compressed file created using concatenation of 
 multiple .bz2 files 
 -

 Key: MAPREDUCE-477
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-477
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Suhas Gogate
Priority: Minor

 Bzip2Codec supported in Hadoop 0.19/0.20 should support reading bzip2 
 compressed files created by concatenating multiple .bz2 files.




[jira] Commented: (MAPREDUCE-1545) Add 'first-task-launched' to job-summary

2010-05-18 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12868917#action_12868917
 ] 

David Ciemiewicz commented on MAPREDUCE-1545:
-

Is this only for first task or are we measuring wait times in pending state 
for ALL tasks?

Last week, we ran out of reduce slots. I had one job where 150 of my reduce 
tasks had completed but the other 150 were in a pending state (not running) for 
over 1 hour.

Are we measuring pending time for all tasks or only for first task?
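
To make the distinction concrete, here is a small Python sketch with made-up numbers shaped like the incident above. The Task type, the field name, and the timings are all hypothetical and are not taken from the JobTracker or the attached patches. It shows how a first-task metric can look healthy while half the tasks wait over an hour.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    launch_offset_s: float  # seconds between job submission and task launch


def first_task_wait(tasks):
    """Wait until the first task launched (what 'first-task-launched' captures)."""
    return min(t.launch_offset_s for t in tasks)


def pending_summary(tasks):
    """Pending time across ALL tasks, not just the first one."""
    waits = sorted(t.launch_offset_s for t in tasks)
    return {"first": waits[0], "max": waits[-1], "mean": sum(waits) / len(waits)}


# Illustrative only: 150 reduces launch after 30 s, 150 wait 65 minutes.
tasks = [Task(f"r_{i:06d}", 30.0) for i in range(150)]
tasks += [Task(f"r_{i:06d}", 3900.0) for i in range(150, 300)]

print(first_task_wait(tasks))
print(pending_summary(tasks))
```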

 Add 'first-task-launched' to job-summary
 

 Key: MAPREDUCE-1545
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1545
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: jobtracker
Reporter: Arun C Murthy
Assignee: Luke Lu
 Fix For: 0.22.0

 Attachments: mr-1545-trunk-v1.patch, mr-1545-trunk-v2.patch, 
 mr-1545-y20s-v1.patch, mr-1545-y20s-v2.patch, mr-1545-y20s-v3.patch


 It would be useful to track 'first-task-launched' time to job-summary for 
 better reporting.




[jira] Commented: (MAPREDUCE-1485) CapacityScheduler should have prevent a single job taking over large parts of a cluster

2010-04-07 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854609#action_12854609
 ] 

David Ciemiewicz commented on MAPREDUCE-1485:
-

It would also be good if any single user were limited as to the total number 
of tasks.

This way it doesn't matter whether a user has a single job with 20,000 tasks or 
20 jobs with 1,000 tasks.
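
A minimal Python sketch of the idea, assuming a hypothetical per-user cap applied across all of that user's jobs. The function name and the cap value are illustrative and are not CapacityScheduler settings.

```python
def tasks_allowed(requested_per_job, user_task_limit):
    """Grant slots to a user's jobs in order until the per-user cap is used up."""
    remaining = user_task_limit
    granted = []
    for requested in requested_per_job:
        grant = min(requested, remaining)
        granted.append(grant)
        remaining -= grant
    return granted


USER_TASK_LIMIT = 5000  # hypothetical cap

one_big_job = tasks_allowed([20000], USER_TASK_LIMIT)
many_small_jobs = tasks_allowed([1000] * 20, USER_TASK_LIMIT)

# Either way the user holds the same number of concurrent tasks.
print(sum(one_big_job), sum(many_small_jobs))
```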


 CapacityScheduler should have prevent a single job taking over large parts of 
 a cluster
 ---

 Key: MAPREDUCE-1485
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1485
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
  Components: contrib/capacity-sched
Reporter: Arun C Murthy
Assignee: Arun C Murthy
 Fix For: 0.22.0


 The proposal is to have a per-queue limit on the number of concurrent tasks a 
 job can run on a cluster. 
 We've seen cases where a single, large, job took over a majority of the 
 cluster - worse, it meant that any bug in it caused issues for both the 
 NameNode _and_ the JobTracker.




[jira] Commented: (MAPREDUCE-469) Support concatenated gzip and bzip2 files

2010-04-06 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/MAPREDUCE-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12854268#action_12854268
 ] 

David Ciemiewicz commented on MAPREDUCE-469:


Unfortunately, I did not discover that concatenated bzip2 files do not work in 
Map-Reduce until *AFTER* I had concatenated 3TB and over 250K compressed 
files.

A colleague suggested that I fix my data using the following approach:

hadoop dfs -cat X | bunzip2 | bzip2 | hadoop dfs -put - X.new

I tried this with a 3GB single file concatenation of multiple bzip2 compressed 
files.

This process took just over an hour with compression taking 5-6X longer than 
decompression (as measured in CPU utilization).

It only took several minutes to concatenate the multiple part files into a 
single file.


I think that this points out that decompressing and recompressing data is not 
really a viable solution for creating large concatenations of smaller files.
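
A back-of-the-envelope estimate, as a Python sketch. It assumes the cost scales linearly with data size, a single serial pipeline, and 1000 GB per TB, and it rounds "just over an hour" down to one hour. None of these assumptions were measured.

```python
sample_gb = 3        # size of the test concatenation
sample_hours = 1.0   # "just over an hour", rounded down
total_gb = 3 * 1000  # the full 3TB data set, assuming 1000 GB per TB

# Linear extrapolation from the 3GB test to the whole data set.
est_hours = sample_hours * (total_gb / sample_gb)
est_days = est_hours / 24

print(f"{est_hours:.0f} hours, about {est_days:.1f} days of serial recompression")
```

Running many such pipelines in parallel would cut the wall-clock time, but the total CPU cost stays the same.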

The best performing solution is to create the smaller part files in parallel 
with a bunch of reducers, then concatenate them later into one (or several) 
larger files.

So fixing Hadoop Map-Reduce to read concatenated files is probably the 
highest return on investment for the community.




 Support concatenated gzip and bzip2 files
 -

 Key: MAPREDUCE-469
 URL: https://issues.apache.org/jira/browse/MAPREDUCE-469
 Project: Hadoop Map/Reduce
  Issue Type: Improvement
Reporter: Tom White
Assignee: Ravi Gummadi

 When running MapReduce with concatenated gzip files as input only the first 
 part is read, which is confusing, to say the least. Concatenated gzip is 
 described in http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage 
 and in http://www.ietf.org/rfc/rfc1952.txt. (See original report at 
 http://www.nabble.com/Problem-with-Hadoop-and-concatenated-gzip-files-to21383097.html)
