[jira] Commented: (MAPREDUCE-469) Support concatenated gzip and bzip2 files
[ https://issues.apache.org/jira/browse/MAPREDUCE-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879022#action_12879022 ]

David Ciemiewicz commented on MAPREDUCE-469:

Greg, I cannot think of any good reason to keep the current behavior of failing to read concatenated gzip files. Requiring end users to actively set a flag (io.compression.gzip.concat) just to get the behavior every user already expects seems backwards. Reading concatenated files should be the default behavior.

Support concatenated gzip and bzip2 files
-----------------------------------------

Key: MAPREDUCE-469
URL: https://issues.apache.org/jira/browse/MAPREDUCE-469
Project: Hadoop Map/Reduce
Issue Type: Improvement
Reporter: Tom White
Assignee: Greg Roelofs
Attachments: grr-hadoop-common.dif.20100614c, grr-hadoop-mapreduce.dif.20100614c

When running MapReduce with concatenated gzip files as input, only the first part is read, which is confusing, to say the least. Concatenated gzip is described in http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage and in http://www.ietf.org/rfc/rfc1952.txt. (See original report at http://www.nabble.com/Problem-with-Hadoop-and-concatenated-gzip-files-to21383097.html)

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
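The single-member-versus-concatenated distinction being argued here can be reproduced outside Hadoop. The following is a minimal Python sketch (illustrative only, not Hadoop's GzipCodec) of why a decoder must loop over gzip members rather than stop at the first trailer:

```python
import gzip
import zlib

part1 = gzip.compress(b"contents of file1\n")
part2 = gzip.compress(b"contents of file2\n")
concat = part1 + part2  # what `cat file1.gz file2.gz > file12.gz` produces

# A naive decoder stops at the first gzip trailer, analogous to the
# "only the first part is read" behavior the comment complains about:
naive = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
first_member_only = naive.decompress(concat)
# naive.unused_data now holds the untouched second member.

# A member-aware decoder loops until the input is exhausted, which is
# what gunzip, web browsers, and the proposed default behavior all do:
def decompress_all_members(data):
    out = []
    while data:
        d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
        out.append(d.decompress(data))
        data = d.unused_data
    return b"".join(out)
```

The `wbits=16 + zlib.MAX_WBITS` setting tells zlib to expect gzip framing; the loop restarts a fresh decompressor on each member's leftover bytes.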
[jira] Commented: (MAPREDUCE-469) Support concatenated gzip and bzip2 files
[ https://issues.apache.org/jira/browse/MAPREDUCE-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12879160#action_12879160 ]

David Ciemiewicz commented on MAPREDUCE-469:

Greg, I have yet to encounter ANYONE who doesn't consider this a bug, because all cited references EXPECT concatenated files to work; they work in ALL OTHER cited instances, including GNU tools, web browsers, etc. Can you think of a single instance where it would be the right thing to stop reading a concatenated file after the first part is read, ignoring all other concatenated parts? Forgive me, but suggesting that we keep the existing behavior seems absurd, because I cannot think of a single case where it would be the right thing to do.
[jira] Commented: (MAPREDUCE-469) Support concatenated gzip and bzip2 files
[ https://issues.apache.org/jira/browse/MAPREDUCE-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12878848#action_12878848 ]

David Ciemiewicz commented on MAPREDUCE-469:

On vacation Mon-Wed, Feb 15-17. Offsite Thu-Fri, Feb 18-19.
[jira] Commented: (MAPREDUCE-477) Support for reading bzip2 compressed file created using concatenation of multiple .bz2 files
[ https://issues.apache.org/jira/browse/MAPREDUCE-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12871853#action_12871853 ]

David Ciemiewicz commented on MAPREDUCE-477:

This does not appear to be solved in the version of Hadoop that I am using: Hadoop 0.20.10.0.1004192217. I cannot speak as to whether or not this is fixed in trunk.

I created two files, file1.bz2 and file2.bz2, and concatenated them into file12.bz2:

-bash-3.1$ bzcat file12.bz2
contents of file1.bz2
contents of file2.bz2

I then ran a simple Pig script to dump the contents of this file:

-bash-3.1$ cat concat.pig
A = load 'file12.bz2' using PigStorage();
dump A;

The output below shows that only the first file in the concatenation is read; the subsequent file is not.

-bash-3.1$ pig -Dmapred.job.queue.name=... concat.pig
USING: /grid/0/gs/pig/current
2010-05-26 17:54:06,501 [main] INFO org.apache.pig.Main - Logging error messages to: /homes/ciemo/.../pig_1274896446499.log
2010-05-26 17:54:06,750 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://...:8020
2010-05-26 17:54:07,001 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: ...:50300
2010-05-26 17:54:07,830 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2010-05-26 17:54:07,830 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2010-05-26 17:54:08,804 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2010-05-26 17:54:08,835 [Thread-9] WARN org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2010-05-26 17:54:09,834 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Cannot get jobid for this job
2010-05-26 17:54:32,745 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2010-05-26 17:55:09,412 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2010-05-26 17:55:09,412 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Successfully stored result in: hdfs://...
2010-05-26 17:55:11,158 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Records written : 1
2010-05-26 17:55:11,159 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Bytes written : 34
2010-05-26 17:55:11,159 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!

(contents of file1.bz2)

The dump should have shown the contents of both file1.bz2 and file2.bz2.

Support for reading bzip2 compressed file created using concatenation of multiple .bz2 files
--------------------------------------------------------------------------------------------

Key: MAPREDUCE-477
URL: https://issues.apache.org/jira/browse/MAPREDUCE-477
Project: Hadoop Map/Reduce
Issue Type: Improvement
Reporter: Suhas Gogate
Priority: Minor

The Bzip2Codec supported in Hadoop 0.19/0.20 should support reading a bzip2 compressed file created by concatenating multiple .bz2 files.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
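The behavior shown in the Pig transcript can be reproduced with the standard-library bz2 module. The following is an illustrative sketch (not Hadoop's Bzip2Codec) of a single-stream decompressor stopping after the first .bz2 stream versus a loop that reads them all:

```python
import bz2

# Byte-level concatenation of two bzip2 streams, like
# `cat file1.bz2 file2.bz2 > file12.bz2` in the transcript above.
concat = bz2.compress(b"contents of file1.bz2\n") + bz2.compress(b"contents of file2.bz2\n")

# A single-stream decompressor stops at the first stream's end, which
# mirrors "only the first file in the concatenation is read":
single = bz2.BZ2Decompressor()
first_stream_only = single.decompress(concat)
# single.unused_data holds the bytes of the second stream, unread.

# Reading every stream requires looping on unused_data:
def decompress_all_streams(data):
    out = []
    while data:
        d = bz2.BZ2Decompressor()
        out.append(d.decompress(data))
        data = d.unused_data
    return b"".join(out)
```

This is the same pattern bzcat follows, which is why the command-line tool prints both parts while a single-stream reader stops after the first.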
[jira] Commented: (MAPREDUCE-1545) Add 'first-task-launched' to job-summary
[ https://issues.apache.org/jira/browse/MAPREDUCE-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868917#action_12868917 ]

David Ciemiewicz commented on MAPREDUCE-1545:

Is this only for the first task, or are we measuring wait times in the pending state for ALL tasks? Last week, we ran out of reduce slots. I had one job where 150 of my reduce tasks had completed, but the other 150 were in a pending state (not running) for over an hour. Are we measuring pending time for all tasks or only for the first task?

Add 'first-task-launched' to job-summary
----------------------------------------

Key: MAPREDUCE-1545
URL: https://issues.apache.org/jira/browse/MAPREDUCE-1545
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: jobtracker
Reporter: Arun C Murthy
Assignee: Luke Lu
Fix For: 0.22.0
Attachments: mr-1545-trunk-v1.patch, mr-1545-trunk-v2.patch, mr-1545-y20s-v1.patch, mr-1545-y20s-v2.patch, mr-1545-y20s-v3.patch

It would be useful to track the 'first-task-launched' time in the job-summary for better reporting.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
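The distinction the comment raises can be made concrete with a small sketch. This is illustrative only (not JobTracker code, and the record shapes are hypothetical): a single 'first-task-launched' field captures only the earliest launch, while per-task pending times would expose long waits like the hour-long reduce backlog described above:

```python
def first_task_pending(submit_time, launch_times):
    # What a 'first-task-launched' job-summary field would capture:
    # wait from job submission until the earliest task launch.
    return min(launch_times) - submit_time

def max_task_pending(submit_time, launch_times):
    # What the comment asks about: the longest any task sat pending,
    # which a first-task-only metric would miss entirely.
    return max(launch_times) - submit_time
```

With a job submitted at t=0 whose tasks launched at t=10s and t=3700s, the first metric reports a 10-second wait while the worst task actually waited over an hour.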
[jira] Commented: (MAPREDUCE-1485) CapacityScheduler should prevent a single job taking over large parts of a cluster
[ https://issues.apache.org/jira/browse/MAPREDUCE-1485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854609#action_12854609 ]

David Ciemiewicz commented on MAPREDUCE-1485:

It would also be good if any single user were limited in the total number of tasks. That way it doesn't matter whether a user has a single job with 20,000 tasks or 20 jobs with 1,000 tasks each.

CapacityScheduler should prevent a single job taking over large parts of a cluster
----------------------------------------------------------------------------------

Key: MAPREDUCE-1485
URL: https://issues.apache.org/jira/browse/MAPREDUCE-1485
Project: Hadoop Map/Reduce
Issue Type: Improvement
Components: contrib/capacity-sched
Reporter: Arun C Murthy
Assignee: Arun C Murthy
Fix For: 0.22.0

The proposal is to have a per-queue limit on the number of concurrent tasks a job can run on a cluster. We've seen cases where a single, large job took over a majority of the cluster; worse, any bug in it caused issues for both the NameNode _and_ the JobTracker.

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
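The per-user variant suggested in the comment amounts to capping the sum of running tasks across all of a user's jobs, not per job. A minimal sketch, with entirely hypothetical names and record shapes (this is not a CapacityScheduler API):

```python
# Illustrative per-user concurrent-task accounting: the cap applies to
# the aggregate across jobs, so one job with 20,000 tasks and 20 jobs
# with 1,000 tasks each are treated identically.
MAX_CONCURRENT_TASKS_PER_USER = 20000  # hypothetical limit

def running_tasks_for_user(jobs, user):
    # Sum running tasks over every job owned by this user.
    return sum(j["running_tasks"] for j in jobs if j["user"] == user)

def may_launch_task(jobs, user):
    # A scheduler check: refuse new launches once the user-wide cap is hit.
    return running_tasks_for_user(jobs, user) < MAX_CONCURRENT_TASKS_PER_USER
```

The design point is simply that aggregating by user closes the loophole of splitting one huge job into many smaller ones.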
[jira] Commented: (MAPREDUCE-469) Support concatenated gzip and bzip2 files
[ https://issues.apache.org/jira/browse/MAPREDUCE-469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12854268#action_12854268 ]

David Ciemiewicz commented on MAPREDUCE-469:

Unfortunately, I discovered that concatenated bzip2 files did not work in MapReduce until *AFTER* I had concatenated 3TB and over 250K compressed files. A colleague suggested that I fix my data using the following approach:

hadoop dfs -cat X | bunzip2 | bzip2 | hadoop dfs -put - X.new

I tried this with a 3GB single-file concatenation of multiple bzip2 compressed files. The process took just over an hour, with compression taking 5-6X longer than decompression (as measured by CPU utilization). By contrast, concatenating the multiple part files into a single file took only several minutes.

I think this shows that decompressing and recompressing data is not a viable way to create large concatenations of smaller files. The best-performing approach is to create the smaller part files in parallel with a bunch of reducers and then concatenate them into one (or several) larger files. So fixing Hadoop MapReduce to read concatenations of files is probably the highest return on investment for the community.

Support concatenated gzip and bzip2 files
-----------------------------------------

Key: MAPREDUCE-469
URL: https://issues.apache.org/jira/browse/MAPREDUCE-469
Project: Hadoop Map/Reduce
Issue Type: Improvement
Reporter: Tom White
Assignee: Ravi Gummadi

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.