Re: use S3 as input to MR job

2012-07-19 Thread Harsh J
Dan, can you share your error? Plain .gz files (not .tar.gz) are natively supported by Hadoop via its GzipCodec, so if you are facing an error, I believe it is caused by something other than compression. On Fri, Jul 20, 2012 at 6:14 AM, Dan Yi wrote: > i have a MR job to read file on amazon S
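A minimal sketch of the distinction Harsh is drawing: a plain .gz file is a single gzip stream (the format GzipCodec decompresses), while a .tar.gz wraps a tar archive inside that stream, which Hadoop will not unpack for you. The class and strings below are illustrative, not from the original thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRoundTrip {

    // Compress text as plain gzip -- the same single-stream format
    // that Hadoop's GzipCodec handles transparently for .gz inputs.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    // Decompress a plain gzip stream back to text.
    static String gunzip(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        String line = "record1\trecord2";
        System.out.println(gunzip(gzip(line)).equals(line)); // prints "true"
    }
}
```

If the payload were a .tar.gz, gunzip would yield tar archive bytes, not the text records, which is why Hadoop supports only the plain .gz case out of the box.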

use S3 as input to MR job

2012-07-19 Thread Dan Yi
I have an MR job that reads files on Amazon S3 and processes the data on local HDFS. The files are gzipped text files (.gz). I tried to set up the job as below but it won't work; does anyone know what might be wrong? Do I need an extra step to unzip the files first? Thanks. String S3_LOCATION = "s3n://ac
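Dan's snippet is cut off, so here is a hedged sketch of the usual 1.x-era job setup for an s3n:// input. The bucket name, output path, credential values, and driver class are placeholders, not recovered from the original post.

```java
// Sketch only: bucket, paths, credentials, and MyDriver are placeholders.
Configuration conf = new Configuration();
conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");      // placeholder
conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");  // placeholder

Job job = new Job(conf, "s3-gz-input");
job.setJarByClass(MyDriver.class);

// Plain .gz inputs are decompressed transparently by TextInputFormat via
// GzipCodec, so no manual unzip step is needed. Note each .gz file is
// unsplittable and therefore feeds exactly one map task.
FileInputFormat.addInputPath(job, new Path("s3n://my-bucket/input/"));
FileOutputFormat.setOutputPath(job, new Path("hdfs:///user/dan/output"));
```

This is a configuration fragment that assumes a running 1.x cluster; credentials can alternatively live in core-site.xml rather than in code.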

Lines missing from output files (0.20.205.0)

2012-07-19 Thread Berry, Matt
I have a slightly modified Text Output Format that essentially writes each key into its own file. It operates off the premise that my reducer is an identity function and it emits each record one-by-one in the order they come from the collection. Because the records are emitted in order from the

Re: location of Java heap dumps

2012-07-19 Thread Harsh J
You need to ask your job not to discard failed task files. Otherwise they are cleared away (except for logs), which is why you do not see them afterwards. If you're using 1.x/0.20.x, set "keep.failed.task.files" to true in your JobConf/Job.getConfiguration objects before submitting your job. Aft
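The setting Harsh describes can be sketched as below. The job class name is a placeholder, and the exact on-disk layout of retained task directories varies by version, so treat the comment as an assumption.

```java
// Sketch for Hadoop 1.x/0.20.x: retain failed task files (including any
// heap dumps written into the task's working directory) instead of having
// the tasktracker clean them up. Must be set before job submission.
JobConf conf = new JobConf(MyJob.class); // MyJob is a placeholder
conf.setBoolean("keep.failed.task.files", true);
// Retained files remain under the tasktracker's local dirs on the node
// where the attempt ran (somewhere beneath ${mapred.local.dir}; the
// exact path layout varies by version).
```

This is a configuration fragment; it only helps for runs submitted after the flag is set, and the files land on the worker node, not the client machine.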

Re: OutputFormat Theory Question

2012-07-19 Thread Harsh J
Matt, the reducer's reduce(Key, ...) call does proceed in sorted key order. You can safely assume that when the next reduce call begins, you will no longer see the previous Key again, and can hence close your file. This is guaranteed by the sorter framework, and several tests in MR land cover this. On Th
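The guarantee Harsh describes can be sketched in plain Java, independent of Hadoop: because keys arrive in sorted order and never recur, a per-key writer can be closed the moment the key changes. The class and record values are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

public class PerKeyFiles {

    // Simulates a reducer consuming (key, value) records in sorted key
    // order. Since a key never reappears once the next key begins, the
    // previous key's output can be finalized immediately on key change.
    static List<String> closedOrder(String[][] sortedRecords) {
        List<String> closed = new ArrayList<>();
        String current = null;
        for (String[] rec : sortedRecords) {
            String key = rec[0];
            if (current != null && !current.equals(key)) {
                closed.add(current); // safe: "current" will not recur
            }
            current = key;
            // ... in a real OutputFormat, write rec[1] to key's file ...
        }
        if (current != null) {
            closed.add(current); // close the last key's file at the end
        }
        return closed;
    }

    public static void main(String[] args) {
        String[][] records = {{"a","1"},{"a","2"},{"b","3"},{"c","4"}};
        System.out.println(closedOrder(records)); // prints "[a, b, c]"
    }
}
```

Each key's file is closed exactly once, in key order, which is the property Matt's one-file-per-key output format relies on.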

OutputFormat Theory Question

2012-07-19 Thread Berry, Matt
From what I gather about how MapReduce operates, there isn't really any functional difference between whether a single OutputFormat object is initialized on a central node or each reducer task initializes its own OutputFormat object. What I would like to know, however, is the relationshi

RE: location of Java heap dumps

2012-07-19 Thread Marek Miglinski
Thanks Markus, but as I said, I have only read access on the nodes and can't make that change, so the question remains open. Marek M. From: Markus Jelsma [markus.jel...@openindex.io] Sent: Wednesday, July 18, 2012 9:06 PM To: mapreduce-user@hadoop.apache.org S