Re: Is a Block compressed (GZIP) SequenceFile splittable in MR operation?

2011-01-31 Thread Niels Basjes
Hi, 2011/1/31 Sean Bigdatafun sean.bigdata...@gmail.com: GZIP is not splittable. Correct: gzip is a stream-compression format, which effectively means you can only start decompressing at the beginning of the data. Does that mean a GZIP block compressed sequencefile can't take advantage of
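(For context: block-compressed SequenceFiles remain splittable even with gzip, because the codec is applied per block and sync markers sit between blocks. A minimal sketch of configuring a job to write such output, assuming the 0.20-era org.apache.hadoop.mapred API; treat it as a sketch, not a drop-in job:)
[code]
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

JobConf job = new JobConf();
job.setOutputFormat(SequenceFileOutputFormat.class);
// Compress each block of records rather than the whole stream,
// so input splits can start at any sync marker between blocks.
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
[/code]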

Re: Is a Block compressed (GZIP) SequenceFile splittable in MR operation?

2011-01-31 Thread Harsh J
On Mon, Jan 31, 2011 at 1:56 PM, Sean Bigdatafun sean.bigdata...@gmail.com wrote: How do you control the size of the block to be compressed in a SequenceFile? It is specified when creating a SequenceFile.Writer object; see the various SequenceFile.createWriter() overloads. -- Harsh J www.harshj.com
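(A minimal sketch, assuming the 0.20-era API: the CompressionType.BLOCK argument to createWriter() selects block compression, and the io.seqfile.compress.blocksize property, default about 1 MB, controls how many uncompressed bytes are buffered before each block is compressed and flushed. The path and key/value types below are hypothetical:)
[code]
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;

Configuration conf = new Configuration();
// Buffer roughly this many uncompressed bytes per compressed block
// (the default is about 1 MB).
conf.setInt("io.seqfile.compress.blocksize", 4 * 1024 * 1024);
FileSystem fs = FileSystem.get(conf);
SequenceFile.Writer writer = SequenceFile.createWriter(
    fs, conf, new Path("/tmp/example.seq"),   // hypothetical path
    LongWritable.class, Text.class,
    SequenceFile.CompressionType.BLOCK, new GzipCodec());
writer.append(new LongWritable(1), new Text("hello"));
writer.close();
[/code]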

What are the conditions for, or what is the status of, the re-scheduling feature for failed attempts caused by a dying node?

2011-01-31 Thread Kiss Tibor
Hi! I was running a Hadoop cluster on Amazon EC2 instances, and after 2 days of work one of the worker nodes simply died (I cannot connect to the instance either). That node also appears on the dfshealth page as a dead node. Up to this point everything is normal. Unfortunately, the job it was running

Re: Is a Block compressed (GZIP) SequenceFile splittable in MR operation?

2011-01-31 Thread Sean Bigdatafun
On Mon, Jan 31, 2011 at 12:36 AM, Niels Basjes ni...@basjes.nl wrote: Hi, 2011/1/31 Sean Bigdatafun sean.bigdata...@gmail.com: GZIP is not splittable. Correct: gzip is a stream-compression format, which effectively means you can only start decompressing at the beginning of the data.

Re: Draining/Decommisioning a tasktracker

2011-01-31 Thread rishi pathak
I still need to figure out whether a queue can be associated with a TT, i.e., a TT ACL for a queue such that tasks submitted to that queue will only be relayed to the TTs on the ACL list for that queue. On Mon, Jan 31, 2011 at 10:51 PM, rishi pathak mailmaverick...@gmail.com wrote: Hi Koji,

Re: Is a Block compressed (GZIP) SequenceFile splittable in MR operation?

2011-01-31 Thread Harsh J
Hello, On Mon, Jan 31, 2011 at 10:41 PM, Sean Bigdatafun sean.bigdata...@gmail.com wrote: On Mon, Jan 31, 2011 at 12:36 AM, Niels Basjes ni...@basjes.nl wrote: Hi, 2011/1/31 Sean Bigdatafun sean.bigdata...@gmail.com: GZIP is not splittable. Correct: gzip is a stream-compression format

Retrieve FileStatus given a file path?

2011-01-31 Thread Pedro Costa
Hi, On the reduce side, after the reduce task (RT) has passed the merge phase (before the reduce phase starts), I've got the path of the map_0.out file. I'm opening this file with [code] FSDataInputStream in = fs.open(file); [/code] But I only have the path. Is it possible to obtain the file status of this

Re: Retrieve FileStatus given a file path?

2011-01-31 Thread Pedro Costa
I said file status, but what I would like to know is the size of the file. On Mon, Jan 31, 2011 at 5:56 PM, Pedro Costa psdc1...@gmail.com wrote: Hi, On the reduce side, after the RT had passed the merge phase (before the reduce phase starts), I've got the path of the map_0.out file. I'm

Re: Retrieve FileStatus given a file path?

2011-01-31 Thread Harsh J
FileSystem.getFileStatus(Path path) should return you the goodies, using an appropriate FileSystem implementation (Hint: URI). On Mon, Jan 31, 2011 at 11:30 PM, Pedro Costa psdc1...@gmail.com wrote: I said file status, but what I would like to know is the size of the file. On Mon, Jan 31, 2011
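(A minimal sketch of what Harsh describes, reusing the map_0.out path from the earlier mails; Path.getFileSystem() resolves the right FileSystem implementation from the path's URI scheme, and FileStatus.getLen() returns the size Pedro is after:)
[code]
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
Path file = new Path("map_0.out");         // path as in the earlier mail
FileSystem fs = file.getFileSystem(conf);  // impl chosen from the URI scheme
FileStatus status = fs.getFileStatus(file);
long sizeInBytes = status.getLen();        // the file length in bytes
[/code]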

Map output in disk and memory at same time?

2011-01-31 Thread Pedro Costa
Hi, When the reducer fetches a 1 GB map output from the mappers and does the merge, is it possible that part of the map output is saved on disk and the other part in memory? Or must a map output be saved either all on disk or all in memory? Thanks, -- Pedro

Re: Map output in disk and memory at same time?

2011-01-31 Thread Arun C Murthy
On Jan 31, 2011, at 10:51 AM, Pedro Costa wrote: Hi, When the reducer fetches a 1 GB map output from the mappers and does the merge, is it possible that part of the map output is saved on disk and the other part in memory? Yes, the reduce tries to keep as much in memory as possible.
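(A sketch of the knobs that govern this in-memory/on-disk balance in 0.20-era Hadoop; the property names below are the standard shuffle-tuning parameters of that version, but double-check names and defaults against your release:)
[code]
import org.apache.hadoop.mapred.JobConf;

JobConf job = new JobConf();
// Fraction of the reducer's heap used to buffer fetched map outputs
// before they are spilled to disk (default 0.70).
job.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f);
// When the in-memory buffer is this full, an in-memory merge starts
// and its result is written to disk (default 0.66).
job.setFloat("mapred.job.shuffle.merge.percent", 0.66f);
// Fraction of the heap allowed to retain map outputs in memory during
// the reduce itself; 0.0 means everything is merged to disk first.
job.setFloat("mapred.job.reduce.input.buffer.percent", 0.0f);
[/code]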

Re: What are the conditions for, or what is the status of, the re-scheduling feature for failed attempts caused by a dying node?

2011-01-31 Thread Arun C Murthy
Please don't cross-post; CDH questions should go to their user lists. On Jan 31, 2011, at 6:15 AM, Kiss Tibor wrote: Hi! I was running a Hadoop cluster on Amazon EC2 instances, and after 2 days of work one of the worker nodes simply died (I cannot connect to the instance either).