Re: how to change default name of a sequence file

2011-06-20 Thread Mapred Learn
Thanks! I will try this! On Sun, Jun 19, 2011 at 11:16 PM, Christoph Schmitz christoph.schm...@1und1.de wrote: Hi JJ, you can do that by subclassing TextOutputFormat (or whichever output format you're using) and overriding the getDefaultWorkFile method: public class MyOutputFormat<K, V>
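
A minimal sketch of what Christoph describes (new MapReduce API, ~0.20.x); the class name and the base name "mydata" are illustrative, not from the thread:

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MyOutputFormat<K, V> extends TextOutputFormat<K, V> {
  @Override
  public Path getDefaultWorkFile(TaskAttemptContext context, String extension)
      throws IOException {
    FileOutputCommitter committer =
        (FileOutputCommitter) getOutputCommitter(context);
    // Same behaviour as the default implementation, but with "mydata"
    // instead of "part" as the base name (the task number is still appended).
    return new Path(committer.getWorkPath(),
        getUniqueFile(context, "mydata", extension));
  }
}

The job would then be pointed at it with job.setOutputFormatClass(MyOutputFormat.class).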

Re: How to split a big file in HDFS by size

2011-06-20 Thread Mapred Learn
Hi Christopher, If I get all 60 GB on HDFS, can I then split it into 60 1 GB files and then run a map-red job on those 60 fixed-length text files? If yes, do you have any idea how to do this? On Sun, Jun 19, 2011 at 11:28 PM, Christoph Schmitz christoph.schm...@1und1.de wrote: JJ,

Re: how to change default name of a sequence file

2011-06-20 Thread Mapred Learn
Another question about getDefaultWorkFile(): how can I find out the mapper number used for the output? For example, if you have 30 mappers, how can I append it to the output file (OutputFile) of the 30th mapper - OutputFile_30? On Sun, Jun 19, 2011 at 11:19 PM, Mapred Learn
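
One hypothetical way to do this, assuming the same getDefaultWorkFile override: the running task's number is available from the TaskAttemptContext, so it can be appended to the file name (class and file names are made up):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class NumberedOutputFormat<K, V> extends TextOutputFormat<K, V> {
  @Override
  public Path getDefaultWorkFile(TaskAttemptContext context, String extension)
      throws IOException {
    FileOutputCommitter committer =
        (FileOutputCommitter) getOutputCommitter(context);
    // Task IDs are zero-based, so the 30th mapper has id 29.
    int taskNumber = context.getTaskAttemptID().getTaskID().getId();
    return new Path(committer.getWorkPath(),
        "OutputFile_" + (taskNumber + 1) + extension);
  }
}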

AW: How to split a big file in HDFS by size

2011-06-20 Thread Christoph Schmitz
Simple answer: don't. The Hadoop framework will take care of that for you and split the file. The logical 60 GB file you see in HDFS actually *is* split into smaller chunks (default size is 64 MB) and physically distributed across the cluster. Regards, Christoph -Original Message-
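
An illustrative way to see this for yourself: list the physical blocks of the file through the FileSystem API (the path below is made up):

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path("/user/jj/bigfile.txt"));
    // One entry per physical block, with the datanodes holding a replica.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.println("offset=" + block.getOffset()
          + " length=" + block.getLength()
          + " hosts=" + Arrays.toString(block.getHosts()));
    }
  }
}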

Re: AW: How to split a big file in HDFS by size

2011-06-20 Thread Niels Basjes
Hi, On Mon, Jun 20, 2011 at 16:13, Mapred Learn mapred.le...@gmail.com wrote: But this file is a gzipped text file. In this case, it will only go to 1 mapper, unlike the case where it is split into 60 1 GB files, which will make the map-red job finish earlier than one 60 GB file as it will have 60

Re: AW: How to split a big file in HDFS by size

2011-06-20 Thread Marcos Ortiz
Evert Lammerts at Sara.nl did something similar to your problem, splitting a big 2.7 TB file into chunks of 10 GB. This work was presented at the BioAssist Programmers' Day in January of this year under the name Large-Scale Data Storage and Processing for Scientist in The Netherlands

Re: controlling no. of mapper tasks

2011-06-20 Thread David Rosenstrauch
On 06/20/2011 03:24 PM, praveen.pe...@nokia.com wrote: Hi there, I know the client can set mapred.reduce.tasks to specify the no. of reduce tasks and Hadoop honours it, but mapred.map.tasks is not honoured by Hadoop. Is there any way to control the number of map tasks? What I noticed is that Hadoop is

RE: controlling no. of mapper tasks

2011-06-20 Thread GOEKE, MATTHEW (AG/1000)
Praveen, David is correct but we might need to use different terminology. Hadoop looks at the number of input splits and if the file is not splittable then yes it will only use 1 mapper for it. In the case of most files (which are splittable) Hadoop will break them into multiple maps and work
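
A rough sketch of one way to influence the split count for splittable input, using the new-API FileInputFormat helpers (the sizes and job name are examples only):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "split-size-example");
    // Larger minimum split -> fewer, bigger splits -> fewer map tasks;
    // a smaller maximum split has the opposite effect.
    FileInputFormat.setMinInputSplitSize(job, 512L * 1024 * 1024);  // 512 MB
    FileInputFormat.setMaxInputSplitSize(job, 1024L * 1024 * 1024); // 1 GB
  }
}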

Re: controlling no. of mapper tasks

2011-06-20 Thread David Rosenstrauch
Yes, that is correct. It is indeed looking at the data size. Please carefully read through again what I wrote - particularly the part about files getting broken into chunks (aka blocks). If you want fewer map tasks, then store your files in HDFS with a larger block size. They will then get
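
An illustrative sketch of David's suggestion: copy the file into HDFS with a larger client-side block size (256 MB here; the paths are made up), so it later yields fewer splits and hence fewer map tasks:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutWithLargeBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Block size is a client-side setting applied when the file is created.
    conf.setLong("dfs.block.size", 256L * 1024 * 1024);
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("/local/data/bigfile.txt"),
        new Path("/user/jj/bigfile.txt"));
  }
}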

mapreduce and python

2011-06-20 Thread Hassen Riahi
Dear all, Is it possible to have binary input to map code written in Python? Thank you Hassen

RE: controlling no. of mapper tasks

2011-06-20 Thread GOEKE, MATTHEW (AG/1000)
Praveen, We use CDH3 so the link that I refer to is http://hadoop.apache.org/common/docs/r0.20.2/mapred-default.html. The reason why it is defaulting to 2 per node is not because it looks at the number of cores but because mapred.tasktracker.map.tasks.maximum is set to 2 by default. There is a
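
An illustrative mapred-site.xml snippet for raising that per-node limit (the value 4 is just an example; this is a tasktracker-side setting, not a per-job one):

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>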

Re: controlling no. of mapper tasks

2011-06-20 Thread sridhar basam
On Mon, Jun 20, 2011 at 4:13 PM, praveen.pe...@nokia.com wrote: Hi David, Thanks for the response. I didn't specify anything for no. of concurrent mappers but I do see that it shows as 10 on 50030 (for 5 node cluster). So I believe hadoop is defaulting to no. of cores in the cluster which is

RE: mapreduce and python

2011-06-20 Thread GOEKE, MATTHEW (AG/1000)
Hassen, If you would like to use Python, I would suggest looking into the Hadoop Streaming API examples. Matt -Original Message- From: Hassen Riahi [mailto:hassen.ri...@cern.ch] Sent: Monday, June 20, 2011 3:13 PM To: mapreduce-user@hadoop.apache.org Subject: mapreduce
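
A typical streaming invocation of the kind Matt refers to might look like the following; the jar location, script names, and HDFS paths are placeholders:

  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input /user/hassen/input \
    -output /user/hassen/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py -file reducer.py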

Re: mapreduce and python

2011-06-20 Thread Hassen Riahi
Thanks Matt for your reply... I would like to use a binary file as input to map code written in Python. All examples that I found take a text file as input to the map code (in Python). Any ideas? Is it feasible? Hassen Hassen, If you would like to use python as input I would suggest

RE: mapreduce and python

2011-06-20 Thread GOEKE, MATTHEW (AG/1000)
You might want to chase down leads around https://issues.apache.org/jira/browse/MAPREDUCE-606. It looks like there is a patch for it on Jira but I am not quite sure if it is working. If it is worth it to you to keep it in python then it might be worth it to tinker with the patch... HTH, Matt

Re: mapreduce and python

2011-06-20 Thread Jeremy Lewi
Hassen, I've been very successful using Hadoop Streaming, Dumbo, and TypedBytes as a solution for using Python to implement mappers and reducers. TypedBytes is a Hadoop encoding format that allows binary data (including lists and maps) to be encoded in a format that permits the serialized data to

Re: mapreduce and python

2011-06-20 Thread Harsh J
I'd like to +1 using Dumbo for all things Python and Hadoop MapReduce. It's one of the better ways to do things. Do look at the initial conversation here: http://old.nabble.com/hadoop-streaming-binary-input---image-processing-td23544344.html as well. The feature/bug fixes specified in the post