Thanks!
I will try this!
On Sun, Jun 19, 2011 at 11:16 PM, Christoph Schmitz
christoph.schm...@1und1.de wrote:
Hi JJ,
you can do that by subclassing TextOutputFormat (or whichever output format
you're using) and overloading the getDefaultWorkFile method:
public class MyOutputFormat<K, V>
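For reference, here is a minimal, untested sketch of what such a subclass could look like with the 0.20 "new" API (org.apache.hadoop.mapreduce); the class name and the "OutputFile" prefix are just placeholders:

    import java.io.IOException;

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class MyOutputFormat<K, V> extends TextOutputFormat<K, V> {

        @Override
        public Path getDefaultWorkFile(TaskAttemptContext context,
                                       String extension) throws IOException {
            // The committer knows the task's temporary work directory; files
            // written there are promoted to the job output directory on commit.
            FileOutputCommitter committer =
                (FileOutputCommitter) getOutputCommitter(context);
            // getUniqueFile keeps the task number in the name so that tasks
            // do not overwrite each other, e.g. OutputFile-m-00029.
            return new Path(committer.getWorkPath(),
                            getUniqueFile(context, "OutputFile", extension));
        }
    }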
Hi Christoph,
If I get all 60 GB onto HDFS, can I then split it into 60 1 GB files and then
run a map-red job on those 60 fixed-length text files? If yes, do you have
any idea how to do this?
On Sun, Jun 19, 2011 at 11:28 PM, Christoph Schmitz
christoph.schm...@1und1.de wrote:
JJ,
Another question here about getDefaultWorkFile(): how is it possible
to find out the mapper number that is used in the output? For example, if you have 30
mappers, how can I name the output file (OutputFile) of the 30th mapper
OutputFile_30?
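For what it's worth, the task number is available from the TaskAttemptContext that is passed to getDefaultWorkFile, so a rough sketch (reusing the imports from the MyOutputFormat sketch above; untested) would be:

    @Override
    public Path getDefaultWorkFile(TaskAttemptContext context,
                                   String extension) throws IOException {
        FileOutputCommitter committer =
            (FileOutputCommitter) getOutputCommitter(context);
        // Task ids are zero-based, so the 30th map task has id 29.
        int taskNumber = context.getTaskAttemptID().getTaskID().getId();
        return new Path(committer.getWorkPath(),
                        "OutputFile_" + (taskNumber + 1) + extension);
    }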
On Sun, Jun 19, 2011 at 11:19 PM, Mapred Learn
Simple answer: don't. The Hadoop framework will take care of that for you and
split the file. The logical 60 GB file you see in the HDFS actually *is* split
into smaller chunks (default size is 64 MB) and physically distributed across
the cluster.
Regards,
Christoph
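If you want to see that splitting for yourself, here is a small sketch using the FileSystem API (the path is made up; error handling omitted) that lists the physical blocks of a file and the datanodes holding them:

    import java.util.Arrays;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Hypothetical path; point this at your 60 GB file.
            FileStatus status = fs.getFileStatus(new Path("/user/jj/bigfile.txt"));

            // Each entry is one physical block (64 MB by default) together
            // with the hosts that store its replicas.
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
            }
        }
    }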
Hi,
On Mon, Jun 20, 2011 at 16:13, Mapred Learn mapred.le...@gmail.com wrote:
But this file is a gzipped text file. In this case, it will only go to 1
mapper, whereas if it were split into 60 1 GB files the map-red job would
finish earlier than with one 60 GB file, as it would have 60 mappers
running in parallel.
Evert Lammerts at Sara.nl did something similar to your problem, splitting
a big 2.7 TB file into chunks of 10 GB.
This work was presented at the BioAssist Programmers' Day in January of
this year under the title
Large-Scale Data Storage and Processing for Scientist in The Netherlands
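On the gzipped-input point quoted above: plain gzip is not a splittable codec, so TextInputFormat hands the whole file to a single map task. A rough way to reproduce the codec check that drives that decision in the 0.20 line (hypothetical path; untested sketch):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class SplittableCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            Path input = new Path("/user/jj/bigfile.txt.gz");

            // TextInputFormat treats a file as splittable only if no
            // compression codec matches it; a .gz suffix matches GzipCodec,
            // so the file becomes exactly one split and one mapper.
            CompressionCodec codec =
                new CompressionCodecFactory(conf).getCodec(input);
            System.out.println(input + " codec=" + codec
                + " splittable=" + (codec == null));
        }
    }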
On 06/20/2011 03:24 PM, praveen.pe...@nokia.com wrote:
Hi there, I know the client can set mapred.reduce.tasks to specify the number
of reduce tasks and Hadoop honours it, but mapred.map.tasks is not
honoured by Hadoop. Is there any way to control the number of map tasks?
What I noticed is that Hadoop is
Praveen,
David is correct but we might need to use different terminology. Hadoop looks
at the number of input splits and if the file is not splittable then yes it
will only use 1 mapper for it. In the case of most files (which are splittable)
Hadoop will break them into multiple maps and work
Yes, that is correct. It is indeed looking at the data size. Please
carefully read through again what I wrote - particularly the part about
files getting broken into chunks (aka blocks). If you want fewer map
tasks, then store your files in HDFS with a larger block size. They
will then get split into fewer, larger chunks, and hence fewer map tasks.
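For what it's worth, a minimal sketch with the new API (the job name and input path are made up): raising the minimum split size has the same effect on the number of map tasks as storing the data with a larger block size.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class FewerMaps {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "fewer-maps");
            FileInputFormat.addInputPath(job, new Path("/user/praveen/input"));

            // One map task is created per input split. With a 256 MB minimum
            // split size, several 64 MB blocks are combined into one split,
            // so the job runs roughly a quarter of the map tasks.
            FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);

            // Mapper/reducer classes and job submission omitted; this only
            // shows the split-size knob.
        }
    }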
Dear all,
Is it possible to have binary input to map code written in Python?
Thank you
Hassen
Praveen,
We use CDH3 so the link that I refer to is
http://hadoop.apache.org/common/docs/r0.20.2/mapred-default.html. The reason
why it is defaulting to 2 per node is not because it looks at the number of
cores but because mapred.tasktracker.map.tasks.maximum is set to 2 by default.
There is a
On Mon, Jun 20, 2011 at 4:13 PM, praveen.pe...@nokia.com wrote:
Hi David,
Thanks for the response. I didn't specify anything for the number of concurrent
mappers, but I do see that it shows as 10 on port 50030 (for a 5-node cluster). So I
believe Hadoop is defaulting to the number of cores in the cluster, which is 10.
Hassen,
If you would like to use Python, I would suggest looking into the
streaming API examples around Hadoop.
Matt
-Original Message-
From: Hassen Riahi [mailto:hassen.ri...@cern.ch]
Sent: Monday, June 20, 2011 3:13 PM
To: mapreduce-user@hadoop.apache.org
Subject: mapreduce
Thanks Matt for your reply...
I would like to use a binary file as input to map code written in
Python.
All the examples that I found take a text file as input to the map code (in
Python). Any ideas? Is it feasible?
Hassen
Hassen,
You might want to chase down leads around
https://issues.apache.org/jira/browse/MAPREDUCE-606. It looks like there is a
patch for it on Jira but I am not quite sure if it is working. If it is worth
it to you to keep it in python then it might be worth it to tinker with the
patch...
HTH,
Matt
Hassen,
I've been very successful using Hadoop Streaming, Dumbo, and TypedBytes
as a solution for using Python to implement mappers and reducers.
TypedBytes is a Hadoop encoding format that allows binary data
(including lists and maps) to be encoded in a format that permits the
serialized data to
I'd like to +1 using Dumbo for all things Python and Hadoop
MapReduce. It's one of the better ways to do things.
Do look at the initial conversation here:
http://old.nabble.com/hadoop-streaming-binary-input---image-processing-td23544344.html
as well.
The feature/bug fixes specified in the post