What does the jobtracker web page say is the total reduce capacity?
-Joey
On Mar 10, 2012, at 5:39, WangRamon wrote:
> Hi All
>
> I'm using Hadoop-0.20-append, the cluster contains 3 nodes, for each node I
> have 14 map and 14 reduce slots, here is the configuration:
>
>
>
>
Are you asking who assigns the task to HostB or who makes sure a task assigned
to HostB reads from HostB's local copy?
The first is the job tracker. The second is the DFSClient used by the task.
-Joey
On Mar 7, 2012, at 7:57, Pedro Costa wrote:
> Hi,
>
> In MapReduce, if the locations of th
You don't need to call readFields(), the FileStatus objects are
already initialized. You should just be able to call the various
getters to get the fields that you're interested in.
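For example, something like this (rough sketch; the directory path and the conf
variable are placeholders for your own):

    FileSystem fs = FileSystem.get(conf);
    for (FileStatus status : fs.listStatus(new Path("/some/dir"))) {
      // the getters hand you the fields directly
      System.out.println(status.getPath() + "\t" + status.getLen()
          + "\t" + status.getModificationTime());
    }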
-Joey
On Mon, Mar 5, 2012 at 9:03 AM, Piyush Kansal wrote:
> Harsh,
>
> When I trying to readFields as follows:
>
Most people use either the fair or capacity schedulers. If you read those links
I sent earlier, you can decide which better fits your use cases.
-Joey
Sent from my iPhone
On Mar 4, 2012, at 14:44, Mohit Anchlia wrote:
>
>
> On Sun, Mar 4, 2012 at 4:15 AM, Joey Echeverria w
than the number of TaskTrackers,
you're much more likely to get node-local assignments.
-Joey
On Sat, Mar 3, 2012 at 10:44 PM, Mohit Anchlia wrote:
> On Sat, Mar 3, 2012 at 7:41 PM, Joey Echeverria wrote:
>>
>> Sorry, I meant have you set the mapred.jobtracker.taskScheduler
Sorry, I meant have you set the mapred.jobtracker.taskScheduler
property in your mapred-site.xml file. If not, you're using the
standard, FIFO scheduler. The default scheduler doesn't do data-local
scheduling, but the fair scheduler and capacity scheduler do. You want
to set mapred.jobtracker.taskS
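For reference, the mapred-site.xml entry would look something like this (fair
scheduler shown as an example; make sure the scheduler jar is on the
JobTracker's classpath):

    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.FairScheduler</value>
    </property>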
Which scheduler are you using?
-Joey
On Mar 3, 2012, at 18:52, Hassen Riahi wrote:
> Hi all,
>
> We tried using mapreduce to execute a simple map code which read a txt file
> stored in HDFS and write then the output.
> The file to read is a very small one. It was not split and written entirel
Try adding the log4j.properties file to the distributed cache, e.g.:
hadoop jar job.jar -config conf -files conf/log4j.properties my.package.Class
arg1
-Joey
On Feb 29, 2012, at 16:15, GUOJUN Zhu wrote:
>
> What I found out is that the default conf/log4j.properties set root with INFO
> and
Have you checked out this example:
https://cwiki.apache.org/confluence/display/MRUNIT/Testing+Word+Count
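The gist of it is something like this (sketch only; WordCountMapper stands in
for whatever mapper you're testing, and the exact driver construction varies a
little between MRUnit versions):

    MapDriver<LongWritable, Text, Text, IntWritable> driver =
        new MapDriver<LongWritable, Text, Text, IntWritable>()
            .withMapper(new WordCountMapper());
    driver.withInput(new LongWritable(0), new Text("cat cat dog"));
    driver.withOutput(new Text("cat"), new IntWritable(1));
    driver.withOutput(new Text("cat"), new IntWritable(1));
    driver.withOutput(new Text("dog"), new IntWritable(1));
    driver.runTest();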
On Mon, Feb 27, 2012 at 2:54 PM, Akhtar Muhammad Din
wrote:
> Hi,
> I have been looking for a way to do unit testing of map reduce programs too.
> There is not much of help or documentation a
It looks like your partitioner is an inner class. Try making it static:
public static class MOPartition extends Partitioner<K, V> {  // K, V = your key/value types
    public MOPartition() {}
    // ... your getPartition() implementation ...
}
On Fri, Feb 24, 2012 at 3:48 PM, Piyush Kansal wrote:
> Hi,
>
> I am right now stuck with an issue while extending the Partitioner class:
>
Are you using one of the security enabled releases of Hadoop
(0.20.20x, 1.0.x, 0.23.x, CDH3)? Assuming you are, you'll need to modify your
code to do something like the following to impersonate a user:
UserGroupInformation.createRemoteUser("cuser").doAs(new
Privi
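Spelled out, the pattern is roughly this (sketch; "cuser" and the body of
run() are placeholders for your own user and job-submission code):

    UserGroupInformation ugi = UserGroupInformation.createRemoteUser("cuser");
    ugi.doAs(new PrivilegedExceptionAction<Void>() {
      public Void run() throws Exception {
        // submit the job / touch HDFS as "cuser" here
        return null;
      }
    });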
per on job-tracker.
>
> Thanks,
> Thamizh
>
>
> On Thu, Feb 16, 2012 at 6:56 PM, Joey Echeverria wrote:
>
>> Hi Tamil,
>>
>> I'd recommend upgrading to a newer release as 0.19.2 is very old. As for
>> your question, most input formats should set the num
Hi Tamil,
I'd recommend upgrading to a newer release as 0.19.2 is very old. As for
your question, most input formats should set the number of mappers correctly.
What input format are you using? Where did you see the number of tasks it
assigned to the job?
-Joey
On Thu, Feb 16, 2012 at 1:40 AM, Tham
Reduce output is normally stored in HDFS, just like your other files.
Are you seeing different behavior?
-Joey
On Sun, Jan 29, 2012 at 1:05 AM, aliyeh saeedi wrote:
> Hi
> I want to save reducers outputs like other files in Hadoop. Does NameNode
> keep any information about them? How can I do th
I'd add crunch (https://github.com/cloudera/crunch) and remove Hoop as
it's integrated with Hadoop in 0.23.1+.
-Joey
On Sat, Jan 28, 2012 at 10:59 AM, Ayad Al-Qershi wrote:
> I'm compiling a list of all Hadoop ecosystem/sub projects ordered
> alphabetically and I need your help if I missed somet
ting the use at the cluster gateway the only way? Once the user is
> in the cluster, if I am not wrong the user can pretend as any user.
>
> Praveen
>
> On Thu, Dec 29, 2011 at 8:49 PM, Joey Echeverria wrote:
>>
>> Yes, it means that 0.22 doesn't support Kerbero
Yes, it means that 0.22 doesn't support Kerberos.
-Joey
On Thu, Dec 29, 2011 at 9:41 AM, Praveen Sripati
wrote:
> Hi,
>
> The release notes for 0.22
> (http://hadoop.apache.org/common/releases.html#10+December%2C+2011%3A+release+0.22.0+available)
> it says
>
>>The following features are not supp
Hi James,
By default, there is no guarantees on value order. Using some of the
more advanced API features, you can perform a secondary sort of
values. You can read a good example of it here:
http://sonerbalkir.blogspot.com/2010/01/simulating-secondary-sort-on-values.html
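The moving parts are usually wired up like this (class names here are
hypothetical; the composite key holds the natural key plus the field you want
the values ordered by):

    job.setPartitionerClass(NaturalKeyPartitioner.class);        // partition on the natural key only
    job.setSortComparatorClass(CompositeKeyComparator.class);    // sort on natural key, then the value field
    job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);  // group reducer input on the natural key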
-Joey
On Mon, Dec 5, 20
You want option 3.
Option 1 is only used to compress intermediate output; it doesn't apply to
map-only jobs.
Option 2 only enables compression for SequenceFileOutputFormat. If you're
not using that output format, it won't help.
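Assuming option 3 here is the mapred.output.compress / FileOutputFormat route,
the new-API calls are roughly (GzipCodec is just an example codec):

    job.setNumReduceTasks(0);  // map-only job
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);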
-Joey
On Monday, November 7, 2011, Claudio Martella wrote:
> Hello
Yes, you can read the file in the configure() (old api) and setup()
(new api) methods. The data can be saved in a variable that will be
accessible to every call to map().
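A rough sketch with the new API (the "my.side.file" property and the
tab-separated format are just for illustration):

    private Map<String, String> lookup;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
      lookup = new HashMap<String, String>();
      Path path = new Path(context.getConfiguration().get("my.side.file"));
      FileSystem fs = path.getFileSystem(context.getConfiguration());
      BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(path)));
      String line;
      while ((line = reader.readLine()) != null) {
        String[] parts = line.split("\t");
        lookup.put(parts[0], parts[1]);   // available to every map() call afterwards
      }
      reader.close();
    }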
-Joey
On Mon, Oct 31, 2011 at 7:45 PM, Arko Provo Mukherjee
wrote:
> Hello,
> I have a situation where I am reading a big fil
Have you looked into bulk imports? You can write your data into HDFS
and then run a MapReduce job to generate the files that HBase uses to
serve data. After the job finishes, there's a utility to copy the
files into HBase's directory and your data is visible. Check out
http://hbase.apache.org/bulk-
You can also check out Apache Whirr (http://whirr.apache.org/) if you
decide to roll your own Hadoop clusters on EC2. It's crazy easy to get
a cluster up and running with it.
-Joey
On Wed, Oct 26, 2011 at 3:04 PM, Kai Ju Liu wrote:
> Hi Arun. Thanks for the prompt reply! It's a bit of a bummer t
> Is the configured amount of tasks for reuse a suggestion or will it actually
> use it? For example, if I’ve configured it to use a JVM for 4 tasks, will a
> TaskTracker that has 8 tasks to process use 2 JVMs? Or does it decide if it
> actually wants to reuse one up to the maximum configured num
ustom application master for his job type right?
>
> Matt
>
> -Original Message-
> From: Joey Echeverria [mailto:j...@cloudera.com]
> Sent: Tuesday, October 04, 2011 11:06 AM
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: Submitting a Hadoop task from withing a r
You may want to check out Yarn, coming in Hadoop 0.23:
https://issues.apache.org/jira/browse/MAPREDUCE-279
-Joey
On Tue, Oct 4, 2011 at 11:45 AM, Yaron Gonen wrote:
> Hi,
> Hadoop tasks are always stacked to form a linear user-managed workflow (a
> reduce step cannot start before all previous m
The ulimit should be set to 1.5 times the heap. One thing to note is that
the unit is in KB.
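For example, with a 512 MB task heap you'd end up with something like this
(512 * 1.5 = 768 MB = 786432 KB):

    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx512m</value>
    </property>
    <property>
      <name>mapred.child.ulimit</name>
      <value>786432</value>
    </property>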
-Joey
On Sep 30, 2011 1:24 PM, "Steve Lewis" wrote:
> I have a small hadoop task which is running out of memory on a colleague's
> cluster.
> I looked at has mapred-site.xml and find
>
>
> mapred.child.java.o
> The question is: the intermediary (before any reducer) results of completed
> individual tasks are recorded in the HDFS, right? So why are these results
> discarded, since the loss of the tasktracker is not the loss of already
> processed data?
Intermediate results are stored on the local disks
Printed output goes to the task logs. You can see those in the log directory on the
tasktracker nodes or through the jobtracker web GUI.
-Joey
On Sep 26, 2011, at 19:47, Arko Provo Mukherjee
wrote:
> Hi,
>
> I am writing some Map Reduce programs in pseudo-distributed mode.
>
> I am getting some
Doesn't look like it to me:
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/Job.html
2011/9/23 谭军 :
> Joey Echeverria,
> Yes, that works.
> I thought job.addCacheFile(new URI(args[0])); could run on hadoop-0.20.2.
> Because hadoop-0.20.2 could r
> Hi Joey Echeverria,
> My hadoop version is 0.20.2
>
> --
>
> Regards!
>
> Jun Tan
>
> On 2011-09-24 11:36:08, "Joey Echeverria" wrote:
>>Which version of Hadoop are you using?
>>
>>2011/9/23 谭军 :
>>> Harsh,
>>> It is java.net.URI
Which version of Hadoop are you using?
2011/9/23 谭军 :
> Harsh,
> It is java.net.URI that is imported.
>
> --
>
> Regards!
>
> Jun Tan
>
> At 2011-09-24 00:52:14,"Harsh J" wrote:
>>Jun,
>>
>>Common cause is that your URI class is not the right import.
>>
>>It must be java.net.URI and not any other
Do you have assign multiple enabled in the fair scheduler? Even that
may not be able to keep up if the tasks are only taking 10 seconds.
Is there any way you could run the job with fewer splits?
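For reference, assign multiple is turned on with something like this in
mapred-site.xml (property name from memory; worth double-checking against your
fair scheduler version):

    <property>
      <name>mapred.fairscheduler.assignmultiple</name>
      <value>true</value>
    </property>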
On Thu, Sep 22, 2011 at 1:21 PM, Adam Shook wrote:
> Okay, I put a Thread.sleep to test my theory and it will r
'Fraction of the number of maps in the job which should be complete
> before reduces are scheduled for the job.'
>
> Shouldn't the map tasks be completed before the reduce tasks are kicked for
> a particular job?
>
> Praveen
>
> On Thu, Sep 22, 2011 at 6:53 P
The jobs would run in parallel since J1 doesn't use all of your map
slots. Things get more interesting with reduce slots. If J1 is an
overall slower job, and you haven't configured
mapred.reduce.slowstart.completed.maps, then J1 could launch a bunch
of idle reduce tasks which would starve J2.
In g
The map and reduce functions are running in a different JVM, so they
never ran the main() method. You can implement a configure(JobConf
job) method in your map and reduce classes which will be passed the
JobConf you used to launch the job.
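With the old API that looks roughly like this (sketch; my.custom.param is a
made-up property name):

    public class MyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private String myParam;

      @Override
      public void configure(JobConf job) {
        // reads back whatever you set on the JobConf before submitting
        myParam = job.get("my.custom.param", "default");
      }

      // ... map() can use myParam ...
    }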
-Joey
On Wed, Sep 21, 2011 at 9:36 AM, pranjal shrivastava
Have you looked at hue (https://github.com/cloudera/hue)? It has a
web-based GUI file manager.
-Joey
On Tue, Sep 20, 2011 at 6:50 PM, Steve Lewis wrote:
> My dfs is a real mess and I am looking for a good gui fiile manager to allow
> me to clean it up
> deleting a lot of directories
> Anyone wri
FYI, I'm moving this to mapreduce-user@ and bccing common-user@.
It looks like your latest permission problem is on the local disk. What is your
setting for hadoop.tmp.dir? What are the permissions on that directory?
-Joey
On Sep 18, 2011, at 23:27, ArunKumar wrote:
> Hi guys !
>
> Commo
You can also use mrunit [1] to write unit tests against your MapReduce code.
-Joey
[1] http://incubator.apache.org/mrunit/
On Thu, Sep 15, 2011 at 1:18 AM, Subroto Sanyal wrote:
> Hi,
>
> The MapReduce framework provides different built-in approaches to debug a Job:
> before starting a cluster on EC2. :)
>
> Thanks for your time
>
> Marco
>
> On 14 September 2011 14:04, Joey Echeverria wrote:
>> When are you getting the exception? Is it during the setup of your
>> job, or after it's running on the cluster?
>>
To add to what Kevin said, you'll be writing a class that extends FileSystem.
-Joey
On Wed, Sep 14, 2011 at 1:08 PM, Kevin Burton wrote:
> You would probably have to implement your own Hadoop filesystem similar to
> S3 and KFS integrate.
> I looked at it a while back and it didn't seem insanely
When are you getting the exception? Is it during the setup of your
job, or after it's running on the cluster?
-Joey
On Wed, Sep 14, 2011 at 4:50 AM, Marco Didonna wrote:
> Hello everyone,
> sorry to bring this up again but I need some clarification. I wrote a
> map-reduce application that need c
> I tried it but it creates a binary file which i can not understand (i
>>>> need the result of the first job).
>>>> The other thing is how can i use this file in the next chained mapper?
>>>> i.e how can i retrieve the keys and the values in the map function?
>>
Have you tried SequenceFileOutputFormat and SequenceFileInputFormat?
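For a two-job chain that usually boils down to something like this (new API;
the key/value classes and the intermediate path are illustrative):

    // job 1 writes its output as a sequence file
    job1.setOutputFormatClass(SequenceFileOutputFormat.class);
    job1.setOutputKeyClass(Text.class);
    job1.setOutputValueClass(IntWritable.class);
    FileOutputFormat.setOutputPath(job1, intermediateDir);

    // job 2 reads it back with the same key/value types
    job2.setInputFormatClass(SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(job2, intermediateDir);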
-Joey
On Mon, Sep 5, 2011 at 11:49 AM, ilyal levin wrote:
> Hi
> I'm trying to write a chained mapreduce program. i'm doing so with a simple
> loop where in each iteration i
> create a job ,execute it and every time the current
Have you looked at hadoop pipes?
On Sep 5, 2011 6:01 AM, "seven garfee" wrote:
> I see there is a Java version, but for some reason, I need a C++ version.
> Anyone can help?
aken on compressed data instead of original data.
>
> -Original Message-
> From: Joey Echeverria [mailto:j...@cloudera.com]
> Sent: Friday, August 19, 2011 3:07 AM
> To: d...@hive.apache.org
> Subject: Wonky reduce progress
>
> I'm seeing really weird numbers in
You can set mapred.child.env on the JobConf before you submit the job.
If you want to add to the PATH, you can set to something like:
PATH=$PATH:/directory/with/dlls
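In code that's just (the directory is an example):

    conf.set("mapred.child.env", "PATH=$PATH:/directory/with/dlls");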
-Joey
On Mon, Aug 8, 2011 at 5:04 PM, Curtis Jensen wrote:
> Is it possible to set the MS. Windows PATH environment variable for
Bhandarkar
>> Greenplum Labs, EMC
>> (Disclaimer: Opinions expressed in this email are those of the author, and
>> do
>> not necessarily represent the views of any organization, past or present,
>> the author might be affiliated with.)
>>
>>
>>
>>
Hadoop reuses objects as an optimization. If you need to keep a copy
in memory, you need to call clone yourself. I've never used Avro, but
my guess is that the BARs are not reused, only the FOO.
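With plain Writables the copy looks like this (sketch; Text used as an example
type):

    List<Text> kept = new ArrayList<Text>();
    for (Text value : values) {
      kept.add(new Text(value));   // copy it; "value" itself is reused by the framework
    }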
-Joey
On Wed, Aug 3, 2011 at 3:18 AM, Vyacheslav Zholudev
wrote:
> Hi all,
>
> I'm using Avro as a se
You're running out of memory trying to generate the splits. You need to set
a bigger heap for your driver program. Assuming you're using the hadoop jar
command to launch your job, you can do this by setting HADOOP_HEAPSIZE to a
larger value in $HADOOP_HOME/conf/hadoop-env.sh
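For example (the value is in MB):

    # in $HADOOP_HOME/conf/hadoop-env.sh
    export HADOOP_HEAPSIZE=2000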
-Joey
On Jul 24, 2011
ArrayWritables can't be deserialized because they don't encode the type of
the objects with the data. The solution is to sub-class ArrayWritable with
your specific type. In your case, you'd need to do this:
public class IntArrayWritable extends ArrayWritable {
    public IntArrayWritable() {
        super(IntWritable.class);
    }
}
It depends on the versions. Some minor updates are compatible with each other
and you can do rolling restarts.
If you're using less than 50% of your total storage, you could decommission
half of your cluster, upgrade that half, distcp to the new cluster and then
upgrade the other half.
-Joey
This property applies to a tasktracker rather than to an individual job.
Therefore it needs to be set in the mapred-site.xml and the daemon
restarted.
-Joey
On Jul 1, 2011 7:01 PM, wrote:
> Are you sure? AFAIK all mapred.xxx properties can be set via job config. I
also read on yahoo tutorial that thi
ile is on my disk(for example: D://test.seq),
> and how to write a java class to parse it?
>
> 2011/6/27 Joey Echeverria
>>
>> If the data is text you can always print out the sequence file using
>> this command:
>>
>> hadoop fs -text file:///my/directory
If the data is text you can always print out the sequence file using
this command:
hadoop fs -text file:///my/directory/file.seq
This will parse the sequence file, convert each key and value to a
string and print it to stdout. Notice the file:// in the path, that
will cause hadoop to access the l
> Now, in case that this input file is not split based on HDFS block but
> one-split per file. I will have in consequence only 1 mapper since I have
> only 1 input split. Where the computation of the mapper takes place? in
> machineA or machineB or machine C or in another machine inside the cluster
You could pipe 'yes' to the hadoop command:
yes | hadoop namenode -format
-Joey
On Wed, Jun 22, 2011 at 4:46 PM, Virajith Jalaparti
wrote:
> Hi,
>
> When I try to reformat HDFS (I have to multiple times for some experiment I
> need to run), it asks for a confirmation Y/N. Is there a way to disa
The only way to do that is to drop the setting down to one and bounce
the TaskTrackers.
-Joey
On Tue, Jun 21, 2011 at 12:52 PM, Jonathan Zukerman
wrote:
> Hi,
> Is there a way to set the maximum map tasks for all tasktrackers in my
> cluster for a certain job?
> Most of my tasktrackers are confi
Set your output format to the NullOutputFormat:
job.setOutputFormat(NullOutputFormat.class);
-Joey
On Jun 2, 2011, at 6:21, Pedro Costa wrote:
> What I meant in this question is put the processed result of the
> reduce task in something like /dev/null. How can I do that?
>
> On Thu, Jun 2, 2
They write directly to HDFS, there's no additional buffering on the
local file system of the client.
-Joey
On Tue, May 31, 2011 at 7:56 PM, Mapred Learn wrote:
> Hi guys,
> I asked this question earlier but did not get any response. So, posting
> again. Hope somebody can point to the right descr
Hi Karthik,
FYI, I'm moving this thread to mapreduce-user@hadoop.apache.org (You
and common-user are BCCed).
My guess is that your task trackers are throwing a lot of exceptions
which are getting logged. Can you send a snippet of the logs to help
diagnose why it's logging so much? Can you also le
Hi Mark,
FYI, I'm moving the discussion over to
mapreduce-user@hadoop.apache.org since your question is specific to
MapReduce.
You can derive the output name from the TaskAttemptID which you can
get by calling getTaskAttemptID() on the context passed to your
cleanup() function. The task attempt i
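Something along these lines (new API sketch; the naming scheme is just an
example):

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
      TaskAttemptID attempt = context.getTaskAttemptID();
      int taskId = attempt.getTaskID().getId();
      String name = String.format("my-output-%05d", taskId);
      // ... write or rename your side output using "name" ...
    }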
Look at getSplits() on SequenceFileInputFormat.
-Joey
On May 23, 2011 5:09 AM, "Vincent Xue" wrote:
> Hello Hadoop Users,
>
> I would like to know if anyone has ever tried splitting an input
> sequence file by key instead of by size. I know that this is unusual
> for the map reduce paradigm
Just last week I worked on a REST interface hosted in Tomcat that
launched a MR job. In my case, I included the jar with the job in the
WAR and called the run() method (the job implemented Tool). The only
tricky part was that a copy of the Hadoop configuration files needed to be
in the classpath, but I j
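The launch itself ends up looking something like this (the config paths and
the MyJobTool class are placeholders):

    Configuration conf = new Configuration();
    // either keep the cluster's *-site.xml files on the webapp classpath,
    // or add them explicitly:
    conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
    conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml"));
    int exitCode = ToolRunner.run(conf, new MyJobTool(), new String[] { "arg1" });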
You could write your own input format class to handle breaking out the
tar files for you. If you subclass FileInputFormat, Hadoop will handle
decompressing the files because of the .gz file extension. Your input
format would just need to use a Java tar file library (e.g.
http://code.google.com/p/jt
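A bare-bones skeleton might look like this (new API; TarballRecordReader is a
hypothetical reader you'd write to walk the tar entries):

    public class TarballInputFormat extends FileInputFormat<Text, BytesWritable> {

      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        return false;   // hand each tarball to a single mapper
      }

      @Override
      public RecordReader<Text, BytesWritable> createRecordReader(
          InputSplit split, TaskAttemptContext context)
          throws IOException, InterruptedException {
        return new TarballRecordReader();
      }
    }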
You need to add a call to MultipleOutputs.close() in your reducer's cleanup:
public void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();
    ...
}
On Fri, May 6, 2011 at 1:55 PM, Geoffry Roberts
wrote:
> All,
>
> I am attempting to take a large file and split it up into a series of
> smal
Hadoop uses an InputFormat class to parse files and generate key,
value pairs for your Mapper. An InputFormat is any class which extends
the base abstract class:
http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapreduce/InputFormat.html
The default InputFormat parses text files
Just to confirm, you restarted hadoop after making the changes to
mapred-site.xml?
-Joey
On Fri, Apr 29, 2011 at 11:53 AM, Donatella Firmani
wrote:
> Hi Alex,
>
> I'm just editing mapred-site.xml in /conf directory of my hadoop
> installation root.
> I'm running in pseudo-distributed mode?
>
> S
It was initializing a 200MB buffer to do the sorting of the output in.
How much space did you allocate the task JVMs (mapred.child.java.opts
in mapred-site.xml)?
If you didn't change the default, it's set to 200MB which is why you
would run out of memory trying to allocate a 200MB buffer.
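Bumping the task heap in mapred-site.xml would look like this (512 MB shown as
an example; alternatively, lower io.sort.mb):

    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx512m</value>
    </property>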
-Joey
O