Hey!
I have a question.
If I copy a file onto the HDFS file system, it will get split into blocks and
the NameNode will keep all of this metadata.
How can I see that info?
I copied a 5 GB file onto the NameNode, but I see that file only on the NameNode.
It does not get split into blocks?
How can I s
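(For reference, one way to inspect block metadata is the fsck tool, assuming the
file was loaded into HDFS with the fs shell rather than just copied onto the
NameNode's local disk; the path below is only an example:)

  hadoop fs -put bigfile.dat /user/me/bigfile.dat
  hadoop fsck /user/me/bigfile.dat -files -blocks -locations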
Thanks Robert, Niels
Yes, I think text manipulation, especially n-grams, is a good application for
me.
Cheers
On Fri, May 20, 2011 at 12:57 AM, Robert Evans wrote:
> I'm not sure if this has been mentioned or not but in Machine Learning with
> text based documents, the first stage is often a glorif
I came up with a nice little hack to trick Hadoop into calculating disk
usage with df instead of du:
http://allthingshadoop.com/2011/05/20/faster-datanodes-with-less-wait-io-using-df-instead-of-du/
I am running this in production; it works like a charm and I'm already
seeing the benefit, woot!
I hope it wor
Hi Eduardo,
Sounds like you've configured your dfs.name.dir to be on HDFS instead
of a local file path.
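Something like this in hdfs-site.xml should do it (the directory is only an
example; on 0.21 the key is also known as dfs.namenode.name.dir):

  <property>
    <name>dfs.name.dir</name>
    <!-- a local filesystem path, not an hdfs:// URI -->
    <value>/var/lib/hadoop/dfs/name</value>
  </property>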
-Todd
On Fri, May 20, 2011 at 2:20 PM, Eduardo Dario Ricci wrote:
> Hy People
>
> I'm starting in hadoop commom.. and got some problem to try using a
> cluster..
>
> I'm following the steps of
Hi People,
I'm starting out with Hadoop Common and ran into a problem trying to use a
cluster.
I'm following the steps on this page:
http://hadoop.apache.org/common/docs/r0.21.0/cluster_setup.html
I did everything, but when I format the HDFS, this error happens:
I searched for something to help-
Thank you, Kai and Joey, for the explanation. That's what I thought about
them, but I did not want to miss a "magical" replacement for a central
service in the counters. No, there is no magic, just plain reality.
Mark
On Fri, May 20, 2011 at 12:39 PM, Kai Voigt wrote:
> Also, with speculative
Also, with speculative execution enabled, you might see a higher count than you
expect while the same task is running multiple times in parallel. When a task
gets killed because another instance was quicker, its counters will be
removed from the global count, though.
Kai
On 20.05.2011 at 19:34
No.
Are you storing the data in sequence files?
-Joey
On Fri, May 20, 2011 at 10:33 AM, W.P. McNeill wrote:
> The keys are Text and the values are large custom data structures serialized
> with Avro.
>
> I also have counters for the job that generates these files that gives me
> this information but
Counters are a way to get status from your running job. They don't
increment global state. They save increments locally and
periodically report them to the central counter. That
means that the final count will be correct, but you can't use counters to
coordinate counts while your job is running.
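A minimal sketch of that behaviour (new mapreduce API; the group and counter
names are made up for illustration):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class CountingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      // Incremented locally in the task and reported to the JobTracker
      // periodically; only the aggregated value after job completion is final.
      context.getCounter("MyApp", "RECORDS_SEEN").increment(1);
    }
  }

  // In the driver, after job.waitForCompletion(true):
  //   long n = job.getCounters().findCounter("MyApp", "RECORDS_SEEN").getValue();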
The keys are Text and the values are large custom data structures serialized
with Avro.
I also have counters for the job that generates these files that gives me
this information but sometimes...Well, it's a long story. Suffice to say
that it's nice to have a post-hoc method too. :-)
The identi
Joey,
You understood me perfectly well. I see your first suggestion, but I am not
allowed to have gaps. A central service is something I may consider if the
single reducer becomes a worse bottleneck than the service itself.
But what are counters for? They seem to be exactly that.
Mark
On Fri, May 20, 2011 at 12:01 PM,
To make sure I understand you correctly, you need a globally unique
one-up counter for each output record?
If you had an upper bound on the number of records a single reducer
could output and you can afford to have gaps, you could just use the
task id and multiply that by the max number of records per reducer.
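A rough sketch of that scheme (the class name and the MAX_PER_REDUCER bound are
illustrative assumptions, not anything from the job in question):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  public class NumberingReducer extends Reducer<Text, Text, LongWritable, Text> {
    // Upper bound on how many records one reducer may emit; gaps are accepted.
    private static final long MAX_PER_REDUCER = 10000000L;
    private long next;

    @Override
    protected void setup(Context context) {
      // Partition the id space by reduce task id so ids never collide.
      int taskId = context.getTaskAttemptID().getTaskID().getId();
      next = (long) taskId * MAX_PER_REDUCER;
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      for (Text v : values) {
        context.write(new LongWritable(next++), v); // unique across reducers, with gaps
      }
    }
  }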
The cheapest way would be to check the counters as you write them in
the first place and keep a running score. :)
Sent from my mobile. Please excuse the typos.
On 2011-05-20, at 10:35 AM, "W.P. McNeill" wrote:
> I've got a directory with a bunch of MapReduce data in it. I want to know
> how ma
What format is the input data in?
At first glance, I would run an identity mapper and use a
NullOutputFormat so you don't get any data written. The built-in
counters already count the number of key/value pairs read in by the
mappers.
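Roughly like this, as a sketch against the new mapreduce API (the class name,
input path handling, and the counter group string are assumptions; the built-in
group name varies between releases):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

  public class RecordCounter {
    public static void main(String[] args) throws Exception {
      Job job = new Job(new Configuration(), "count-records");
      job.setJarByClass(RecordCounter.class);
      job.setMapperClass(Mapper.class);                  // identity mapper
      job.setNumReduceTasks(0);                          // map-only
      job.setInputFormatClass(SequenceFileInputFormat.class);
      job.setOutputFormatClass(NullOutputFormat.class);  // nothing gets written
      FileInputFormat.addInputPath(job, new Path(args[0]));
      job.waitForCompletion(true);
      // In 0.20-era releases the built-in group is "org.apache.hadoop.mapred.Task$Counter".
      long n = job.getCounters()
                  .findCounter("org.apache.hadoop.mapred.Task$Counter", "MAP_INPUT_RECORDS")
                  .getValue();
      System.out.println("key/value pairs read: " + n);
    }
  }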
-Joey
On Fri, May 20, 2011 at 9:34 AM, W.P. McNeill wrote:
>
Hi, can I use a Counter to give each record in all reducers a consecutive
number? Currently I am using a single Reducer, but it is an anti-pattern.
But I need to assign consecutive numbers to all output records in all
reducers, and it does not matter how, as long as each gets its own number.
If it
I've got a directory with a bunch of MapReduce data in it. I want to know
how many key/value pairs it contains. I could write a mapper-only
process that takes those pairs as input and updates a
counter, but it seems like this utility should already exist. Does it, or
do I have to roll my own?
Bonus question,
I thought it was, because of the FileBytesWritten counter. Thanks for the
clarification.
Mark
On Fri, May 20, 2011 at 4:23 AM, Harsh J wrote:
> Mark,
>
> On Fri, May 20, 2011 at 10:17 AM, Mark question
> wrote:
> > This is puzzling me ...
> >
> > With a mapper producing output of size ~ 400 MB ...
Hi,
I am trying to get JVM metrics from the new version of Hadoop.
I have read the migration instructions and come up with the following
content for hadoop-metrics2.properties:
*.sink.file.class=org.apache.hadoop.metrics2.sink.FileSink
jvm.sink.file.p
On May 20, 2011, at 6:10 AM, Dieter Plaetinck wrote:
> What do you mean clunky?
> IMHO this is a quite elegant, simple, working solution.
Try giving it to a user; watch them feed it a list of 10,000 files; watch the
machine swap to death and the disks uselessly thrash.
> Sure this spawns multi
Hi Praveenesh,
* You can set the maximum number of reducers per node in your mapred-site.xml
using mapred.tasktracker.reduce.tasks.maximum (default set to 2).
* You can set the default number of reduce tasks with mapred.reduce.tasks
(default set to 1, which is what causes your single reducer); see the example after this list.
* Your jo
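For example, in mapred-site.xml (the values are only illustrative):

  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>   <!-- reduce slots per TaskTracker node -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>10</value>  <!-- default number of reduce tasks per job -->
  </property>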
Hi,
My mapred-site.xml is pretty simple:

  <property>
    <name>mapred.job.tracker</name>
    <value>ub13:54311</value>
    <description>The host and port that the MapReduce job tracker runs at. If
    "local", then jobs are run in-process as a single map and reduce task.</description>
  </property>

Where should I put the settings you are describing?
On Fri, May 20, 2011 at 6:
What does your mapred-site.xml file say?
I've used wordcount and had close to 12 reducers running on a 6
datanode cluster on a 3 GB file.
I have a configuration in there which says:
mapred.reduce.tasks = 12
The reason I chose 12 was because it was recommended that I choose 2x the
number of tasktrackers.
I am using the wordcount example that comes along with Hadoop.
How can I configure it to use multiple reducers?
I guess multiple reducers will make it run faster... will it?
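If you can edit the driver, the reducer count can also be set in code; a
minimal sketch (the value 12 is just an example):

  // in the job driver, before submission:
  job.setNumReduceTasks(12);

  // or, if the example parses generic options, on the command line:
  //   hadoop jar hadoop-*-examples.jar wordcount -Dmapred.reduce.tasks=12 <in> <out>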
On Fri, May 20, 2011 at 6:51 PM, James Seigel Tynt wrote:
> The job could be designed to use one reducer
>
>
The job could be designed to use one reducer
On 2011-05-20, at 7:19 AM, praveenesh kumar wrote:
> Hello everyone,
>
> I am using wordcount application to test on my hadoop cluster of 5 nodes.
> The file size is around 5 GB.
> Its taking around 2 min - 40 sec for execution.
> But when I am check
Hello everyone,
I am using the wordcount application to test on my Hadoop cluster of 5 nodes.
The file size is around 5 GB.
It takes around 2 min 40 sec to execute.
But when I check the JobTracker web portal, I see that only one
reducer is running. Why is that?
How can I change the code so
Mark,
On Fri, May 20, 2011 at 10:17 AM, Mark question wrote:
> This is puzzling me ...
>
> With a mapper producing output of size ~ 400 MB ... which one is supposed
> to be faster?
>
> 1) output collector: which will write to local file then copy to HDFS since
> I don't have reducers.
A regula
What do you mean clunky?
IMHO this is a quite elegant, simple, working solution.
Sure, this spawns multiple processes, but it beats any
API over-complication, IMHO.
Dieter
On Wed, 18 May 2011 11:39:36 -0500
Patrick Angeles wrote:
> kinda clunky but you could do this via shell:
>
> for $FILE in