Hi All,
I'm new to Hadoop and have successfully run MapReduce jobs many times on my
small cluster (6 machines).
Now I realize that by default only 1 reducer is assigned to the job, and
with only 1 reducer things go slowly.
I've read some documents and am about to increase the number of reducers.
Hadoop Definit
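For reference, a minimal driver sketch showing where the reducer count goes
(this assumes the 0.20 "new" mapreduce API; the class name, the paths taken
from args, and the count of 12 are only examples, a common starting point
being a small multiple of the cluster's reduce slots):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyJobDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "my job");
    job.setJarByClass(MyJobDriver.class);
    job.setNumReduceTasks(12);   // override the default single reducer
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

If the driver goes through ToolRunner, the same thing can be set per run with
-D mapred.reduce.tasks=12 on the command line.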
On Tue, May 18, 2010 at 2:50 PM, Jones, Nick wrote:
> I'm not familiar with how to use/create them, but shouldn't a HAR (Hadoop
> Archive) work well in this situation? I thought it was designed to collect
> several small files together through another level of indirection to avoid the
> NN load and
I'm not familiar with how to use/create them, but shouldn't a HAR (Hadoop
Archive) work well in this situation? I thought it was designed to collect
several small files together through another level of indirection to avoid the NN
load and decreasing the HDFS block size.
Nick Jones
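For anyone curious, the rough shape of it on 0.20 is something like the
following (the archive name and paths are only examples, and the exact
options vary between releases; running hadoop archive with no arguments
prints the usage for your build):

hadoop archive -archiveName files.har /user/me/small-files /user/me/archives
hadoop fs -ls har:///user/me/archives/files.har

Jobs can then be pointed at the har:// path instead of the original
directory, though as far as I recall each archived file still becomes its own
input split, so this helps NameNode memory more than per-task startup cost.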
That wasn't sarcasm. This is what you do:
- Run your mapreduce job on 30k small files.
- Consolidate your 30k small files into larger files.
- Run mapreduce on the larger files.
- Compare the running time
The difference in runtime is made up by your task startup and seek overhead.
If you want to
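One common way to do the consolidation step, for what it's worth, is to pack
the small files into a single SequenceFile keyed by filename. A rough sketch
(the class name and paths are illustrative, not a polished tool):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path srcDir = new Path(args[0]);   // directory full of small files
    Path dest = new Path(args[1]);     // single output SequenceFile
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, dest, Text.class, BytesWritable.class);
    try {
      for (FileStatus stat : fs.listStatus(srcDir)) {
        if (stat.isDir()) continue;
        byte[] buf = new byte[(int) stat.getLen()];  // files are small by assumption
        FSDataInputStream in = fs.open(stat.getPath());
        try {
          in.readFully(0, buf);
        } finally {
          in.close();
        }
        writer.append(new Text(stat.getPath().getName()), new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}

Jobs can then read the result with SequenceFileInputFormat, so one map task
covers many of the original files.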
Hm. I actually just changed to this version
Erik
On 18 May 2010 15:59, David Howell wrote:
> Are you using Cloudera's hadoop 0.20.2?
>
> There's some logic in bin/hadoop-config.sh that seems to be failing if
> JAVA_HOME isn't set, and it runs before hadoop-env.sh.
>
> If you think it might
Are you using Cloudera's hadoop 0.20.2?
There's some logic in bin/hadoop-config.sh that seems to be failing if
JAVA_HOME isn't set, and it runs before hadoop-env.sh.
If you think it might be the same problem, please weigh in:
http://getsatisfaction.com/cloudera/topics/java_home_setting_in_hadoop
Thanks for the sarcasm, but with 3 small files, and so 3 Mapper
instantiations, even though it's not (and never did I say it was) the only
metric that matters, it seems to me like something very interesting to check
out...
I have a hierarchy above me, and they will be happy to understand my choices
Hey Konstantin,
Interesting paper :)
One thing which I've been kicking around lately is "at what scale does the
file/directory paradigm break down?"
At some point, I think the human mind can no longer comprehend so many files
(certainly, I can barely organize the few thousand files on my lapt
Yes, we recommend at least one local directory and one NFS directory for
dfs.name.dir in production environments. This allows an up-to-date recovery
of NN metadata if the NN should fail. In future versions the BackupNode
functionality will move us one step closer to not needing NFS for production
d
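Concretely, that usually means listing both locations, comma separated,
under dfs.name.dir in hdfs-site.xml; the directory paths below are only
placeholders:

<property>
  <name>dfs.name.dir</name>
  <value>/local/disk1/dfs/name,/mnt/nfs/namenode/dfs/name</value>
</property>

The NameNode writes its image and edit log to every listed directory, so the
NFS copy stays current if the local disk is lost.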
Sorry to hijack but after following this thread, I had a related question to
the secondary location of dfs.name.dir.
Is the approach outlined below the preferred/suggested way to do this? Is this
what people mean when they say, "stick it on NFS"?
Thanks!
On May 17, 2010, at 11:14 PM, Todd Lipco
Stan,
See my comments inline.
Thanks, Hong
On May 18, 2010, at 8:44 AM, stan lee wrote:
Hi Guys,
I am trying to use compression to reduce the IO workload when running
a job, but it failed. I have several questions which need your help.
For lzo compression, I found a guide
http://code.
Hey Scott,
If the node shows up in the dead nodes and the live nodes as you say, it's
definitely not even attempting to be decommissioned. If HDFS was attempting
decommissioning and you restart the namenode, then it would only show up in the
dead nodes list.
Another option is to just turn off
I ran an experiment with a block size of 10 bytes (sic!). This was _very_ slow
on the NN side: writing 5 MB took 25 minutes or so :( No fun, to say the
least...
On Tue, May 18, 2010 at 10:56AM, Konstantin Shvachko wrote:
> You can also get some performance numbers and answers to the blo
You can also get some performance numbers and answers to the block size dilemma
problem here:
http://developer.yahoo.net/blogs/hadoop/2010/05/scalability_of_the_hadoop_dist.html
I remember some people were using Hadoop for storing or streaming videos.
Don't know how well that worked.
It would b
Preserved JobTracker history is already available at /jobhistory.jsp
There is a link at the end of the /jobtracker.jsp page that leads to
this. There's also free analysis to go with that! :)
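There is also a command-line view of roughly the same information, assuming
the job wrote its history into its output directory (the path below is just
an example):

hadoop job -history /user/me/job-output

That prints the job details plus failed and killed task attempts for the run.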
On Tue, May 18, 2010 at 11:00 PM, Alan Miller wrote:
> Hi,
>
> Is there a way to preserve previous job in
Hi stan,
You can do something of this sort if you use FileOutputFormat, from
within your Job Driver:
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
// GzipCodec is from org.apache.hadoop.io.compress,
// and 'job' is the Job object in your driver.
Hi All,
I continually get this error when trying to run start-all.sh for hadoop
0.20.2 on ubuntu. What confuses me is I DO have JAVA_HOME set in
hadoop-env.sh to /usr/lib/jvm/jdk1.6.0_17. I've double checked to see that
JAVA_HOME is set to this by echoing the path before running the start script
b
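For what it's worth, the two usual fixes are an uncommented export near the
top of conf/hadoop-env.sh, and, when a script like bin/hadoop-config.sh runs
before hadoop-env.sh is sourced (as described earlier in this thread for the
CDH build), exporting the variable in the shell that launches the daemons.
The JDK path below is just the one mentioned above:

# in conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_17

# or, as a workaround, in the shell before starting the daemons
export JAVA_HOME=/usr/lib/jvm/jdk1.6.0_17
bin/start-all.sh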
Dfsadmin -report reports the hostname for that machine and not the IP. That
machine happens to be the master node, which is why I am trying to
decommission the data node there, since I only want the data node running on
the slave nodes. Dfsadmin -report reports all the IPs for the slave nodes.
One
Hi Scott,
You might be hitting two different issues.
1) Decommission not finishing.
https://issues.apache.org/jira/browse/HDFS-694 explains decommission
never finishing due to open files in 0.20
2) Nodes showing up both in live and dead nodes.
I remember Suresh taking a look at this.
32-bit liblzo2 isn't needed on 64-bit systems.
On Tue, May 18, 2010 at 8:44 AM, stan lee wrote:
> Hi Guys,
>
> I am trying to use compression to reduce the IO workload when running a job,
> but it failed. I have several questions which need your help.
>
> For lzo compression, I found a guide
>
Hi Guys,
I am trying to use compression to reduce the IO workload when running a job,
but it failed. I have several questions which need your help.
For lzo compression, I found a guide
http://code.google.com/p/hadoop-gpl-compression/wiki/FAQ. Why does it say "Note
that you must have both 32-bit an
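Separately from getting the native lzo libraries installed, the job side of
lzo is usually just two properties. A sketch of a driver fragment using the
0.20 property names, with the codec class coming from the
hadoop-gpl-compression project (the same values can be passed as -D options
instead):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// compress the intermediate map output with LZO
conf.setBoolean("mapred.compress.map.output", true);
conf.set("mapred.map.output.compression.codec",
         "com.hadoop.compression.lzo.LzoCodec");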
Hey Hassan,
1) The overhead is pretty small, measured in a small number of milliseconds on
average
2) HDFS is not designed for "online latency". Even though the average is
small, if something "bad" happens, your clients might experience a lot of
delays while going through the retry stack. The
Thanks PanFeng, do you have a more detailed explanation of this? Is it
calculated by how many reduce tasks have completed each phase?
Also, what's the answer to my second question? Thanks!
On Mon, May 17, 2010 at 12:44 PM, 原攀峰 wrote:
> For a reduce task, the execution is divided into three phases,
This is a very interesting thread to us, as we are thinking about deploying
HDFS as massive online storage for an online university, and then
serving the video files to students who want to view them.
We cannot control the size of the videos (and some class work files), as
they will mostly be
If you know how to use AspectJ to do aspect-oriented programming, you can
write an aspect class and let it monitor the whole process of MapReduce.
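A rough sketch of that idea (this assumes AspectJ load-time weaving is
configured for the task JVMs; the class name and pointcut are illustrative,
not a tested recipe):

import org.aspectj.lang.JoinPoint;
import org.aspectj.lang.annotation.Aspect;
import org.aspectj.lang.annotation.Before;

@Aspect
public class MapReduceMonitor {
  // fires before any map() implementation in a Mapper subclass runs
  @Before("execution(* org.apache.hadoop.mapreduce.Mapper+.map(..))")
  public void beforeMap(JoinPoint jp) {
    System.err.println("entering " + jp.getSignature());
  }
}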
On Tue, May 18, 2010 at 10:00 AM, Patrick Angeles wrote:
> Should be evident in the total job running time... that's the only metric
> that really ma
Should be evident in the total job running time... that's the only metric
that really matters :)
On Tue, May 18, 2010 at 10:39 AM, Pierre ANCELOT wrote:
> Thank you,
> Any way I can measure the startup overhead in terms of time?
>
>
> On Tue, May 18, 2010 at 4:27 PM, Patrick Angeles wrote:
>
>
Thank you,
Any way I can measure the startup overhead in terms of time?
On Tue, May 18, 2010 at 4:27 PM, Patrick Angeles wrote:
> Pierre,
>
> Adding to what Brian has said (some things are not explicitly mentioned in
> the HDFS design doc)...
>
> - If you have small files that take up < 64MB you
Pierre,
Adding to what Brian has said (some things are not explicitly mentioned in
the HDFS design doc)...
- If you have small files that take up < 64MB you do not actually use the
entire 64MB block on disk.
- You *do* use up RAM on the NameNode, as each block represents meta-data
that needs to b
Okay, thank you :)
On Tue, May 18, 2010 at 2:48 PM, Brian Bockelman wrote:
>
> On May 18, 2010, at 7:38 AM, Pierre ANCELOT wrote:
>
> > Hi, thanks for this fast answer :)
> > If so, what do you mean by blocks? If a file has to be split, it will
> be
> > split when larger than 64MB?
> >
>
>
On May 18, 2010, at 7:38 AM, Pierre ANCELOT wrote:
> Hi, thanks for this fast answer :)
> If so, what do you mean by blocks? If a file has to be split, it will be
> split when larger than 64MB?
>
For every 64MB of the file, Hadoop will create a separate block. So, if you
have a 32KB fil
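An easy way to see this on a live cluster is to ask fsck to print the blocks
behind a particular file (the path below is only an example):

hadoop fsck /user/me/some-small-file.txt -files -blocks -locations

For a 32KB file it reports a single block whose length is 32KB, not 64MB.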
Hey Scott,
Hadoop tends to get confused by nodes with multiple hostnames or multiple IP
addresses. Is this your case?
I can't remember precisely what our admin does, but I think he puts the IP
address which Hadoop listens on in the exclude-hosts file.
Look in the output of
hadoop dfsadmi
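For reference, the usual sequence is roughly: point dfs.hosts.exclude in
hdfs-site.xml at an excludes file, add the node's name (or the IP Hadoop
actually listens on, as above) to that file, and then tell the NameNode to
re-read it. The file location below is just an example.

In hdfs-site.xml:

<property>
  <name>dfs.hosts.exclude</name>
  <value>/etc/hadoop/conf/dfs.exclude</value>
</property>

Then, after adding the host to that file:

hadoop dfsadmin -refreshNodes

The node should then show up as decommissioning in dfsadmin -report and on
the web UI until its blocks have been re-replicated.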
... and by slices of 64MB then I mean...
?
On Tue, May 18, 2010 at 2:38 PM, Pierre ANCELOT wrote:
> Hi, thanks for this fast answer :)
> If so, what do you mean by blocks? If a file has to be split, it will be
> split when larger than 64MB?
>
>
>
>
>
> On Tue, May 18, 2010 at 2:34 PM, Bria
Hi, thanks for this fast answer :)
If so, what do you mean by blocks? If a file has to be split, it will be
split when larger than 64MB?
On Tue, May 18, 2010 at 2:34 PM, Brian Bockelman wrote:
> Hey Pierre,
>
> These are not traditional filesystem blocks - if you save a file smaller
> th
Hey Pierre,
These are not traditional filesystem blocks - if you save a file smaller than
64MB, you don't lose 64MB of file space.
Hadoop will use 32KB to store a 32KB file (ok, plus a KB of metadata or so),
not 64MB.
Brian
On May 18, 2010, at 7:06 AM, Pierre ANCELOT wrote:
> Hi,
> I'm port
Hi,
I'm porting a legacy application to Hadoop and it uses a bunch of small
files.
I'm aware that having such small files isn't a good idea, but I'm not the one
making the technical decisions, and the port has to be done for yesterday...
Of course such small files are a problem, loading 64MB blocks for a few
Hi all,
I've picked up where Johan left off with the HUGUK meetups and the next one is
planned for June 3rd. The main talks will be:
“Introduction to Sqoop” by Aaron Kimball (Cloudera)
“Hive at Last.fm” by Tim Sell (Last.fm)
More details are available at: http://dumbotics.com/2010/05/18/huguk-4