Using HOD to manage a production cluster

2009-01-31 Thread Edward Capriolo
I am looking at using HOD (Hadoop On Demand) to manage a production
cluster. After reading the documentation, it seems that HOD is missing
some things that would need to be set carefully in a production
cluster.

Rack locality:
HOD uses the -N 5 option and starts a cluster of N nodes. There seems
to be no way to pass specific options to the nodes individually. How can I
make sure the set of servers selected will end up in
different racks?

Datanode blacklist/whitelist:
These are listed in a file; can that file be generated?

hadoop-env:
Can I set these options from HOD, or do I have to build them into
the Hadoop tar?

JMX settings:
Can I set these options from HOD, or do I have to build them into
the Hadoop tar?

Upgrade with non-symmetric configurations:

Old servers > /mnt/drive1  /mnt/drive2
New Servers > /mnt/drive1  /mnt/drive2  /mnt/drive3

Can HOD ship out different configuration files to different nodes? As
new nodes join the cluster during an upgrade, they may have
different configurations than the old ones.

From reading the docs it seems like HOD is great for building on-demand
clusters, but may not be ideal for managing a single permanent,
long-term cluster.

Accidental cluster destruction:
It sounds silly, but could the wrong command take out a cluster in one
swipe? It would be good to be able to block that.

Any thoughts?
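Regarding the hadoop-env and JMX questions above: the HOD documentation
describes pushing hadoop-site parameters to the provisioned daemons at
allocation time, though it is less clear whether hadoop-env.sh settings or JVM
options can be handled the same way. A rough sketch, assuming the -M (mapred
server params) and -H (HDFS server params) shorthands behave as documented,
with purely illustrative values:

hod allocate -d ~/clusters/prod -n 5 \
    -Mmapred.reduce.parallel.copies=20 \
    -Hdfs.datanode.du.reserved=1073741824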


HDFS Namenode Heap Size woes

2009-01-31 Thread Sean Knapp
I'm running 0.19.0 on a 10-node cluster (8 cores, 16GB RAM, 4x1.5TB). The
current status of my FS is approximately 1 million files and directories,
950k blocks, and a heap size of 7GB (16GB reserved). Average block replication
is 3.8. I'm concerned that the heap size is steadily climbing... a 7GB heap
is substantially higher per file than what I have on a similar 0.18.2 cluster,
which has closer to a 1GB heap.
My typical usage model is 1) write a number of small files into HDFS (tens
or hundreds of thousands at a time), 2) archive those files, 3) delete the
originals. I've tried dropping the replication factor of the _index and
_masterindex files without much effect on overall heap size. While I had
trash enabled at one point, I've since disabled it and deleted the .Trash
folders.

On namenode startup, I get a massive number of the following lines in my log
file:
2009-01-31 21:41:23,283 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.processReport: block blk_-2389330910609345428_7332878 on
172.16.129.33:50010 size 798080 does not belong to any file.
2009-01-31 21:41:23,283 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.addToInvalidates: blk_-2389330910609345428 is added to invalidSet
of 172.16.129.33:50010

I suspect the original files may be left behind and causing the heap size
bloat. Is there any accounting mechanism to determine what is contributing
to my heap size?

Thanks,
Sean
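One rough way to get a first-order accounting, assuming shell access to the
namenode host (the PID is a placeholder, and the exact class names vary by
version):

bin/hadoop fsck / | tail -20                  # summary: total files, dirs, blocks, average block replication
jmap -histo:live <namenode_pid> | head -30    # which classes (INode*, BlockInfo, ...) dominate the live heap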


Hadoop Requested to be on RCE

2009-01-31 Thread Brock Palen

Hello!
Sorry for polluting the list.

I am an HPC administrator, and on the side I have started an HPC podcast,
www.rce-cast.com

I am extending an offer to have Hadoop be one of the next shows.
If you are interested, please contact me off-list; the interview is over Skype
and takes between 30 and 60 minutes.


If you have any questions please let me know!
I hope to speak with you soon!


Brock Palen
www.umich.edu/~brockp
Center for Advanced Computing
bro...@umich.edu
(734)936-1985





Re: Finding longest path in a graph

2009-01-31 Thread Andrzej Bialecki

Mork0075 wrote:

In general you can find shortest paths with Dijkstra's algorithm in
O(m + n log n), where m is the number of edges and n the number of nodes.
If you modify Dijkstra you should also be able to calculate longest
paths (instead of checking whether the new path is shorter than the old one,
you check whether it is longer). As a data structure you need a priority
queue which returns the node with the currently longest distance instead of
the shortest.

For acyclic graphs there is another algorithm, which first computes a
topological ordering of all nodes. The topological sort also lets you
check whether there are cycles; see cycle detection and DFS (depth-first
search). After that you process the nodes in topological order, check
the distance of each node's children, and choose the
maximum over all children. This can be done in O(m + n).

Perhaps this helps.


Thank you, I think this confirms my understanding of the problem.

Currently the implementation I have executes in N + 2 steps, and if 
there are cycles with diameter less than N, the vertices that are part 
of the cycle get a distance metric of m*N, which helps to detect cycles 
(because all valid paths have a metric of at most N). Based on the 
specific nature of the graph (URLs) I break the cycles by selecting the 
shortest URL as the start, and removing the edge from the last 
predecessor, which then becomes the end.



--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Finding longest path in a graph

2009-01-31 Thread Mork0075
In general you can find shortest paths with Dijkstra's algorithm in
O(m + n log n), where m is the number of edges and n the number of nodes.
If you modify Dijkstra you should also be able to calculate longest
paths (instead of checking whether the new path is shorter than the old one,
you check whether it is longer). As a data structure you need a priority
queue which returns the node with the currently longest distance instead of
the shortest.

For acyclic graphs there is another algorithm, which first computes a
topological ordering of all nodes. The topological sort also lets you
check whether there are cycles; see cycle detection and DFS (depth-first
search). After that you process the nodes in topological order, check
the distance of each node's children, and choose the
maximum over all children. This can be done in O(m + n).

Perhaps this helps.
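As a rough single-machine illustration of the acyclic case (plain Java rather
than map-reduce, and all names below are made up for the example): compute a
topological order with Kahn's algorithm, then relax each edge keeping the
maximum distance. Cycle detection comes for free, since the sort cannot
consume every node when a cycle exists.

import java.util.*;

public class DagLongestPath {

    // Longest-path length (in edges) ending at each vertex of a DAG:
    // Kahn's topological sort, then relax every edge keeping the maximum.
    // Throws if the graph has a cycle (the sort cannot consume every node).
    static Map<String, Integer> longest(Map<String, List<String>> adj) {
        Map<String, Integer> indegree = new HashMap<>();
        for (String v : adj.keySet()) indegree.putIfAbsent(v, 0);
        for (List<String> targets : adj.values())
            for (String w : targets) indegree.merge(w, 1, Integer::sum);

        Deque<String> ready = new ArrayDeque<>();
        Map<String, Integer> dist = new HashMap<>();
        for (Map.Entry<String, Integer> e : indegree.entrySet()) {
            dist.put(e.getKey(), 0);
            if (e.getValue() == 0) ready.add(e.getKey());
        }

        int processed = 0;
        while (!ready.isEmpty()) {
            String v = ready.poll();
            processed++;
            for (String w : adj.getOrDefault(v, Collections.<String>emptyList())) {
                if (dist.get(v) + 1 > dist.get(w)) dist.put(w, dist.get(v) + 1);
                if (indegree.merge(w, -1, Integer::sum) == 0) ready.add(w);
            }
        }
        if (processed < indegree.size())
            throw new IllegalStateException("graph has a cycle");
        return dist;  // vertex -> length of the longest path ending at it
    }

    public static void main(String[] args) {
        Map<String, List<String>> g = new HashMap<>();
        g.put("a", Arrays.asList("b", "c"));
        g.put("b", Arrays.asList("c"));
        g.put("c", Collections.<String>emptyList());
        System.out.println(longest(g));  // {a=0, b=1, c=2} (map order may vary)
    }
}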

Andrzej Bialecki schrieb:
> Hi,
> 
> I'm looking for advice. I need to process a directed graph encoded as
> a list of <source, target> pairs. The goal is to compute a list of the longest
> paths in the graph. There is no guarantee that the graph is acyclic, so
> there should be some mechanism to detect cycles.
> 
> Currently I'm using a simple approach consisting of the following: I
> encode the graph as <vertex, list of adjacent vertices> records, and
> extend the paths by one degree at a time. This means that in order to
> find the longest path of degree N it takes N + 1 map-reduce jobs.
> 
> Are you perhaps aware of a smarter way to do it? I would appreciate any
> pointers.
> 



Re: settin JAVA_HOME...

2009-01-31 Thread haizhou zhao
hi Sandy,
Every time I change the conf, I have to do the following two things:
1. kill all Hadoop processes
2. manually delete all the files under hadoop.tmp.dir
to make sure Hadoop runs correctly; otherwise it won't work.

Is this caused by my using another JDK instead of Sun Java? And what do you mean
by "sun-java", please?
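As a general pointer for the JAVA_HOME part of this thread: one way to resolve
the symlink chain down to the real JDK directory (assuming GNU readlink is
available; the path shown is only an example of what might come back):

readlink -f $(which java)
# e.g. /usr/lib/jvm/java-6-sun-1.6.0.07/jre/bin/java
# JAVA_HOME should then be the JDK directory (e.g. /usr/lib/jvm/java-6-sun on
# Ubuntu), not the java binary itself.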

2009/1/31 Sandy 

> Hi Zander,
>
> Do not use that JDK. Horrific things happen. You must use Sun Java in order to
> use Hadoop.
>
> There are packages for Sun Java in the Ubuntu repositories. You can download
> these directly using apt-get. This will install Java 6 on your system.
> Your JAVA_HOME line in hadoop-env.sh should look like:
> export JAVA_HOME=/usr/lib/jvm/java-6-sun
>
> Also, on the wiki, there is a guide for installing Hadoop on Ubuntu
> systems.
> I think you may find this helpful.
>
> http://wiki.apache.org/hadoop/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)
>
> All the best!
> -SM
>
>
> On Fri, Jan 30, 2009 at 4:33 PM, zander1013  wrote:
>
> >
> > i am installing "default-jdk" now. perhaps that was the problem. is this
> > the
> > right jdk?
> >
> >
> >
> > zander1013 wrote:
> > >
> > > cool!
> > >
> > > here is the output for those commands...
> > >
> > > a...@node0:~/Hadoop/hadoop-0.19.0$ which java
> > > /usr/bin/java
> > > a...@node0:~/Hadoop/hadoop-0.19.0$
> > > a...@node0:~/Hadoop/hadoop-0.19.0$ ls -l /usr/bin/java
> > > lrwxrwxrwx 1 root root 22 2009-01-29 18:03 /usr/bin/java ->
> > > /etc/alternatives/java
> > > a...@node0:~/Hadoop/hadoop-0.19.0$
> > >
> > > ... i will try and set JAVA_HOME=/etc/alternatives/java...
> > >
> > > thank you for helping...
> > >
> > > -zander
> > >
> > >
> > > Mark Kerzner-2 wrote:
> > >>
> > >> Oh, you have used my path to the JDK; you need yours.
> > >> Do this:
> > >>
> > >> which java
> > >> something like /usr/bin/java will come back
> > >>
> > >> then do
> > >> ls -l /usr/bin/java
> > >>
> > >> it will tell you where the link points. There may be more redirections;
> > >> get
> > >> the real path to your JDK.
> > >>
> > >> On Fri, Jan 30, 2009 at 4:09 PM, zander1013 
> > wrote:
> > >>
> > >>>
> > >>> okay,
> > >>>
> > >>> here is the section for conf/hadoop-env.sh...
> > >>>
> > >>> # Set Hadoop-specific environment variables here.
> > >>>
> > >>> # The only required environment variable is JAVA_HOME.  All others are
> > >>> # optional.  When running a distributed configuration it is best to
> > >>> # set JAVA_HOME in this file, so that it is correctly defined on
> > >>> # remote nodes.
> > >>>
> > >>> # The java implementation to use.  Required.
> > >>> # export JAVA_HOME=/usr/lib/j2sdk1.5-sun
> > >>> export JAVA_HOME=/usr/lib/jvm/default-java
> > >>>
> > >>> ...
> > >>>
> > >>> and here is what i got for output. i am trying to go through the
> > >>> tutorial
> > >>> at
> > >>> http://hadoop.apache.org/core/docs/current/quickstart.html
> > >>>
> > >>> here is the output...
> > >>>
> > >>> a...@node0:~/Hadoop/hadoop-0.19.0$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
> > >>> bin/hadoop: line 243: /usr/lib/jvm/default-java/bin/java: No such file or directory
> > >>> bin/hadoop: line 273: /usr/lib/jvm/default-java/bin/java: No such file or directory
> > >>> bin/hadoop: line 273: exec: /usr/lib/jvm/default-java/bin/java: cannot execute: No such file or directory
> > >>> a...@node0:~/Hadoop/hadoop-0.19.0$
> > >>>
> > >>> ...
> > >>>
> > >>> please advise...
> > >>>
> > >>>
> > >>>
> > >>>
> > >>> Mark Kerzner-2 wrote:
> > >>> >
> > >>> > You set it in the conf/hadoop-env.sh file, with an entry like this
> > >>> > export JAVA_HOME=/usr/lib/jvm/default-java
> > >>> >
> > >>> > Mark
> > >>> >
> > >>> > On Fri, Jan 30, 2009 at 3:49 PM, zander1013 
> > >>> wrote:
> > >>> >
> > >>> >>
> > >>> >> hi,
> > >>> >>
> > >>> >> i am new to hadoop. i am trying to set it up for the first time as a
> > >>> >> single node cluster. at present the snag is that i cannot seem to find
> > >>> >> the correct path for setting the JAVA_HOME variable.
> > >>> >>
> > >>> >> i am using ubuntu 8.10. i have tried using "whereis java" and tried
> > >>> >> setting the variable to point to those places (except the dir where i have
> > >>> >> hadoop).
> > >>> >>
> > >>> >> please advise.
> > >>> >>
> > >>> >> -zander
> > >>> >> --
> > >>> >> View this message in context:
> > >>> >> http://www.nabble.com/settin-JAVA_HOME...-tp21756240p21756240.html
> > >>> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> > >>> >>
> > >>> >>
> > >>> >
> > >>> >
> > >>>
> > >>> --
> > >>> View this message in context:
> > >>> http://www.nabble.com/settin-JAVA_HOME...-tp21756240p21756569.html
> > >>> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> > >>>
> > >>>
> > >>
> > >>
> > >
> > >
> >
> > --
> > View this message in context:
> > ht