On Tue, Jun 9, 2009 at 11:59 AM, Steve Loughran ste...@apache.org wrote:
John Martyniak wrote:
When I run either of those on either of the two machines, it is trying to
resolve against the DNS servers configured for the external addresses for
the box.
Here is the result
Server:
Also, if you are using a topology rack map, make sure your script
responds correctly to every possible hostname or IP address as well.
On Tue, Jun 9, 2009 at 1:19 PM, John Martyniak j...@avum.com wrote:
It seems that this is the issue, as there are several posts related to the same
topic but with no
On Fri, Jun 5, 2009 at 10:10 AM, Brian Bockelman bbock...@cse.unl.edu wrote:
Hey Anthony,
Look into hooking your Hadoop system into Ganglia; this produces about 20
real-time statistics per node.
Hadoop also does JMX, which hooks into more enterprise-y monitoring
systems.
Brian
On Jun 5,
On Mon, May 25, 2009 at 6:34 AM, Stas Oskin stas.os...@gmail.com wrote:
Hi.
Ok, was too eager to report :).
It got sorted out after some time.
Regards.
2009/5/25 Stas Oskin stas.os...@gmail.com
Hi.
I just did an erase of large test folder with about 20,000 blocks, and
created a new
Pankil,
I used to be very confused by Hadoop and SSH keys. SSH is NOT
required. Each component can be started by hand. This gem of knowledge
is hidden away in the hundreds of DIGG-style articles entitled 'HOW TO
RUN A HADOOP MULTI-MASTER CLUSTER!'
The SSH keys are only required by the shell scripts (start-all.sh and
friends), which ssh into each node to launch the daemons for you.
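As a sketch, starting each daemon by hand looks like this (paths assume a stock tarball layout of that era; adjust to your install, and no SSH keys are involved):

```shell
# on the master node:
bin/hadoop-daemon.sh start namenode
bin/hadoop-daemon.sh start jobtracker
# run locally on each slave node:
bin/hadoop-daemon.sh start datanode
bin/hadoop-daemon.sh start tasktracker
```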
Do not forget 'tune2fs -m 2'. By default this value gets set at 5%.
With 1 TB disks we got 33 GB more usable space. Talk about instant
savings!
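For the curious, a rough sketch of the commands and the arithmetic (the device name is hypothetical):

```shell
# check the current root-reserved percentage (ext3 defaults to 5%):
#   tune2fs -l /dev/sdb1 | grep -i reserved
# lower it to 2% on a data-only disk:
#   tune2fs -m 2 /dev/sdb1
# space reclaimed on a 1 TB disk: (5% - 2%) of 10^12 bytes, in GB
echo $(( 3 * 1000000000000 / 100 / 1000000000 ))
```

(The 33 GB figure above corresponds to a 1 TiB filesystem.)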
On Mon, May 18, 2009 at 1:31 PM, Alex Loddengaard a...@cloudera.com wrote:
I believe Yahoo! uses ext3, though I know other people have said that XFS
On Fri, May 15, 2009 at 5:05 PM, Aaron Kimball aa...@cloudera.com wrote:
Hi all,
For the database import tool I'm writing (Sqoop; HADOOP-5815), in addition
to uploading data into HDFS and using MapReduce to load/transform the data,
I'd like to integrate more closely with Hive. Specifically,
On Mon, May 11, 2009 at 12:08 PM, Todd Lipcon t...@cloudera.com wrote:
In addition to Jason's suggestion, you could also see about setting some of
Hadoop's directories to subdirs of /dev/shm. If the dataset is really small,
it should be easy to re-load it onto the cluster if it's lost, so even
2009/5/7 Jeff Hammerbacher ham...@cloudera.com:
Hey,
You can read more about why small files are difficult for HDFS at
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.
Regards,
Jeff
2009/5/7 Piotr Praczyk piotr.prac...@gmail.com
If you want to use many small files, they
For those of you that would like to graph the hadoop JMX variables
with cacti I have created cacti templates and data input scripts.
Currently the package gathers and graphs the following information
from the NameNode:
Blocks Total
Files Total
Capacity Used/Capacity Free
Live Data Nodes/Dead Data Nodes
'Cloud computing' is a hot term. According to the definition provided
by Wikipedia (http://en.wikipedia.org/wiki/Cloud_computing),
Hadoop+HBase+Lucene+ZooKeeper fits some of the criteria, but not well.
Hadoop is scalable; with HOD it is dynamically scalable.
I do not think
You can also pull these variables from the namenode and datanode with
JMX. I am doing this to graph them with cacti. Both the JMX READ/WRITE
and READ users can access these variables.
On Tue, Apr 28, 2009 at 8:29 AM, Stas Oskin stas.os...@gmail.com wrote:
Hi.
Any idea if the getDiskStatus()
On Wed, Apr 29, 2009 at 10:19 AM, Stefan Podkowinski spo...@gmail.com wrote:
If you have trouble loading your data into MySQL using INSERTs or LOAD
DATA, consider that MySQL supports CSV directly using the CSV storage
engine. The only thing you have to do is copy your Hadoop-produced
CSV
On Wed, Apr 29, 2009 at 2:48 PM, Todd Lipcon t...@cloudera.com wrote:
On Wed, Apr 29, 2009 at 7:19 AM, Stefan Podkowinski spo...@gmail.com wrote:
If you have trouble loading your data into MySQL using INSERTs or LOAD
DATA, consider that MySQL supports CSV directly using the CSV storage
engine.
On Sat, Apr 18, 2009 at 10:59 AM, Edward Capriolo
edlinuxg...@gmail.com
wrote:
I jumped into Hadoop at the 'deep end'. I know pig, hive, and hbase
support the ability to max(). I am writing my own max() over a simple
one column dataset.
The best solution I came up with was using
, 2009 at 9:28 AM, Farhan Husain russ...@gmail.com
wrote:
How do you identify that the map task is ending within the map method? Is
it possible to know which is the last call to the map method?
On Sat, Apr 18, 2009 at 10:59 AM, Edward Capriolo
edlinuxg...@gmail.com
wrote:
I jumped
I jumped into Hadoop at the 'deep end'. I know pig, hive, and hbase
support the ability to max(). I am writing my own max() over a simple
one column dataset.
The best solution I came up with was using MapRunner. With MapRunner I
can store the highest value in a private member variable. I can read
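The keep-a-running-maximum idea itself can be sketched outside Hadoop with a streaming-style one-liner; awk here stands in for the MapRunner loop, emitting only once after the last record:

```shell
# one value per input line; track the max across the whole stream and
# print it once at the end, the way MapRunner lets you emit per task
printf '3\n17\n9\n' | awk 'NR == 1 || $1 > m { m = $1 } END { print m }'
```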
but does Sun's Lustre follow in the steps of Gluster then?
Yes. IMHO GlusterFS advertises benchmarks vs Lustre.
The main difference is that GlusterFS is a FUSE (userspace) filesystem,
while Lustre has to be patched into the kernel or loaded as a module.
I use linux-vserver http://linux-vserver.org/
The Linux-VServer technology is a soft partitioning concept based on
Security Contexts which permits the creation of many independent
Virtual Private Servers (VPS) that run simultaneously on a single
physical server at full speed, efficiently sharing
It is a little more natural to connect to HDFS from apache tomcat.
This will allow you to skip the FUSE mounts and just use the HDFS-API.
I have modified this code to run inside tomcat.
http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample
I will not testify to how well this setup will perform
On Wed, Feb 25, 2009 at 1:13 PM, Mikhail Yakshin
greycat.na@gmail.com wrote:
Hi,
Is anyone using Hadoop as more of a near/almost real-time processing
of log data for their systems to aggregate stats, etc?
We do, although 'near real-time' is a pretty relative term and your
mileage may vary.
Yeah, but what's the point of using Hadoop then? i.e. we lost all the
parallelism?
Some jobs do not need it. For example, I am working with the Hive sub-
project. If I have a table that is smaller than my block size, having a
large number of mappers or reducers is counterproductive. Hadoop will
We have a MR program that collects once for each token on a line. What
types of applications can benefit from batch mapping?
I am working to graph the hadoop JMX variables.
http://hadoop.apache.org/core/docs/r0.17.0/api/org/apache/hadoop/dfs/namenode/metrics/NameNodeStatistics.html
I have two nodes, one running 0.17 and the other running 0.19.
The NameNode JMX objects and attributes seem to be working well. I am
One thing to mention is that 'limit' is not standard SQL. Microsoft SQL
Server uses SELECT TOP 100 * FROM table. Some RDBMSs may not support
any such syntax. To be more SQL-compliant you should use some column
like an auto-increment ID or a DATE column as an offset. It is tricky to write
anything truly database-independent.
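A minimal sketch of id-based paging, with sqlite3 standing in for the RDBMS and a hypothetical table; tracking the last-seen id replaces LIMIT/OFFSET entirely:

```shell
sqlite3 :memory: <<'SQL'
CREATE TABLE logs (id INTEGER PRIMARY KEY, msg TEXT);
INSERT INTO logs (msg) VALUES ('a'), ('b'), ('c'), ('d');
-- fetch the "page" after the last-seen id (2), two rows wide
SELECT msg FROM logs WHERE id > 2 AND id <= 4 ORDER BY id;
SQL
```

(This assumes a dense, monotonically increasing id; with gaps you would page on `id > last_seen` plus whatever row-count mechanism the engine offers.)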
I am looking at using HOD (Hadoop On Demand) to manage a production
cluster. After reading the documentation, it seems that HOD is missing
some things that would need to be carefully set in a production
cluster.
Rack Locality:
HOD uses the -N 5 option and starts a cluster of N nodes. There seems
Zeroconf is more focused on simplicity than security. One of the
original problems, which may have been fixed by now, is that any
program can announce any service, e.g. my laptop can announce that it
is the DNS server for google.com.
I want to mention a related topic to the list. People are approaching
the
On Sun, Jan 25, 2009 at 10:57 AM, vinayak katkar vinaykat...@gmail.com wrote:
Does anyone know of a NetBeans or Eclipse plugin for Hadoop Map-Reduce jobs? I want
to make a plugin for NetBeans.
http://vinayakkatkar.wordpress.com
--
Vinayak Katkar
Sun Campus Ambassador
Sun Microsystems, India
COEP
I am looking to create some RA scripts and experiment with starting
hadoop via linux-ha cluster manager. Linux HA would handle restarting
downed nodes and eliminate the ssh key dependency.
Also be careful when you do this. If you are running map/reduce on a
large file, the map and reduce operations will be called many times,
and you can end up with a lot of output. Use log4j instead.
Also it might be useful to strongly word hadoop-default.conf, as many
people might not know a downside exists for using 2 rather than 3 as
the replication factor. Before reading this thread I would have
thought 2 to be sufficient.
Is anyone working on a JDBC RecordReader/InputFormat? I was thinking
this would be very useful for sending data into mappers. Writing data
to a relational database might be more application-dependent, but still
possible.
All,
I always run iptables on my systems. Most of the Hadoop setup guides I
have found skip iptables/firewall configuration. My namenode and task
tracker are on the same node. My current configuration is not working:
when I submit jobs from the namenode, jobs are kicked off on the slave nodes
but they
We just setup a log4j server. This takes the logs off the cluster.
Plus you get all the benefits of log4j
http://timarcher.com/?q=node/10
All I have to say is wow! I never tried jconsole before. I have
hadoop_trunk checked out and the JMX has all kinds of great
information. I am going to look at how I can get JMX/cacti/and hadoop
working together.
Just as an FYI there are separate ENV variables for each now. If you
override
I came up with my line of thinking after reading this article:
http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data
As a guy who was intrigued by the Java coffee cup in '95 and who now
lives as a data center/NOC jock/Unix guy: let's say I look at a log
I had downloaded Thrift and ran the example applications after the
Hive meetup. It is very cool stuff. The thriftfs interface is more
elegant than what I was trying to do, and that implementation is more
complete.
Still, someone might be interested in what I did if they want a
super-light API :)
I was checking out this slide show:
http://www.slideshare.net/jhammerb/2008-ur-tech-talk-zshao-presentation/
In the diagram a Web UI exists. This is the first I have heard of
it. Is this part of, or planned to be part of, contrib/hive? I think
a web interface for showing table schema and
The simple way would be to use NRPE and check_procs. I have never
tested it, but a command like 'ps -ef | grep java | grep NameNode' would
be a fairly decent check. That is not very robust, but it should let
you know if the process is alive.
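The grep logic can be exercised against a canned process line (the pid and class path below are made up; a real check would pipe live `ps -ef` output):

```shell
# a stand-in 'ps -ef' line for a running NameNode
line='hadoop 4242 1 0 10:01 ? 00:00:07 java org.apache.hadoop.dfs.NameNode'
echo "$line" | grep java | grep -q NameNode && echo OK || echo CRITICAL
```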
You could also monitor the web interfaces associated with
That all sounds good. By 'quick hack' I meant that 'check_tcp' was not
good enough, because an open TCP socket does not prove much. However,
if the page returns useful attributes that show the cluster is alive, that
is great and easy.
Come to think of it you can navigate the dfshealth page and get useful
You bring up some valid points. This would be a great topic for a
white paper. The first line of defense should be to apply inbound and
outbound iptables rules. Only source IPs that have a direct need to
interact with the cluster should be allowed to. The same is true with
the web access. Only a
I am doing a lot of testing with Hive, I will be sure to add this
information to the wiki once I get it going.
Thus far I downloaded the same version of Derby that Hive uses. I have
verified that the connection is up and running.
ij version 10.4
ij connect
<name>hive.metastore.local</name>
<value>true</value>
Why would I set this property to true? My goal is to store the meta
data in an external database. If I set this to true, the metastore
database is created in the working directory.
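One plausible hive-site.xml combination looks like the fragment below (a sketch: host, port, and database name are placeholders, and hive.metastore.local refers to the metastore service running in-process, not to where the database lives):

```xml
<property>
  <name>hive.metastore.local</name>
  <value>true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby://metastore-host:1527/metastore_db;create=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.ClientDriver</value>
</property>
```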
I determined the problem once I set the log4j properties to debug.
derbyclient.jar and derbytools.jar do not ship with Hive. As a result,
when you try to load org.apache.derby.jdbc.ClientDriver you get an
InvocationTargetException.
The solution for this was to download Derby and place those files
I have never tried this method. The concept came from a research paper
I ran into. The goal was to detect the language of a piece of text by
looking at several factors: average length of a word, average length of a
sentence, average number of vowels in a word, etc. He used these to
score an article,
wait and sleep are not what you are looking for. You can use 'nohup'
to run a job in the background and have its output piped to a file.
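A sketch of the shape (the hadoop command in the comment is hypothetical; 'sleep' stands in for a long-running job):

```shell
# real-world shape: nohup bin/hadoop jar myjob.jar input/ output/ > job.log 2>&1 &
nohup sleep 1 > job.log 2>&1 &
echo "detached as pid $!"
```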
On Tue, Jun 10, 2008 at 5:48 PM, Meng Mao [EMAIL PROTECTED] wrote:
I'm interested in the same thing -- is there a recommended way to batch
Hadoop jobs
I once asked a wise man in charge of a rather large multi-datacenter
service, "Have you ever considered virtualization?" He replied, "All
the CPUs here are pegged at 100%."
There may be applications for this type of processing. I have thought
about systems like this from time to time. This thinking