Re: Multiple NIC Cards

2009-06-09 Thread Edward Capriolo
On Tue, Jun 9, 2009 at 11:59 AM, Steve Loughran ste...@apache.org wrote: John Martyniak wrote: When I run either of those on either of the two machines, it is trying to resolve against the DNS servers configured for the external addresses for the box. Here is the result: Server:
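
A quick way to see what Java, and therefore Hadoop, resolves for the local machine is a few lines of plain JDK code; this is only a diagnostic sketch with no Hadoop dependencies:

    import java.net.InetAddress;

    public class ResolveCheck {
      public static void main(String[] args) throws Exception {
        InetAddress local = InetAddress.getLocalHost();
        // On a multi-NIC box these can disagree with what you expect
        System.out.println("hostname:  " + local.getHostName());
        System.out.println("canonical: " + local.getCanonicalHostName());
        System.out.println("address:   " + local.getHostAddress());
      }
    }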

Re: Multiple NIC Cards

2009-06-09 Thread Edward Capriolo
Also if you are using a topology rack map, make sure your script responds correctly to every possible hostname or IP address as well. On Tue, Jun 9, 2009 at 1:19 PM, John Martyniak j...@avum.com wrote: It seems that this is the issue, as there are several posts related to the same topic but with no
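
To illustrate the point, a minimal sketch of a custom rack mapping via Hadoop's DNSToSwitchMapping interface; the subnet and rack names are made up. The same rule applies to script-based mappings: every form a node may be reported as (short name, FQDN, or raw IP) must map to a rack.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.net.DNSToSwitchMapping;

    public class SimpleRackMapping implements DNSToSwitchMapping {
      public List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<String>();
        for (String name : names) {
          // name may arrive as "node1", "node1.example.com", or "10.0.1.5";
          // all three must land on the same rack
          if (name.startsWith("10.0.1.") || name.startsWith("node1")) {
            racks.add("/dc1/rack1");
          } else {
            racks.add("/default-rack"); // never return null entries
          }
        }
        return racks;
      }
    }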

Re: Monitoring hadoop?

2009-06-05 Thread Edward Capriolo
On Fri, Jun 5, 2009 at 10:10 AM, Brian Bockelman bbock...@cse.unl.edu wrote: Hey Anthony, Look into hooking your Hadoop system into Ganglia; this produces about 20 real-time statistics per node. Hadoop also does JMX, which hooks into more enterprise-y monitoring systems. Brian On Jun 5,

Re: Blocks amount is stuck in statistics

2009-05-25 Thread Edward Capriolo
On Mon, May 25, 2009 at 6:34 AM, Stas Oskin stas.os...@gmail.com wrote: Hi. Ok, was too eager to report :). It got sorted out after some time. Regards. 2009/5/25 Stas Oskin stas.os...@gmail.com Hi. I just did an erase of large test folder with about 20,000 blocks, and created a new

Re: ssh issues

2009-05-22 Thread Edward Capriolo
Pankil, I used to be very confused by hadoop and SSH keys. SSH is NOT required; each component can be started by hand. This gem of knowledge is hidden away in the hundreds of DIGG-style articles entitled 'HOW TO RUN A HADOOP MULTI-MASTER CLUSTER!' The SSH keys are only required by the shell

Re: Optimal Filesystem (and Settings) for HDFS

2009-05-18 Thread Edward Capriolo
Do not forget 'tune2fs -m 2'. By default this value gets set at 5%. With 1 TB disks we got 33 GB more usable space. Talk about instant savings! On Mon, May 18, 2009 at 1:31 PM, Alex Loddengaard a...@cloudera.com wrote: I believe Yahoo! uses ext3, though I know other people have said that XFS
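
As a sanity check on that figure: dropping the reserved fraction from 5% to 2% returns 3% of the filesystem to users, and 3% of a 1 TB disk is roughly 30 GB, in line with the ~33 GB quoted (the exact number depends on formatted capacity).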

Re: Linking against Hive in Hadoop development tree

2009-05-15 Thread Edward Capriolo
On Fri, May 15, 2009 at 5:05 PM, Aaron Kimball aa...@cloudera.com wrote: Hi all, For the database import tool I'm writing (Sqoop; HADOOP-5815), in addition to uploading data into HDFS and using MapReduce to load/transform the data, I'd like to integrate more closely with Hive. Specifically,

Re: sub 60 second performance

2009-05-11 Thread Edward Capriolo
On Mon, May 11, 2009 at 12:08 PM, Todd Lipcon t...@cloudera.com wrote: In addition to Jason's suggestion, you could also see about setting some of Hadoop's directories to subdirs of /dev/shm. If the dataset is really small, it should be easy to re-load it onto the cluster if it's lost, so even

Re: how to improve the Hadoop's capability of dealing with small files

2009-05-07 Thread Edward Capriolo
2009/5/7 Jeff Hammerbacher ham...@cloudera.com: Hey, You can read more about why small files are difficult for HDFS at http://www.cloudera.com/blog/2009/02/02/the-small-files-problem. Regards, Jeff 2009/5/7 Piotr Praczyk piotr.prac...@gmail.com If you want to use many small files, they

Cacti Templates for Hadoop

2009-05-06 Thread Edward Capriolo
For those of you that would like to graph the hadoop JMX variables with cacti, I have created cacti templates and data input scripts. Currently the package gathers and graphs the following information from the NameNode: Blocks Total, Files Total, Capacity Used/Capacity Free, Live Data Nodes/Dead Data

Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....

2009-05-05 Thread Edward Capriolo
'cloud computing' is a hot term. According to the definition provided by wikipedia http://en.wikipedia.org/wiki/Cloud_computing, Hadoop+HBase+Lucene+Zookeeper fits some of the criteria, but not well. Hadoop is scalable, and with HOD it is dynamically scalable. I do not think

Re: Getting free and used space

2009-05-02 Thread Edward Capriolo
You can also pull these variables from the namenode and datanode with JMX. I am doing this to graph them with cacti. Both the JMX READ/WRITE and READ users can access these variables. On Tue, Apr 28, 2009 at 8:29 AM, Stas Oskin stas.os...@gmail.com wrote: Hi. Any idea if the getDiskStatus()
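
For the API route the question asks about, a minimal sketch using DistributedFileSystem.getDiskStatus() from the 0.18/0.19-era API; the namenode URI is an assumption, and the package moved from org.apache.hadoop.dfs to org.apache.hadoop.hdfs around 0.20:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem; // org.apache.hadoop.dfs in older releases

    public class SpaceReport {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:9000"); // assumed namenode URI
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
        DistributedFileSystem.DiskStatus ds = dfs.getDiskStatus();
        System.out.println("capacity:  " + ds.getCapacity());
        System.out.println("dfs used:  " + ds.getDfsUsed());
        System.out.println("remaining: " + ds.getRemaining());
        dfs.close();
      }
    }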

Re: Hadoop / MySQL

2009-04-29 Thread Edward Capriolo
On Wed, Apr 29, 2009 at 10:19 AM, Stefan Podkowinski spo...@gmail.com wrote: If you have trouble loading your data into mysql using INSERTs or LOAD DATA, consider that MySQL supports CSV directly using the CSV storage engine. The only thing you have to do is to copy your hadoop-produced csv
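
A minimal sketch of the CSV-engine trick Stefan describes, done over JDBC; the table, columns, and connection URL are illustrative. The table's backing file (mydb/hits.CSV under the MySQL data directory) can then be swapped for the Hadoop output:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CsvEngineTable {
      public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
            "jdbc:mysql://dbhost/mydb", "user", "pass"); // assumed URL
        Statement st = conn.createStatement();
        // CSV-engine tables allow no indexes and require NOT NULL columns
        st.executeUpdate("CREATE TABLE hits (ip VARCHAR(15) NOT NULL, "
            + "cnt INT NOT NULL) ENGINE=CSV");
        // After replacing the backing .CSV file with job output,
        // FLUSH TABLES makes the new rows visible without any INSERTs
        st.execute("FLUSH TABLES");
        st.close();
        conn.close();
      }
    }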

Re: Hadoop / MySQL

2009-04-29 Thread Edward Capriolo
On Wed, Apr 29, 2009 at 2:48 PM, Todd Lipcon t...@cloudera.com wrote: On Wed, Apr 29, 2009 at 7:19 AM, Stefan Podkowinski spo...@gmail.com wrote: If you have trouble loading your data into mysql using INSERTs or LOAD DATA, consider that MySQL supports CSV directly using the CSV storage engine.

Re: max value for a dataset

2009-04-21 Thread Edward Capriolo
method? On Sat, Apr 18, 2009 at 10:59 AM, Edward Capriolo edlinuxg...@gmail.com wrote: I jumped into Hadoop at the 'deep end'. I know pig, hive, and hbase support the ability to max(). I am writing my own max() over a simple one column dataset. The best solution I came up with was using

Re: max value for a dataset

2009-04-20 Thread Edward Capriolo
, 2009 at 9:28 AM, Farhan Husain russ...@gmail.com wrote: How do you identify that the map task is ending within the map method? Is it possible to know which is the last call to the map method? On Sat, Apr 18, 2009 at 10:59 AM, Edward Capriolo edlinuxg...@gmail.com wrote: I jumped

max value for a dataset

2009-04-18 Thread Edward Capriolo
I jumped into Hadoop at the 'deep end'. I know pig, hive, and hbase support max(). I am writing my own max() over a simple one-column dataset. The best solution I came up with was using MapRunner. With MapRunner I can store the highest value in a private member variable. I can read
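
A minimal sketch of that pattern against the old mapred API of the era, using a plain Mapper rather than MapRunner; it keeps the running max in a member variable and emits a single record when the task ends. Class and field names are illustrative.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class MaxMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      private long max = Long.MIN_VALUE;
      private OutputCollector<Text, LongWritable> out; // saved for close()

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, LongWritable> output,
                      Reporter reporter) throws IOException {
        out = output; // remember the collector so close() can emit
        max = Math.max(max, Long.parseLong(value.toString().trim()));
      }

      public void close() throws IOException {
        if (out != null) {
          out.collect(new Text("max"), new LongWritable(max)); // one record per task
        }
      }
    }

A single reducer then takes the max over the per-task maxima.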

Re: Using HDFS to serve www requests

2009-03-27 Thread Edward Capriolo
but does Sun's Lustre follow in the steps of Gluster then? Yes. IMHO GlusterFS advertises benchmarks vs Lustre. The main difference is that GlusterFS is a FUSE (userspace) filesystem, while Lustre has to be patched into the kernel or built as a module.

Re: virtualization with hadoop

2009-03-26 Thread Edward Capriolo
I use linux-vserver http://linux-vserver.org/ The Linux-VServer technology is a soft partitioning concept based on Security Contexts which permits the creation of many independent Virtual Private Servers (VPS) that run simultaneously on a single physical server at full speed, efficiently sharing

Re: Using HDFS to serve www requests

2009-03-26 Thread Edward Capriolo
It is a little more natural to connect to HDFS from apache tomcat. This will allow you to skip the FUSE mounts and just use the HDFS-API. I have modified this code to run inside tomcat. http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample I will not testify to how well this setup will perform
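
In that spirit, a minimal sketch of reading a file through the HDFS API, along the lines of the wiki example linked above; the namenode URI and file path are assumptions:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DfsRead {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode:9000"); // assumed namenode URI
        FileSystem fs = FileSystem.get(conf);
        BufferedReader in = new BufferedReader(
            new InputStreamReader(fs.open(new Path("/logs/sample.txt"))));
        String line;
        while ((line = in.readLine()) != null) {
          System.out.println(line); // or write to the servlet response
        }
        in.close();
        fs.close();
      }
    }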

Re: Using Hadoop for near real-time processing of log data

2009-02-25 Thread Edward Capriolo
On Wed, Feb 25, 2009 at 1:13 PM, Mikhail Yakshin greycat.na@gmail.com wrote: Hi, Is anyone using Hadoop as more of a near/almost real-time processing of log data for their systems to aggregate stats, etc? We do, although near realtime is a pretty relative subject and your mileage may

Re: Using Hadoop for near real-time processing of log data

2009-02-25 Thread Edward Capriolo
Yeah, but what's the point of using Hadoop then? i.e. we lost all the parallelism? Some jobs do not need it. For example, I am working with the Hive sub-project. If I have a table that is smaller than my block size, having a large number of mappers or reducers is counterproductive. Hadoop will

Re: Batching key/value pairs to map

2009-02-23 Thread Edward Capriolo
We have a MR program that collects once for each token on a line. What types of applications can benefit from batch mapping?

Hadoop JMX

2009-02-20 Thread Edward Capriolo
I am working to graph the hadoop JMX variables. http://hadoop.apache.org/core/docs/r0.17.0/api/org/apache/hadoop/dfs/namenode/metrics/NameNodeStatistics.html I have two nodes, one running 0.17 and the other running 0.19. The NameNode JMX objects and attributes seem to be working well. I am
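
A minimal sketch of polling one of those attributes remotely with the standard JMX client API; the service URL, port, object name, and attribute name are all assumptions and vary between 0.17 and 0.19:

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class NameNodeJmxPoll {
      public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
            "service:jmx:rmi:///jndi/rmi://namenode:8004/jmxrmi"); // assumed port
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
        ObjectName name = new ObjectName(
            "hadoop.dfs:service=NameNode,name=NameNodeStatistics"); // differs by release
        System.out.println(mbsc.getAttribute(name, "numFilesCreated")); // assumed attribute
        jmxc.close();
      }
    }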

Re: Pluggable JDBC schemas [Was: How to use DBInputFormat?]

2009-02-13 Thread Edward Capriolo
One thing to mention is that 'limit' is not SQL standard. Microsoft SQL Server uses SELECT TOP 100 * FROM table. Some RDBMS may not support any such syntax. To be more SQL compliant you should use a column such as an auto-increment ID or a DATE as the offset. It is tricky to write anything truly database
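
A minimal sketch of the offset-by-key idea in plain JDBC: the vendor-specific LIMIT/TOP is replaced by an indexed key predicate plus the portable Statement.setMaxRows(). Table, column, and URL names are illustrative.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    public class KeysetPage {
      public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection(
            "jdbc:mysql://dbhost/mydb", "user", "pass"); // assumed URL
        long lastId = 0; // highest id seen so far; 0 on the first page
        PreparedStatement ps = conn.prepareStatement(
            "SELECT id, payload FROM events WHERE id > ? ORDER BY id");
        ps.setMaxRows(100); // JDBC-portable cap instead of LIMIT/TOP
        ps.setLong(1, lastId);
        ResultSet rs = ps.executeQuery();
        while (rs.next()) {
          lastId = rs.getLong("id"); // remember the offset for the next page
          // process rs.getString("payload") ...
        }
        rs.close();
        ps.close();
        conn.close();
      }
    }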

Using HOD to manage a production cluster

2009-01-31 Thread Edward Capriolo
I am looking at using HOD (Hadoop On Demand) to manage a production cluster. After reading the documentation, it seems that HOD is missing some things that would need to be carefully set in a production cluster. Rack Locality: HOD uses the -N 5 option and starts a cluster of N nodes. There seems

Re: Zeroconf for hadoop

2009-01-26 Thread Edward Capriolo
Zeroconf is more focused on simplicity than security. One of the original problems, which may have been fixed since, is that any program can announce any service, e.g. my laptop can announce that it is the DNS server for google.com. I want to mention a related topic to the list. People are approaching the

Re: Netbeans/Eclipse plugin

2009-01-25 Thread Edward Capriolo
On Sun, Jan 25, 2009 at 10:57 AM, vinayak katkar vinaykat...@gmail.com wrote: Anyone know of a Netbeans or Eclipse plugin for Hadoop Map-Reduce jobs? I want to make a plugin for netbeans http://vinayakkatkar.wordpress.com -- Vinayak Katkar Sun Campus Ambassador Sun Microsystems, India COEP

Re: Why does Hadoop need ssh access to master and slaves?

2009-01-23 Thread Edward Capriolo
I am looking to create some RA scripts and experiment with starting hadoop via linux-ha cluster manager. Linux HA would handle restarting downed nodes and eliminate the ssh key dependency.

Re: When I system.out.println() in a map or reduce, where does it go?

2008-12-10 Thread Edward Capriolo
Also be careful when you do this. If you are running map/reduce on a large file the map and reduce operations will be called many times. You can end up with a lot of output. Use log4j instead.
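
A minimal sketch of what that looks like in a mapper, using the old mapred API; the logger output lands in the per-task logs instead of flooding stdout. Names are illustrative.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;
    import org.apache.log4j.Logger;

    public class LoggingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      private static final Logger LOG = Logger.getLogger(LoggingMapper.class);

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, LongWritable> output,
                      Reporter reporter) throws IOException {
        if (LOG.isDebugEnabled()) {
          LOG.debug("processing record at offset " + key); // ends up in the task logs
        }
        // ... actual map work ...
      }
    }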

Re: File loss at Nebraska

2008-12-09 Thread Edward Capriolo
Also it might be useful to strongly word hadoop-default.conf, as many people might not know a downside exists for using 2 rather than 3 as the replication factor. Before reading this thread I would have thought 2 to be sufficient.

JDBC input/output format

2008-12-08 Thread Edward Capriolo
Is anyone working on a JDBC RecordReader/InputFormat? I was thinking this would be very useful for sending data into mappers. Writing data to a relational database might be more application dependent, but still possible.
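
A rough sketch of the idea: a RecordReader that feeds mapper input from a JDBC ResultSet (old mapred API; the InputFormat plumbing, splitting, and progress reporting are omitted, and all names are hypothetical).

    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.RecordReader;

    public class JdbcRecordReader implements RecordReader<LongWritable, Text> {
      private final Connection conn;
      private final ResultSet rs;
      private long row = 0;

      public JdbcRecordReader(Connection conn, String query) throws SQLException {
        this.conn = conn;
        this.rs = conn.createStatement().executeQuery(query);
      }

      public boolean next(LongWritable key, Text value) throws IOException {
        try {
          if (!rs.next()) return false;
          key.set(row++);
          value.set(rs.getString(1)); // one column per record, for simplicity
          return true;
        } catch (SQLException e) {
          throw new IOException(e.toString());
        }
      }

      public LongWritable createKey() { return new LongWritable(); }
      public Text createValue() { return new Text(); }
      public long getPos() { return row; }
      public float getProgress() { return 0.0f; } // unknown without a row count

      public void close() throws IOException {
        try {
          rs.close();
          conn.close();
        } catch (SQLException e) {
          throw new IOException(e.toString());
        }
      }
    }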

Hadoop IP Tables configuration

2008-12-03 Thread Edward Capriolo
All, I always run iptables on my systems. Most of the hadoop setup guides I have found skip iptables/firewall configuration. My namenode and task tracker are the same node. My current configuration is not working: as I submit jobs from the namenode, jobs are kicked off on the slave nodes but they

Re: What do you do with task logs?

2008-11-18 Thread Edward Capriolo
We just set up a log4j server. This takes the logs off the cluster. Plus you get all the benefits of log4j http://timarcher.com/?q=node/10

Re: nagios to monitor hadoop datanodes!

2008-10-29 Thread Edward Capriolo
All I have to say is wow! I never tried jconsole before. I have hadoop_trunk checked out and the JMX has all kinds of great information. I am going to look at how I can get JMX, cacti, and hadoop working together. Just as an FYI, there are separate ENV variables for each now. If you override

Re: LHadoop Server simple Hadoop input and output

2008-10-24 Thread Edward Capriolo
I came up with my line of thinking after reading this article: http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data As a guy who was intrigued by the java coffee cup in '95 and now lives as a data center/NOC jock/unix guy: let's say I look at a log

Re: LHadoop Server simple Hadoop input and output

2008-10-23 Thread Edward Capriolo
I had downloaded thrift and ran the example applications after the Hive meet up. It is very cool stuff. The thriftfs interface is more elegant than what I was trying to do, and that implementation is more complete. Still, someone might be interested in what I did if they want a super-light API :)

Hive Web-UI

2008-10-10 Thread Edward Capriolo
I was checking out this slide show: http://www.slideshare.net/jhammerb/2008-ur-tech-talk-zshao-presentation/ In the diagram a Web-UI exists. This is the first I have heard of it. Is this part of, or planned to be a part of, contrib/hive? I think a web interface for showing table schema and

Re: nagios to monitor hadoop datanodes!

2008-10-08 Thread Edward Capriolo
The simple way would be to use nrpe and check_procs. I have never tested it, but a command like 'ps -ef | grep java | grep NameNode' would be a fairly decent check. That is not very robust, but it should let you know if the process is alive. You could also monitor the web interfaces associated with

Re: nagios to monitor hadoop datanodes!

2008-10-08 Thread Edward Capriolo
That all sounds good. By 'quick hack' I meant 'check_tcp' was not good enough, because an open TCP socket does not prove much. However, if the page returns useful attributes that show the cluster is alive, that is great and easy. Come to think of it, you can navigate the dfshealth page and get useful
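
A minimal sketch of that idea as a Nagios-style probe against the NameNode web UI; the host, port (50070 was the usual default), and page name are assumptions:

    import java.net.HttpURLConnection;
    import java.net.URL;

    public class DfsHealthCheck {
      public static void main(String[] args) throws Exception {
        URL url = new URL("http://namenode:50070/dfshealth.jsp"); // assumed host/port
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5000);
        int code = conn.getResponseCode();
        if (code != 200) {
          System.out.println("CRITICAL: dfshealth returned " + code);
          System.exit(2); // Nagios CRITICAL
        }
        System.out.println("OK: dfshealth reachable");
        System.exit(0); // Nagios OK
      }
    }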

Re: Hadoop and security.

2008-10-06 Thread Edward Capriolo
You bring up some valid points. This would be a great topic for a white paper. The first line of defense should be to apply inbound and outbound iptables rules. Only source IPs that have a direct need to interact with the cluster should be allowed to. The same is true with the web access. Only a

Re: Hive questions about the meta db

2008-10-02 Thread Edward Capriolo
I am doing a lot of testing with Hive; I will be sure to add this information to the wiki once I get it going. Thus far I downloaded the same version of derby that hive uses. I have verified that the connection is up and running. ij version 10.4 ij connect

Re: Hive questions about the meta db

2008-10-02 Thread Edward Capriolo
<name>hive.metastore.local</name> <value>true</value> Why would I set this property to true? My goal is to store the metadata in an external database. If I set this to true, the metastore is created in the working directory.

Re: Hive questions about the meta db

2008-10-02 Thread Edward Capriolo
I determined the problem once I set the log4j properties to debug. derbyclient.jar and derbytools.jar do not ship with hive. As a result, when you try to load org.apache.derby.jdbc.ClientDriver you get an InvocationTargetException. The solution for this was to download Derby and place those files
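
A quick check, assuming derbyclient.jar and derbytools.jar are now on the classpath, that the network driver loads and connects; the JDBC URL and port (1527 is Derby's network-server default) are illustrative:

    import java.sql.Connection;
    import java.sql.DriverManager;

    public class DerbyDriverCheck {
      public static void main(String[] args) throws Exception {
        // Throws ClassNotFoundException if derbyclient.jar is missing
        Class.forName("org.apache.derby.jdbc.ClientDriver");
        Connection conn = DriverManager.getConnection(
            "jdbc:derby://localhost:1527/metastore_db;create=true");
        System.out.println("connected: " + !conn.isClosed());
        conn.close();
      }
    }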

Re: text extraction from html based on uniqueness metric

2008-06-10 Thread Edward Capriolo
I have never tried this method. The concept came from a research paper I ran into. The goal was to detect the language of a piece of text by looking at several factors: average length of a word, average length of a sentence, average number of vowels in a word, etc. He used these to score an article,

Re: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-10 Thread Edward Capriolo
wait and sleep are not what you are looking for. You can use 'nohup' to run a job in the background and have its output piped to a file. On Tue, Jun 10, 2008 at 5:48 PM, Meng Mao [EMAIL PROTECTED] wrote: I'm interested in the same thing -- is there a recommended way to batch Hadoop jobs
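
If the shell route gets awkward, the same sequencing can be done from Java, since JobClient.runJob() blocks until the job completes and throws on failure; the job configurations here are only sketched:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SequentialJobs {
      public static void main(String[] args) throws Exception {
        JobConf first = new JobConf(SequentialJobs.class);
        // ... configure input/output paths, mapper, reducer for job 1 ...
        JobClient.runJob(first); // blocks until completion, throws on failure

        JobConf second = new JobConf(SequentialJobs.class);
        // ... job 2 typically reads job 1's output ...
        JobClient.runJob(second);
      }
    }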

Re: Hadoop Distributed Virtualisation

2008-06-06 Thread Edward Capriolo
I once asked a wise man in charge of a rather large multi-datacenter service, Have you ever considered virtualization? He replied, All the CPUs here are pegged at 100%. There may be applications for this type of processing. I have thought about systems like this from time to time. This thinking