Re: JVM Spawning

2008-09-02 Thread Owen O'Malley
I posted an idea for an extension for MultipleFileInputFormat if someone has any extra time. *smile* https://issues.apache.org/jira/browse/HADOOP-4057 -- Owen

Re: Reading and writing Thrift data from MapReduce

2008-09-02 Thread Jeff Hammerbacher
Hey Juho, You should check out Hive (https://issues.apache.org/jira/browse/HADOOP-3601), which was just committed to the Hadoop trunk today. It's what we use at Facebook to query our collection of Thrift-serialized logfiles. Inside of the Hive code, you'll find a pure-Java (using JavaCC) parser fo

Re: JVM Spawning

2008-09-02 Thread Owen O'Malley
On Tue, Sep 2, 2008 at 9:13 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote: > I see... so there really isn't a way for me to test a map/reduce > program using a single node without incurring the overhead of > upping/downing JVMs... My input is broken up into 5 text files; is > there a way I could
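
A minimal sketch of one way to cut the job down to fewer maps with the old org.apache.hadoop.mapred API (not necessarily what Owen suggested above; WholeFileTextInputFormat is a hypothetical name). Marking the input format non-splittable yields exactly one map per input file, so the five text files become five maps; collapsing them into a single map would additionally need something that groups several files into one split, along the lines of MultiFileInputFormat.

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Never split input files: each input file becomes exactly one map.
    public class WholeFileTextInputFormat extends TextInputFormat {
        @Override
        protected boolean isSplitable(FileSystem fs, Path file) {
            return false;
        }
    }

    // Usage sketch (job class name hypothetical):
    //   JobConf conf = new JobConf(MyJob.class);
    //   conf.setInputFormat(WholeFileTextInputFormat.class);
    //   conf.setNumMapTasks(1); // only a hint; the real count follows the splits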

Re: JVM Spawning

2008-09-02 Thread Ryan LeCompte
I see... so there really isn't a way for me to test a map/reduce program using a single node without incurring the overhead of upping/downing JVMs... My input is broken up into 5 text files; is there a way I could start the job such that it only uses 1 map to process the whole thing? I guess I'

Re: JVM Spawning

2008-09-02 Thread Owen O'Malley
On Sep 2, 2008, at 9:00 PM, Ryan LeCompte wrote: Beginner's question: If I have a cluster with a single node that has a max of 1 map/1 reduce, and the job submitted has 50 maps... Then it will process only 1 map at a time. Does that mean that it's spawning 1 new JVM for each map processed? Or

JVM Spawning

2008-09-02 Thread Ryan LeCompte
Beginner's question: If I have a cluster with a single node that has a max of 1 map/1 reduce, and the job submitted has 50 maps... Then it will process only 1 map at a time. Does that mean that it's spawning 1 new JVM for each map processed? Or re-using the same JVM when a new map can be processed
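
For context, a hedged sketch rather than Owen's actual answer (which is truncated above): in the Hadoop of this era each map or reduce task is normally launched in its own freshly spawned child JVM, which is exactly the overhead being asked about. Task-JVM reuse (the mapred.job.reuse.jvm.num.tasks knob) was only being added around this time and is exposed on JobConf in a later release, so treat the method name below as an assumption if you are on 0.18.

    import org.apache.hadoop.mapred.JobConf;

    public class JvmReuseSketch {
        public static void main(String[] args) {
            JobConf conf = new JobConf(JvmReuseSketch.class);
            // -1 = reuse the child JVM for an unlimited number of tasks of the
            // same job; the default of 1 means one fresh JVM per task.
            conf.setNumTasksToExecutePerJvm(-1);
            // Property form: conf.set("mapred.job.reuse.jvm.num.tasks", "-1");
        }
    }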

Re: Distributed Hadoop available on an OpenSolaris-based Live CD

2008-09-02 Thread Alex Loddengaard
Another good idea would be to create a VM image with Hadoop installed and configured for a single-node cluster. Just throwing that out there. Alex On Wed, Sep 3, 2008 at 8:37 AM, George Porter <[EMAIL PROTECTED]> wrote: > Hello, > > I'd like to announce the availability of an open-source "Live

Re: Output directory already exists

2008-09-02 Thread Owen O'Malley
On Tue, Sep 2, 2008 at 10:24 AM, Shirley Cohen <[EMAIL PROTECTED]> wrote: > Hi, > > I'm trying to write the output of two different map-reduce jobs into the > same output directory. I'm using MultipleOutputFormats to set the filename > dynamically, so there is no filename collision between the two

Re: Output directory already exists

2008-09-02 Thread Mafish Liu
On Wed, Sep 3, 2008 at 1:24 AM, Shirley Cohen <[EMAIL PROTECTED]> wrote: > Hi, > > I'm trying to write the output of two different map-reduce jobs into the > same output directory. I'm using MultipleOutputFormats to set the filename > dynamically, so there is no filename collision between the two

Distributed Hadoop available on an OpenSolaris-based Live CD

2008-09-02 Thread George Porter
Hello, I'd like to announce the availability of an open-source “Live CD” aimed at providing users new to Hadoop with a fully functional, pre-configured Hadoop cluster that is easy to start up and use and lets people get a quick look at what Hadoop offers in terms of power and ease of use.

Re: HDFS space utilization

2008-09-02 Thread Raghu Angadi
Is there anything else on the partition where the DFS data directory is located? IOW, what does 'df -k' show when you run it under 'dfs.data.dir'? Raghu. Victor Samoylov wrote: Hi, I ran 3 data-nodes as HDFS and saw that at the beginning (no files in HDFS) I have only 15 GB instead of 22 GB, see following

Re: Hadoop & EC2

2008-09-02 Thread Russell Smith
I assume that Karl means 'regions' - i.e. Europe or US. I don't think S3 has the same premise of availability zones that EC2 has. Between different regions, data transfer is 1) charged for and 2) likely slower between EC2 and S3-Europe. Transfer between S3-US and EC2 is free of charge, and sh

Re: Hadoop & EC2

2008-09-02 Thread Ryan LeCompte
How can you ensure that the S3 buckets and EC2 instances belong to a certain zone? Ryan On Tue, Sep 2, 2008 at 2:38 PM, Karl Anderson <[EMAIL PROTECTED]> wrote: > > On 2-Sep-08, at 5:22 AM, Ryan LeCompte wrote: > >> Hi Tim, >> >> Are you mostly just processing/parsing textual log files? How many

Re: Hadoop & EC2

2008-09-02 Thread Karl Anderson
On 2-Sep-08, at 5:22 AM, Ryan LeCompte wrote: Hi Tim, Are you mostly just processing/parsing textual log files? How many maps/reduces did you configure in your hadoop-ec2-env.sh file? How many did you configure in your JobConf? Just trying to get an idea of what to expect in terms of performan

Re: Hadoop & EC2

2008-09-02 Thread Michael Stoppelman
Tom White's blog has a nice piece on the different setups you can have for a hadoop cluster on EC2: http://www.lexemetech.com/2008/08/elastic-hadoop-clusters-with-amazons.html With the EBS volumes you can bring up and take down your cluster at will so you don't need to have 20 machines running all

Output directory already exists

2008-09-02 Thread Shirley Cohen
Hi, I'm trying to write the output of two different map-reduce jobs into the same output directory. I'm using MultipleOutputFormats to set the filename dynamically, so there is no filename collision between the two jobs. However, I'm getting the error "output directory already exists".
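
One common workaround, sketched here with hypothetical paths (the replies above are truncated, so this is not necessarily the fix Owen or Mafish proposed): FileOutputFormat refuses to start a job whose output directory already exists, regardless of how MultipleOutputFormats names the files inside it. Letting each job write to its own directory and then moving the uniquely named part files into the shared directory afterwards sidesteps the check.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MergeJobOutputs {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path shared = new Path("/user/shirley/combined");   // hypothetical
            fs.mkdirs(shared);
            String[] jobDirs = {"/user/shirley/job1-out", "/user/shirley/job2-out"};
            for (String dir : jobDirs) {
                for (FileStatus f : fs.listStatus(new Path(dir))) {
                    // Filenames were made unique by MultipleOutputFormats,
                    // so the renames cannot collide.
                    fs.rename(f.getPath(), new Path(shared, f.getPath().getName()));
                }
            }
        }
    }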

Re: Slaves "Hot-Swaping"

2008-09-02 Thread Camilo Gonzalez
Great! I will give it a try. Thanks for your email. On Tue, Sep 2, 2008 at 10:58 AM, Mikhail Yakshin <[EMAIL PROTECTED]> wrote: > On Tue, Sep 2, 2008 at 7:33 PM, Camilo Gonzalez wrote: > > I was wondering if there is a way to "Hot-Swap" Slave machines, for > example, > > in case a Slave machine

Re: Slaves "Hot-Swaping"

2008-09-02 Thread Allen Wittenauer
On 9/2/08 8:33 AM, "Camilo Gonzalez" <[EMAIL PROTECTED]> wrote: > I was wondering if there is a way to "Hot-Swap" Slave machines, for example, > in case a Slave machine fails while the Cluster is running and I want to > mount a new Slave machine to replace the old one, is there a way to tell t

Re: Hadoop & EC2

2008-09-02 Thread Andrzej Bialecki
tim robertson wrote: Incidentally, I have most of the basics of a "MapReduce-Lite" which I aim to port to use the exact Hadoop API since I am *only* working on 10s-100s of GB of data and find that it is running really fine on my laptop and I don't need the distributed failover. My goal for that

Re: Slaves "Hot-Swaping"

2008-09-02 Thread Mikhail Yakshin
On Tue, Sep 2, 2008 at 7:33 PM, Camilo Gonzalez wrote: > I was wondering if there is a way to "Hot-Swap" Slave machines, for example, > in case a Slave machine fails while the Cluster is running and I want to > mount a new Slave machine to replace the old one, is there a way to tell the > Master t

Slaves "Hot-Swaping"

2008-09-02 Thread Camilo Gonzalez
Hi! I was wondering if there is a way to "Hot-Swap" Slave machines, for example, in case a Slave machine fails while the Cluster is running and I want to mount a new Slave machine to replace the old one, is there a way to tell the Master that a new Slave machine is Online without having to stop a

Re: Error while uploading large file to S3 via Hadoop 0.18

2008-09-02 Thread Ryan LeCompte
Actually not if you're using the s3:// as opposed to s3n:// ... Thanks, Ryan On Tue, Sep 2, 2008 at 11:21 AM, James Moore <[EMAIL PROTECTED]> wrote: > On Mon, Sep 1, 2008 at 1:32 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote: >> Hello, >> >> I'm trying to upload a fairly large file (18GB or so) to
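
A minimal sketch of why the block filesystem avoids the limit, with a hypothetical bucket name and local path, and AWS credentials assumed to be set via fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey: the s3:// scheme stores files as many small blocks, so no single S3 object ever approaches the 5 GB ceiling, whereas s3n:// writes each file as one native S3 object and is bound by it.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3BlockUpload {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Block filesystem: the 18 GB file is chunked into blocks on upload.
            FileSystem s3 = FileSystem.get(URI.create("s3://mybucket/"), conf);
            s3.copyFromLocalFile(new Path("/data/big-18gb.log"),
                                 new Path("s3://mybucket/logs/big-18gb.log"));
        }
    }

The trade-off is that data written through s3:// lives in Hadoop's own block format and is not directly readable by other S3 tools, unlike s3n:// objects.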

Re: Error while uploading large file to S3 via Hadoop 0.18

2008-09-02 Thread James Moore
On Mon, Sep 1, 2008 at 1:32 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote: > Hello, > > I'm trying to upload a fairly large file (18GB or so) to my AWS S3 > account via bin/hadoop fs -put ... s3://... Isn't the maximum size of a file on s3 5GB? -- James Moore | [EMAIL PROTECTED] Ruby and Ruby on R

HDFS space utilization

2008-09-02 Thread Victor Samoylov
Hi, I ran 3 data-nodes as HDFS and saw that at the beginning (no files in HDFS) I have only 15 GB instead of 22 GB, see the following live status of my nodes: *10 files and directories, 0 blocks = 10 total. Heap Size is 5.21 MB / 992.31 MB (0%) * Capacity : 22.64 GB DFS Remaining : 15.42 GB DFS Used :

Re: Reading and writing Thrift data from MapReduce

2008-09-02 Thread Stuart Sierra
On Tue, Sep 2, 2008 at 3:53 AM, Juho Mäkinen <[EMAIL PROTECTED]> wrote: > What's the current status of Thrift with Hadoop? Is there any > documentation online or even some code in the SVN which I could look > into? I think you have two choices: 1) wrap your Thrift code in a class that implements W
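
A minimal sketch of option (1), wrapping a Thrift-generated struct in a Hadoop Writable. LogEntry is a hypothetical Thrift-generated class, and TSerializer/TDeserializer are Thrift's Java serialization helpers, assumed to be available in the Thrift version in use.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;
    import org.apache.thrift.TDeserializer;
    import org.apache.thrift.TException;
    import org.apache.thrift.TSerializer;

    public class LogEntryWritable implements Writable {
        private LogEntry entry = new LogEntry();

        public LogEntry get() { return entry; }
        public void set(LogEntry e) { entry = e; }

        public void write(DataOutput out) throws IOException {
            try {
                byte[] bytes = new TSerializer().serialize(entry);
                out.writeInt(bytes.length);   // length prefix for readFields()
                out.write(bytes);
            } catch (TException e) {
                throw new IOException("Thrift serialization failed: " + e);
            }
        }

        public void readFields(DataInput in) throws IOException {
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            try {
                new TDeserializer().deserialize(entry, bytes);
            } catch (TException e) {
                throw new IOException("Thrift deserialization failed: " + e);
            }
        }
    }

With a wrapper like this, the log records can flow through SequenceFiles and map/reduce key/value slots like any other Writable.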

Re: Hadoop & EC2

2008-09-02 Thread Ryan LeCompte
Hi Tim, Thanks for responding -- I believe that I'll need the full power of Hadoop since I'll want this to scale well beyond 100GB of data. Thanks for sharing your experiences -- I'll definitely check out your blog. Thanks! Ryan On Tue, Sep 2, 2008 at 8:47 AM, tim robertson <[EMAIL PROTECTED]>

Re: Hadoop & EC2

2008-09-02 Thread tim robertson
Hi Ryan, I actually blogged my experience as it was my first usage of EC2: http://biodivertido.blogspot.com/2008/06/hadoop-on-amazon-ec2-to-generate.html My input data was not log files but actually a dump of 150 million records from MySQL into about 13 columns of tab file data, I believe. It was a

Re: Hadoop & EC2

2008-09-02 Thread Ryan LeCompte
Hi Tim, Are you mostly just processing/parsing textual log files? How many maps/reduces did you configure in your hadoop-ec2-env.sh file? How many did you configure in your JobConf? Just trying to get an idea of what to expect in terms of performance. I'm noticing that it takes about 16 minutes to

Reading and writing Thrift data from MapReduce

2008-09-02 Thread Juho Mäkinen
We are already using Thrift to move and store our log data and I'm looking into how I could read the stored log data into MapReduce processes. This article http://www.lexemetech.com/2008/07/rpc-and-serialization-with-hadoop.html talks about using Thrift for the IO, but it doesn't say anything speci

Re: Hadoop & EC2

2008-09-02 Thread tim robertson
I have been processing only 100s of GBs on EC2, not 1000s, using 20 nodes, and really only in the exploration and testing phase right now. On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock <[EMAIL PROTECTED]> wrote: > Hi Ryan, > > Just a heads up, if you require more than the 20 node limit, Amazon > p