I posted an idea for an extension to MultipleFileInputFormat, if someone has
any extra time. *smile*
https://issues.apache.org/jira/browse/HADOOP-4057
-- Owen
Hey Juho,
You should check out Hive
(https://issues.apache.org/jira/browse/HADOOP-3601), which was just
committed to the Hadoop trunk today. It's what we use at Facebook to
query our collection of Thrift-serialized logfiles. Inside of the Hive
code, you'll find a pure-Java (using JavaCC) parser fo
On Tue, Sep 2, 2008 at 9:13 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
> I see... so there really isn't a way for me to test a map/reduce
> program using a single node without incurring the overhead of
> upping/downing JVMs... My input is broken up into 5 text files; is
> there a way I could
I see... so there really isn't a way for me to test a map/reduce
program using a single node without incurring the overhead of
upping/downing JVMs... My input is broken up into 5 text files; is
there a way I could start the job such that it only uses 1 map to
process the whole thing? I guess I'
On Sep 2, 2008, at 9:00 PM, Ryan LeCompte wrote:
Beginner's question:
If I have a cluster with a single node that has a max of 1 map/1
reduce, and the job submitted has 50 maps... then it will process only
1 map at a time. Does that mean that it's spawning 1 new JVM for each
map processed? Or
Beginner's question:
If I have a cluster with a single node that has a max of 1 map/1
reduce, and the job submitted has 50 maps... then it will process only
1 map at a time. Does that mean that it's spawning 1 new JVM for each
map processed? Or re-using the same JVM when a new map can be
processed?
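Not from the thread, but a rough old-API sketch of the two questions above: in this era of Hadoop each map task is launched in its own child JVM on the tasktracker, and setNumMapTasks() is only a hint, since file-based input formats create at least one split per input file (so five text files still mean at least five map tasks unless they are combined, e.g. with a MultiFileInputFormat-style input format). MyDriver and the paths are placeholders.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MyDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MyDriver.class);

    // Hint only: FileInputFormat still creates at least one split
    // (and therefore one map task, each in its own child JVM) per input file.
    conf.setNumMapTasks(1);

    FileInputFormat.setInputPaths(conf, new Path("input"));   // placeholder
    FileOutputFormat.setOutputPath(conf, new Path("output")); // placeholder

    JobClient.runJob(conf);
  }
}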
Another good idea would be to create a VM image with Hadoop installed and
configured for a single-node cluster. Just throwing that out there.
Alex
On Wed, Sep 3, 2008 at 8:37 AM, George Porter <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I'd like to announce the availability of an open-source "Live CD"
On Tue, Sep 2, 2008 at 10:24 AM, Shirley Cohen <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I'm trying to write the output of two different map-reduce jobs into the
> same output directory. I'm using MultipleOutputFormat to set the filename
> dynamically, so there is no filename collision between the two
On Wed, Sep 3, 2008 at 1:24 AM, Shirley Cohen <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I'm trying to write the output of two different map-reduce jobs into the
> same output directory. I'm using MultipleOutputFormat to set the filename
> dynamically, so there is no filename collision between the two
Hello,
I'd like to announce the availability of an open-source "Live CD"
aimed at providing new users of Hadoop with a fully functional,
pre-configured Hadoop cluster that is easy to start up and use, and
lets people get a quick look at what Hadoop offers in terms of power
and ease of use.
Is there anything else on the partition where the DFS data directory is located?
IOW, what does 'df -k' show when you run it in 'dfs.data.dir'?
Raghu.
Victor Samoylov wrote:
Hi,
I am running HDFS with 3 datanodes and saw that at the beginning (no files in HDFS)
I have only 15 GB instead of 22 GB; see the following
I assume that Karl means 'regions' - i.e. Europe or US. I don't think S3
has the same concept of availability zones that EC2 has.
Between different regions, data transfer is 1) charged for and 2) likely
slower - e.g. between EC2 and S3-Europe.
Transfer between S3-US and EC2 is free of charge, and sh
How can you ensure that the S3 buckets and EC2 instances belong to a
certain zone?
Ryan
On Tue, Sep 2, 2008 at 2:38 PM, Karl Anderson <[EMAIL PROTECTED]> wrote:
>
> On 2-Sep-08, at 5:22 AM, Ryan LeCompte wrote:
>
>> Hi Tim,
>>
>> Are you mostly just processing/parsing textual log files? How many
On 2-Sep-08, at 5:22 AM, Ryan LeCompte wrote:
Hi Tim,
Are you mostly just processing/parsing textual log files? How many
maps/reduces did you configure in your hadoop-ec2-env.sh file? How
many did you configure in your JobConf? Just trying to get an idea of
what to expect in terms of performance.
Tom White's blog has a nice piece on the different setups you can have for a
Hadoop cluster on EC2:
http://www.lexemetech.com/2008/08/elastic-hadoop-clusters-with-amazons.html
With the EBS volumes you can bring up and take down your cluster at will, so
you don't need to have 20 machines running all the time.
Hi,
I'm trying to write the output of two different map-reduce jobs into
the same output directory. I'm using MultipleOutputFormat to set the
filename dynamically, so there is no filename collision between the
two jobs. However, I'm getting the error "output directory already
exists".
Great! I will give it a try.
Thanks for your email.
On Tue, Sep 2, 2008 at 10:58 AM, Mikhail Yakshin
<[EMAIL PROTECTED]> wrote:
> On Tue, Sep 2, 2008 at 7:33 PM, Camilo Gonzalez wrote:
> > I was wondering if there is a way to "Hot-Swap" Slave machines, for example,
> > in case a Slave machine
On 9/2/08 8:33 AM, "Camilo Gonzalez" <[EMAIL PROTECTED]> wrote:
> I was wondering if there is a way to "Hot-Swap" Slave machines, for example,
> in case a Slave machine fails while the Cluster is running and I want to
> mount a new Slave machine to replace the old one, is there a way to tell the
tim robertson wrote:
Incidentally, I have most of the basics of a "MapReduce-Lite" which I
aim to port to the exact Hadoop API, since I am *only* working on
10s-100s of GB of data and find that it runs really fine on my
laptop and I don't need the distributed failover. My goal for that
On Tue, Sep 2, 2008 at 7:33 PM, Camilo Gonzalez wrote:
> I was wondering if there is a way to "Hot-Swap" Slave machines, for example,
> in case a Slave machine fails while the Cluster is running and I want to
> mount a new Slave machine to replace the old one, is there a way to tell the
> Master t
Hi!
I was wondering if there is a way to "Hot-Swap" Slave machines, for example,
in case a Slave machine fails while the Cluster is running and I want to
mount a new Slave machine to replace the old one. Is there a way to tell the
Master that a new Slave machine is online without having to stop a
Actually not, if you're using the s3:// filesystem as opposed to s3n:// ...
Thanks,
Ryan
On Tue, Sep 2, 2008 at 11:21 AM, James Moore <[EMAIL PROTECTED]> wrote:
> On Mon, Sep 1, 2008 at 1:32 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
>> Hello,
>>
>> I'm trying to upload a fairly large file (18GB or so) to
On Mon, Sep 1, 2008 at 1:32 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I'm trying to upload a fairly large file (18GB or so) to my AWS S3
> account via bin/hadoop fs -put ... s3://...
Isn't the maximum size of a file on S3 5 GB?
--
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails
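For context, a sketch of the distinction (not from the thread): s3:// is Hadoop's block filesystem, which stores a file as many smaller S3 objects, so a single 18 GB file is not limited by the 5 GB per-object cap of raw S3 (which is what s3n:// exposes). The bucket name, paths, and credential values below are placeholders.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3Upload {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Credentials can also live in hadoop-site.xml or be embedded in the URI.
    conf.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY");     // placeholder
    conf.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_KEY"); // placeholder

    // s3:// = block filesystem: the file is split into blocks stored as
    // separate S3 objects, so a single file can exceed 5 GB.
    FileSystem fs = FileSystem.get(URI.create("s3://mybucket/"), conf);
    fs.copyFromLocalFile(new Path("/local/logs/big.log"),
                         new Path("/logs/big.log"));
  }
}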
Hi,
I am running HDFS with 3 datanodes and saw that at the beginning (no files
in HDFS) I have only 15 GB instead of 22 GB; see the following live status
of my nodes:
10 files and directories, 0 blocks = 10 total. Heap Size is 5.21 MB /
992.31 MB (0%)
Capacity: 22.64 GB   DFS Remaining: 15.42 GB   DFS Used:
On Tue, Sep 2, 2008 at 3:53 AM, Juho Mäkinen <[EMAIL PROTECTED]> wrote:
> What's the current status of Thrift with Hadoop? Is there any
> documentation online or even some code in the SVN which I could look
> into?
I think you have two choices: 1) wrap your Thrift code in a class that
implements Writable
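A minimal sketch of option 1 (my own illustration, not from the thread): a Writable that just carries the serialized bytes of a Thrift struct. How the bytes are produced and parsed (e.g. with Thrift's serializer classes) is left out, since that depends on the Thrift version in use.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class ThriftBytesWritable implements Writable {
  private byte[] bytes = new byte[0];

  public ThriftBytesWritable() {}
  public ThriftBytesWritable(byte[] bytes) { this.bytes = bytes; }

  public byte[] get() { return bytes; }

  // Write a length-prefixed byte array so the record round-trips cleanly
  // through MapReduce's sort and shuffle.
  public void write(DataOutput out) throws IOException {
    out.writeInt(bytes.length);
    out.write(bytes);
  }

  public void readFields(DataInput in) throws IOException {
    int len = in.readInt();
    bytes = new byte[len];
    in.readFully(bytes);
  }
}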
Hi Tim,
Thanks for responding -- I believe that I'll need the full power of
Hadoop since I'll want this to scale well beyond 100GB of data. Thanks
for sharing your experiences -- I'll definitely check out your blog.
Thanks!
Ryan
On Tue, Sep 2, 2008 at 8:47 AM, tim robertson <[EMAIL PROTECTED]> wrote:
Hi Ryan,
I actually blogged my experience as it was my first usage of EC2:
http://biodivertido.blogspot.com/2008/06/hadoop-on-amazon-ec2-to-generate.html
My input data was not log files but actually a dump of 150 million
records from MySQL into about 13 columns of tab-delimited data, I believe.
It was a
Hi Tim,
Are you mostly just processing/parsing textual log files? How many
maps/reduces did you configure in your hadoop-ec2-env.sh file? How
many did you configure in your JobConf? Just trying to get an idea of
what to expect in terms of performance. I'm noticing that it takes
about 16 minutes to
We are already using Thrift to move and store our log data and I'm
looking into how I could read the stored log data into MapReduce
processes. This article
http://www.lexemetech.com/2008/07/rpc-and-serialization-with-hadoop.html
talks about using Thrift for the IO, but it doesn't say anything
specific
I have been processing only 100s of GBs on EC2, not 1000s, using 20
nodes, and really only in the exploration and testing phase right now.
On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock <[EMAIL PROTECTED]> wrote:
> Hi Ryan,
>
> Just a heads up, if you require more than the 20 node limit, Amazon
> p