HDFS

2008-09-12 Thread Monchanin Eric
Hello to all,

I have been attracted by the Hadoop project while looking for a solution
for my application.
Basically, I have an application hosting user-generated content (images,
sounds, videos) and I would like to have it available at all times to
all my servers.
Servers will mainly add new content; users can manipulate the existing
content, make compositions, etc.

We have a few servers (2 for now) dedicated to hosting content, and
right now, they are connected via sshfs on some folders, in order to
shorten the transfer time between these content servers and the
application servers.

Would the Hadoop filesystem be useful in my case? Is it worth digging
into?

If it is doable, how redundant is the system? For instance, to store
1 MB of data, how much storage do I need (I guess at least 2 MB...)?

I hope I made myself clear enough and will get encouraging answers,

Bests to all,

Eric



Re: HDFS

2008-09-12 Thread Mikhail Yakshin
Hi,

> I have been attracted by the Hadoop project while looking for a solution
> for my application.
> Basically, I have an application hosting user generated content (images,
> sounds, videos) and I would like to have this available at all time for
> all my servers.
> Servers will basically add new content, user can manipulate the existing
> content, make compositions etc etc ...
>
> We have a few servers (2 for now) dedicated to hosting content, and
> right now, they are connected via sshfs on some folders, in order to
> shorten the transfer time between these content servers and the
> application servers.
>
> Would the Hadoop filesystem be useful in my case, is it worth digging
> into it.

I guess not; your best choice would be something like MogileFS. HDFS
is a filesystem optimized for distributed computation, and thus it
works best with big files (comparable to the block size, e.g. 64 MB).
Hosting lots of smaller files would be overkill.

-- 
WBR, Mikhail Yakshin


Re: HDFS

2008-09-12 Thread Robert Krüger

Hi Eric,

we are currently building a system for a very similar purpose (digital
asset management) and we currently use HDFS for a volume of approx.
100 TB, with the option to scale into the PB range. Since we haven't gone
into production yet, I cannot say it will work flawlessly, but so far
everything has worked very well, with really good performance (especially
read performance, which is probably also the most important factor in
your case). The most important thing you have to be aware of, IMHO, is
that you will not have a real file system at the OS level. If you use
tools which need that to process the data, you will need to do some
copying (which we do in some cases). There is a project out there that
makes HDFS available via FUSE, but it appears to be rather alpha, which
is why we haven't dared to take a look at it for this project.

Apart from the namenode, which you have to make redundant yourself (lots
of posts in the archives on this topic), you can simply configure the
level of redundancy (see docs).
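For example, something like this in hadoop-site.xml (values are only
illustrative; check the docs for your version) - with the default
replication of 3, your 1 MB of data takes roughly 3 MB of raw storage:

{code}
<property>
  <!-- number of copies HDFS keeps of each block; with 3, storing 1 MB of
       data uses roughly 3 MB of raw disk across the cluster -->
  <name>dfs.replication</name>
  <value>3</value>
</property>
{code}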

Hope this helps,

Robert


Monchanin Eric wrote:
> Hello to all,
> 
> I have been attracted by the Hadoop project while looking for a solution
> for my application.
> Basically, I have an application hosting user generated content (images,
> sounds, videos) and I would like to have this available at all time for
> all my servers.
> Servers will basically add new content, user can manipulate the existing
> content, make compositions etc etc ...
> 
> We have a few servers (2 for now) dedicated to hosting content, and
> right now, they are connected via sshfs on some folders, in order to
> shorten the transfer time between these content servers and the
> application servers.
> 
> Would the Hadoop filesystem be useful in my case, is it worth digging
> into it.
> 
> In the case it is doable, how redundant the system is ? for instance, to
> store 1 MB of data, how much storage do I need (I guess at least 2 MB ...) ?
> 
> I hope I made myself clear enough and will get encouraging answers,
> 
> Bests to all,
> 
> Eric
> 



Re: HDFS

2008-09-12 Thread James Moore
On Fri, Sep 12, 2008 at 3:08 AM, Robert Krüger <[EMAIL PROTECTED]> wrote:
> we are currently building a system for a very similar purpose (digital
> asset management) and we use HDFS currently for a volume of approx.
> 100TB with the option to scale into the PB range.

Robert, would you mind expanding on why you picked HDFS over something
like GFS or MogileFS?  I would have agreed with Mikhail - HDFS seems
like it's purpose-built for Hadoop, and wouldn't necessarily be the
best choice if you just wanted a filesystem.

-- 
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com


Re: How to move files from one location to another on hadoop

2008-09-12 Thread James Moore
On Wed, Jul 30, 2008 at 2:06 PM, Rutuja Joshi <[EMAIL PROTECTED]> wrote:
> Could anyone suggest any efficient way to move files from one location to
> another on Hadoop. Please note that both the locations are on HDFS.
> I tried looking for inbuilt file system APIs but couldn't find anything
> suitable.

The code you want to start with is:

src/core/org/apache/hadoop/fs/FsShell.java

(in 0.18.0, but I think it's been around for a while)

That's where you'll see the implementation of 'hadoop dfs -mv filea
fileb' - in this case, you're looking for rename().
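
If you want to do the same thing from Java rather than the shell, it's
roughly this (untested sketch; the paths are made up):

{code}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsMove {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up your hadoop-site.xml
    FileSystem fs = FileSystem.get(conf);       // the configured (HDFS) filesystem
    // rename() is what 'hadoop dfs -mv' boils down to; it moves files or
    // whole directories without copying any blocks
    boolean ok = fs.rename(new Path("/user/foo/filea"), new Path("/user/foo/fileb"));
    System.out.println("moved: " + ok);
  }
}
{code}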

-- 
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com


Re: namenode multithreaded

2008-09-12 Thread Raghu Angadi


The core of namenode functionality happens in a single thread because of a 
global lock, unfortunately. The other CPUs will still be used to some 
extent by network IO and other threads. Usually we don't see just one 
CPU at 100% and nothing on the other CPUs.


What kind of load do you have?

Raghu.

Dmitry Pushkarev wrote:

Hi.

 


My namenode runs on an 8-core server with lots of RAM, but it only uses one
core (100%).

Is it possible to tell namenode to use all available cores?

 


Thanks.






Re: How to move files from one location to another on hadoop

2008-09-12 Thread Chris Douglas
Copying between filesystems, particularly between HDFS filesystems, is  
best done with distcp:


http://hadoop.apache.org/core/docs/r0.18.0/distcp.html

-C

On Sep 12, 2008, at 8:04 AM, James Moore wrote:

On Wed, Jul 30, 2008 at 2:06 PM, Rutuja Joshi <[EMAIL PROTECTED]> wrote:

Could anyone suggest any efficient way to move files from one location to
another on Hadoop. Please note that both the locations are on HDFS.
I tried looking for inbuilt file system APIs but couldn't find anything
suitable.


The code you want to start with is:

src/core/org/apache/hadoop/fs/FsShell.java

(in 0.18.0, but I think it's been around for a while)

That's where you'll see the implementation of 'hadoop dfs -mv filea
fileb' - in this case, you're looking for rename().

--
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com




RE: namenode multithreaded

2008-09-12 Thread Dmitry Pushkarev
I have 15+ million small files I'd like to process and move around. Thus my
operations don't really involve the datanodes - they're idle when I, for
example, do FS operations (like sorting a bunch of new files written by the
tasktracker into appropriate folders). I tried using HADOOP_OPTS=-server
and it seems to help a little, but performance still isn't great. 

Perhaps the problem is in the way I manipulate the files - it's a perl script
over davfs2 over WebDAV, which uses the native API. 

Can anyone give an example of a jython or jruby script that would recursively
go over an HDFS folder and move all its files to a different folder? (My
programming skills are very modest..)


-Original Message-
From: Raghu Angadi [mailto:[EMAIL PROTECTED] 
Sent: Friday, September 12, 2008 9:41 AM
To: core-user@hadoop.apache.org
Subject: Re: namenode multitreaded


The core of namenode functionality happens in single thread because of a 
global lock, unfortunately. The other cpus would still be used to some 
extent by network IO and other threads. Usually we don't see just one 
cpu at 100% and nothing else on the other cpus.

What kind of load do you have?

Raghu.

Dmitry Pushkarev wrote:
> Hi.
> 
>  
> 
> My namenode runs on a 8-core server with lots of RAM, but it only uses one
> core (100%).
> 
> Is it possible to tell namenode to use all available cores?
> 
>  
> 
> Thanks.
> 
> 



Why can't Hadoop be used for online applications ?

2008-09-12 Thread souravm
Hi,

Here is a basic doubt.

I found it mentioned in various documentation that Hadoop is not 
recommended for online applications. Can anyone please elaborate on this?

Regards,
Sourav



Re: Why can't Hadoop be used for online applications ?

2008-09-12 Thread Ryan LeCompte
Hadoop is best suited for distributed processing of large data sets across
many machines. Most people use Hadoop to plow through large data sets in an
offline fashion. One approach is to use Hadoop to process your data and then
put the results in an optimized form in HBase (i.e., something similar to
Google's Bigtable). You can then use HBase to query the data in an
online-access fashion. Refer to http://hadoop.apache.org/hbase/ for more
information about HBase.

Ryan


On Fri, Sep 12, 2008 at 2:46 PM, souravm <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Here is a basic doubt.
>
> I found in different documentation it is mentioned that Hadoop is not 
> recommended for online applications. Can anyone please elaborate on the same ?
>
> Regards,
> Sourav
>


RE: Why can't Hadoop be used for online applications ?

2008-09-12 Thread souravm
Thanks Ryan for your inputs.

Regards,
Sourav


From: Ryan LeCompte [EMAIL PROTECTED]
Sent: Friday, September 12, 2008 11:55 AM
To: core-user@hadoop.apache.org
Subject: Re: Why can't Hadoop be used for online applications ?

Hadoop is best suited for distributed processing across many machines
of large data sets. Most people use Hadoop to plow through large data
sets in an offline fashion. One approach that you can use is to use
Hadoop to process your data, then put it in an optimized form in HBase
(i.e., similar to Google's Bigtable). Then, you can use HBase for
querying the data in an online-access fashion. Refer to
http://hadoop.apache.org/hbase/ for more information about HBase.

Ryan


On Fri, Sep 12, 2008 at 2:46 PM, souravm <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Here is a basic doubt.
>
> I found in different documentation it is mentioned that Hadoop is not 
> recommended for online applications. Can anyone please elaborate on the same ?
>
> Regards,
> Sourav
>


Re: Why can't Hadoop be used for online applications ?

2008-09-12 Thread Camilo Gonzalez
Hi Ryan!

Does this mean that HBase could be used for online applications, for
example replacing MySQL in database-driven applications?

Does anyone have any benchmarks comparing MySQL queries/updates with
HBase queries/updates?

Have a nice day,

Camilo.

On Fri, Sep 12, 2008 at 1:55 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:

> Hadoop is best suited for distributed processing across many machines
> of large data sets. Most people use Hadoop to plow through large data
> sets in an offline fashion. One approach that you can use is to use
> Hadoop to process your data, then put it in an optimized form in HBase
> (i.e., similar to Google's Bigtable). Then, you can use HBase for
> querying the data in an online-access fashion. Refer to
> http://hadoop.apache.org/hbase/ for more information about HBase.
>
> Ryan
>
>
> On Fri, Sep 12, 2008 at 2:46 PM, souravm <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > Here is a basic doubt.
> >
> > I found in different documentation it is mentioned that Hadoop is not
> recommended for online applications. Can anyone please elaborate on the same
> ?
> >
> > Regards,
> > Sourav
> >
>


Re: Why can't Hadoop be used for online applications ?

2008-09-12 Thread Ryan LeCompte
Hey Camilo,

HBase is not meant to be a replacement for MySQL or a traditional
RDBMS (HBase is not transactional, for instance). I'd recommend reading
the following article, which describes what HBase/Bigtable really is:

http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable

Thanks,
Ryan


On Fri, Sep 12, 2008 at 3:25 PM, Camilo Gonzalez <[EMAIL PROTECTED]> wrote:
> Hi Ryan!
>
> Does this means that HBase could be used for Online applications, for
> example, replacing MySQL in database-driven applications?
>
> Does anyone have any kind of benchmarks about the comparison between MySQL
> queries/updates and HBase queries/updates?
>
> Have a nice day,
>
> Camilo.
>
> On Fri, Sep 12, 2008 at 1:55 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
>
>> Hadoop is best suited for distributed processing across many machines
>> of large data sets. Most people use Hadoop to plow through large data
>> sets in an offline fashion. One approach that you can use is to use
>> Hadoop to process your data, then put it in an optimized form in HBase
>> (i.e., similar to Google's Bigtable). Then, you can use HBase for
>> querying the data in an online-access fashion. Refer to
>> http://hadoop.apache.org/hbase/ for more information about HBase.
>>
>> Ryan
>>
>>
>> On Fri, Sep 12, 2008 at 2:46 PM, souravm <[EMAIL PROTECTED]> wrote:
>> > Hi,
>> >
>> > Here is a basic doubt.
>> >
>> > I found in different documentation it is mentioned that Hadoop is not
>> recommended for online applications. Can anyone please elaborate on the same
>> ?
>> >
>> > Regards,
>> > Sourav
>> >
>>
>


Tips on sorting using Hadoop

2008-09-12 Thread Tenaali Ram
Hi,
I want to sort my records (consisting of string, int, float) using Hadoop.

One way I have found is to set the number of reducers to 1, but this would mean
all the records go to a single reducer and it won't be optimized. Can anyone
point me to a better way to do sorting with Hadoop?
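
Something like a custom Partitioner that splits the key range across the
reducers is what I'm imagining, so that each reducer's sorted output covers a
disjoint range and the concatenated part files are globally sorted - very
rough sketch only (old mapred API, split points made up; real code would
sample the keys to balance the ranges):

{code}
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class RangePartitioner implements Partitioner<Text, Text> {
  public void configure(JobConf job) { }
  public int getPartition(Text key, Text value, int numPartitions) {
    if (key.getLength() == 0) return 0;
    int c = key.charAt(0);  // bucket by the key's first character
    // reducer 0 gets the smallest keys, reducer numPartitions-1 the largest
    return Math.min(numPartitions - 1, c * numPartitions / 256);
  }
}
{code}

Is that the right direction?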

Thanks,
Tenaali


serialization.Deserializer.deserialize method help

2008-09-12 Thread Pete Wyckoff

This method's signature is
{code}
T deserialize(T);
{code}

But, the RecordReader next method is

{code}
boolean next(K,V);
{code}

So, if the deserialize method does not return the same T (i.e., K or V), how
would this new Object be propagated back through the RecordReader next method?

It seems the contract on the deserialize method is that it must return the
same T (although the javadocs say "may").

Am I missing something? And if not, why isn't the API boolean deserialize(T)?

Thanks, pete

Ps for things like Thrift, there's no way to re-use the object as there's no
clear method, so if this is the case, I don't see how it would work??



Accessing input files from different servers

2008-09-12 Thread souravm
Hi,

I would like to process a set of log files (say web server access logs) from a 
number of different machines. So I need to get those log files from the 
respective machines into my central HDFS.

To achieve this -
a) Do I need to install hadoop and start running HDFS (using start-dfs.sh) on 
all those machines where the log files are getting created, and then do a file 
get from the central HDFS server?
b) Is there any other way to achieve this?

Regards,
Sourav



Re: serialization.Deserializer.deserialize method help

2008-09-12 Thread Chris Douglas
If you pass in null to the deserializer, it creates a new instance and  
returns it; passing in an instance reuses it.


I don't understand the disconnect between Deserializer and the  
RecordReader. Does your RecordReader generate instances that only  
share a common subtype T? You need separate Deserializers for K and V,  
if that's the issue... -C
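
For example, a tiny self-contained illustration of that contract using the
stock Writable serialization (untested sketch; class and variable names are
just for the example):

{code}
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.serializer.Deserializer;
import org.apache.hadoop.io.serializer.SerializationFactory;
import org.apache.hadoop.io.serializer.Serializer;

public class DeserializerDemo {
  public static void main(String[] args) throws Exception {
    SerializationFactory factory = new SerializationFactory(new Configuration());

    // write one Text record into a buffer so there is something to read back
    Serializer<Text> ser = factory.getSerializer(Text.class);
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ser.open(bytes);
    ser.serialize(new Text("hello"));
    ser.close();

    Deserializer<Text> deser = factory.getDeserializer(Text.class);
    deser.open(new ByteArrayInputStream(bytes.toByteArray()));
    Text reused = new Text();
    // pass an instance: it gets filled in and handed back; pass null and the
    // deserializer would allocate and return a fresh instance instead
    Text out = deser.deserialize(reused);
    System.out.println(out + ", same object: " + (out == reused));
    deser.close();
  }
}
{code}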


On Sep 12, 2008, at 2:01 PM, Pete Wyckoff wrote:



This method's signature is
{code}
T deserialize(T);
{code}

But, the RecordReader next method is

{code}
boolean next(K,V);
{code}

So, if the deserialize method does not return the same T (i.e., K or V), how
would this new Object be propagated back thru the RecordReader next method.

It seems the contract on the deserialize method is that it must return the
same T (although the javadocs say "may").

Am I missing something? And if not, why isn't the API boolean deserialize(T)?

Thanks, pete

Ps for things like Thrift, there's no way to re-use the object as there's no
clear method, so if this is the case, I don't see how it would work??





Re: Thinking about retriving DFS metadata from datanodes!!!

2008-09-12 Thread Steve Loughran

叶双明 wrote:

Thanks for paying attention  to my tentative idea!

What I thought about isn't how to store the metadata, but a final (or
last-resort) way to recover valuable data in the cluster when the worst
happens (something that destroys the metadata on all of the multiple
NameNodes). I.e., if a terrorist attack or natural disaster destroys half
of the cluster nodes, including all the NameNodes, we can recover as much
data as possible by this mechanism, and have a good chance of recovering
the entire data of the cluster because of the original replication.



If you want to survive any event that loses a datacentre, you need to 
mirror the data off site, choosing that second site with an up-to-date 
fault-line map of the city, geological knowledge of where recent 
eruptions ended up, etc. Which is why nobody builds datacentres in 
Enumclaw, WA that I'm aware of; the spec for the fabs in/near Portland is 
that they ought to withstand 1-2 m of volcanic ash landing on them (what 
they'd have got if there'd been an easterly wind when Mount Saint Helens 
went). Then, once you have a safe location for the second site, talk 
to your telco about how the high-bandwidth backbones in your city flow 
(Metropolitan Area Ethernet and the like), and try to find somewhere 
that meets your requirements.


Then: come up with a protocol that efficiently keeps the two sites up to 
date. And reliably: S3 went down last month because they'd been using a 
Gossip-style update protocol but weren't checksumming everything, because 
there's no need on a LAN - but of course on a cross-city network more 
things can go wrong, and for them it did.


Something to keep multiple hadoop filesystems synchronised efficiently 
and reliably across sites could be very useful to many people.


-steve


Re: serialization.Deserializer.deserialize method help

2008-09-12 Thread Pete Wyckoff

What I mean is let's say I plug in a deserializer that always returns a new
Object - in that case, since everything is pass by value, the new object
cannot make its way back to the SequenceFileRecordReader user.

While(sequenceFileRecordReader.next(mykey, myvalue)) {
  // do something
}

And then my deserializers one/both looks like:

T deserialize(T obj) {
 // ignore obj
  return new T(params);
}

Obj would be the key or the value passed in by the user, but since I ignore
it, basically what happens is the deserialized value actually gets thrown
away. 

More specifically, it gets thrown away in SequenceFile.Reader I believe.

-- pete


On 9/12/08 2:20 PM, "Chris Douglas" <[EMAIL PROTECTED]> wrote:

> If you pass in null to the deserializer, it creates a new instance and
> returns it; passing in an instance reuses it.
> 
> I don't understand the disconnect between Deserializer and the
> RecordReader. Does your RecordReader generate instances that only
> share a common subtype T? You need separate Deserializers for K and V,
> if that's the issue... -C
> 
> On Sep 12, 2008, at 2:01 PM, Pete Wyckoff wrote:
> 
>> 
>> This method's signature is
>> {code}
>> T deserialize(T);
>> {code}
>> 
>> But, the RecordReader next method is
>> 
>> {code}
>> boolean next(K,V);
>> {code}
>> 
>> So, if the deserialize method does not return the same T (i.e., K or
>> V), how
>> would this new Object be propagated back thru the RecordReader next
>> method.
>> 
>> It seems the contract on the deserialize method is that it must
>> return the
>> same  T (although the javadocs say "may").
>> 
>> Am I missing something? And if not, why isn't the API boolean
>> deserialize(T)
>> ?
>> 
>> Thanks, pete
>> 
>> Ps for things like Thrift, there's no way to re-use the object as
>> there's no
>> clear method, so if this is the case, I don't see how it would work??
>> 
> 



Re: How to manage a large cluster?

2008-09-12 Thread Steve Loughran

James Moore wrote:

On Thu, Sep 11, 2008 at 5:46 AM, Allen Wittenauer <[EMAIL PROTECTED]> wrote:

On 9/11/08 2:39 AM, "Alex Loddengaard" <[EMAIL PROTECTED]> wrote:

I've never dealt with a large cluster, though I'd imagine it is managed the
same way as small clusters:

   Maybe. :)


Depends how often you like to be paged, doesn't it :)




   Instead, use a real system configuration management package such as
bcfg2, smartfrog, puppet, cfengine, etc.  [Steve, you owe me for the plug.
:) ]


Yes Allen, I owe you beer at the next apachecon we are both at.
Actually, I think Y! were one of the sponsors at the UK event, so we owe 
you for that too.




Or on EC2 and its competitors, just build a new image whenever you
need to update Hadoop itself.



1. It's still good to have as much automation of your image build as you 
can; if you can build new machine images on demand you can have fun/make a 
mess of things. Look at http://instalinux.com to see the web GUI for 
creating linux images on demand that is used inside HP.


2. When you try to bring up everything from scratch, you have a 
choreography problem. DNS needs to be up early, then your authentication 
system, the management tools, then the other parts of the system. If you 
have a project where hadoop is integrated with the front-end site, for 
example, your app servers have to stay offline until HDFS is live. So it 
does get complex.


3. The Hadoop nodes are good here in that you aren't required to bring 
up the namenode first; the datanodes will wait; same for the task 
trackers and job tracker. But if you, say, need to point everything at a 
new hostname for the namenode, well, that's a config change that needs 
to be pushed out, somehow.




I'm adding some stuff on different ways to deploy hadoop here:

http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment

-steve


Re: serialization.Deserializer.deserialize method help

2008-09-12 Thread Chris Douglas
Oh, I see what you mean. Yes, you need to reuse the objects that  
you're given in your deserializer.


This will change with HADOOP-1230, though. -C

On Sep 12, 2008, at 2:28 PM, Pete Wyckoff wrote:



What I mean is let's say I plug in a deserializer that always returns a new
Object - in that case, since everything is pass by value, the new object
cannot make its way back to the SequenceFileRecordReader user.

While(sequenceFileRecordReader.next(mykey, myvalue)) {
  // do something
}

And then my deserializers one/both looks like:

T deserialize(T obj) {
  // ignore obj
  return new T(params);
}

Obj would be the key or the value passed in by the user, but since I ignore
it, basically what happens is the deserialized value actually gets thrown
away.

More specifically, it gets thrown away in SequenceFile.Reader I believe.

-- pete


On 9/12/08 2:20 PM, "Chris Douglas" <[EMAIL PROTECTED]> wrote:

If you pass in null to the deserializer, it creates a new instance and
returns it; passing in an instance reuses it.

I don't understand the disconnect between Deserializer and the
RecordReader. Does your RecordReader generate instances that only
share a common subtype T? You need separate Deserializers for K and V,
if that's the issue... -C

On Sep 12, 2008, at 2:01 PM, Pete Wyckoff wrote:

This method's signature is
{code}
T deserialize(T);
{code}

But, the RecordReader next method is

{code}
boolean next(K,V);
{code}

So, if the deserialize method does not return the same T (i.e., K or V), how
would this new Object be propagated back thru the RecordReader next method.

It seems the contract on the deserialize method is that it must return the
same T (although the javadocs say "may").

Am I missing something? And if not, why isn't the API boolean deserialize(T)?

Thanks, pete

Ps for things like Thrift, there's no way to re-use the object as there's no
clear method, so if this is the case, I don't see how it would work??










Re: serialization.Deserializer.deserialize method help

2008-09-12 Thread Pete Wyckoff

Specifically, line 75 of SequenceFileRecordReader:

>boolean remaining = (in.next(key) != null);

Throws out the return value of SequenceFile.next which is the result of
deserialize(obj).

-- pete


On 9/12/08 2:28 PM, "Pete Wyckoff" <[EMAIL PROTECTED]> wrote:

> 
> What I mean is let's say I plug in a deserializer that always returns a new
> Object - in that case, since everything is pass by value, the new object
> cannot make its way back to the SequenceFileRecordReader user.
> 
> While(sequenceFileRecordReader.next(mykey, myvalue)) {
>   // do something
> }
> 
> And then my deserializers one/both looks like:
> 
> T deserialize(T obj) {
>  // ignore obj
>   return new T(params);
> }
> 
> Obj would be the key or the value passed in by the user, but since I ignore
> it, basically what happens is the deserialized value actually gets thrown
> away. 
> 
> More specifically, it gets thrown away in SequenceFile.Reader I believe.
> 
> -- pete
> 
> 
> On 9/12/08 2:20 PM, "Chris Douglas" <[EMAIL PROTECTED]> wrote:
> 
>> If you pass in null to the deserializer, it creates a new instance and
>> returns it; passing in an instance reuses it.
>> 
>> I don't understand the disconnect between Deserializer and the
>> RecordReader. Does your RecordReader generate instances that only
>> share a common subtype T? You need separate Deserializers for K and V,
>> if that's the issue... -C
>> 
>> On Sep 12, 2008, at 2:01 PM, Pete Wyckoff wrote:
>> 
>>> 
>>> This method's signature is
>>> {code}
>>> T deserialize(T);
>>> {code}
>>> 
>>> But, the RecordReader next method is
>>> 
>>> {code}
>>> boolean next(K,V);
>>> {code}
>>> 
>>> So, if the deserialize method does not return the same T (i.e., K or
>>> V), how
>>> would this new Object be propagated back thru the RecordReader next
>>> method.
>>> 
>>> It seems the contract on the deserialize method is that it must
>>> return the
>>> same  T (although the javadocs say "may").
>>> 
>>> Am I missing something? And if not, why isn't the API boolean
>>> deserialize(T)
>>> ?
>>> 
>>> Thanks, pete
>>> 
>>> Ps for things like Thrift, there's no way to re-use the object as
>>> there's no
>>> clear method, so if this is the case, I don't see how it would work??
>>> 
>> 
> 



apache.mirror99.com mirror is very out of date

2008-09-12 Thread Emmett Shear
http://apache.mirror99.com/lucene/hadoop  is quite out of date.

The only versions available are 0.14.2 and 0.15.2. 0.14.2 is marked as
"stable" and fails to build out of the box with ant (no target) on
Linux. (It seems like there are missing .template files or something;
the error is from line 133 of build.xml, if anyone cares.)

It should probably be updated or removed from the mirror list; it's
somewhat confusing to wind up there if you're new. I had to ask for
help in IRC to figure out what was going wrong.

E


Re: serialization.Deserializer.deserialize method help

2008-09-12 Thread Pete Wyckoff

Sorry - saw the response after I sent this. But the current javadocs are
wrong and should probably say must return what was passed in.


On 9/12/08 3:02 PM, "Pete Wyckoff" <[EMAIL PROTECTED]> wrote:

> 
> Specifically, line 75 of SequenceFileRecordReader:
> 
>>boolean remaining = (in.next(key) != null);
> 
> Throws out the return value of SequenceFile.next which is the result of
> deserialize(obj).
> 
> -- pete
> 
> 
> On 9/12/08 2:28 PM, "Pete Wyckoff" <[EMAIL PROTECTED]> wrote:
> 
>> 
>> What I mean is let's say I plug in a deserializer that always returns a new
>> Object - in that case, since everything is pass by value, the new object
>> cannot make its way back to the SequenceFileRecordReader user.
>> 
>> While(sequenceFileRecordReader.next(mykey, myvalue)) {
>>   // do something
>> }
>> 
>> And then my deserializers one/both looks like:
>> 
>> T deserialize(T obj) {
>>  // ignore obj
>>   return new T(params);
>> }
>> 
>> Obj would be the key or the value passed in by the user, but since I ignore
>> it, basically what happens is the deserialized value actually gets thrown
>> away. 
>> 
>> More specifically, it gets thrown away in SequenceFile.Reader I believe.
>> 
>> -- pete
>> 
>> 
>> On 9/12/08 2:20 PM, "Chris Douglas" <[EMAIL PROTECTED]> wrote:
>> 
>>> If you pass in null to the deserializer, it creates a new instance and
>>> returns it; passing in an instance reuses it.
>>> 
>>> I don't understand the disconnect between Deserializer and the
>>> RecordReader. Does your RecordReader generate instances that only
>>> share a common subtype T? You need separate Deserializers for K and V,
>>> if that's the issue... -C
>>> 
>>> On Sep 12, 2008, at 2:01 PM, Pete Wyckoff wrote:
>>> 
 
>>>> This method's signature is
>>>> {code}
>>>> T deserialize(T);
>>>> {code}
>>>>
>>>> But, the RecordReader next method is
>>>>
>>>> {code}
>>>> boolean next(K,V);
>>>> {code}
>>>>
>>>> So, if the deserialize method does not return the same T (i.e., K or V), how
>>>> would this new Object be propagated back thru the RecordReader next method.
>>>>
>>>> It seems the contract on the deserialize method is that it must return the
>>>> same T (although the javadocs say "may").
>>>>
>>>> Am I missing something? And if not, why isn't the API boolean deserialize(T)?
>>>>
>>>> Thanks, pete
>>>>
>>>> Ps for things like Thrift, there's no way to re-use the object as there's no
>>>> clear method, so if this is the case, I don't see how it would work??
>>>>
>>> 
>> 
> 



Parameterized deserializers?

2008-09-12 Thread Pete Wyckoff

If I have generic Serializers/Deserializers that take some runtime
information to instantiate, how would this work with the current
serializer/deserializer APIs? Depending on this runtime information, they
may return different Objects, although they all derive from the same class.

For example, for Thrift, I may have something called a ThriftSerializer that
is general:

{code}
public class ThriftDeserializer<T> implements Deserializer<T> {
  T deserialize(T);
}
{code}

How would I instantiate this, since the current getDeserializer takes only
the Class<T> but not a Configuration object? And how would I implement
createKey in the RecordReader?

In other words, I think we need a {code}Class<T> getClass();{code} method
in Deserializer and a {code}Deserializer<T> getDeserializer(Class<T> c,
Configuration conf);{code} method in Serializer.java.

Or is there another way to do this?

If not, I can open a JIRA for implementing parameterized serializers.

Thanks, pete





Re: serialization.Deserializer.deserialize method help

2008-09-12 Thread Owen O'Malley


On Sep 12, 2008, at 3:01 PM, Chris Douglas wrote:

Oh, I see what you mean. Yes, you need to reuse the objects that  
you're given in your deserializer.



This isn't true in the general case. The Java serializer, for instance,
always returns a new instance. The SequenceFile reader has a pair of
methods:


public Object next(Object key) throws IOException;
public Object nextValue(Object value) throws IOException;

so that you can read java serialized objects from a sequence file.  
They also work as map outputs and reduce outputs. The only place where  
you are hosed is the RecordReader interface.  HADOOP-1230's changes to  
the RecordReader were designed to fix the problem.


-- Owen


Re: Accessing input files from different servers

2008-09-12 Thread Tim Wintle

> a) Do I need to install hadoop and start reunning HDFS (using start-dfs.sh) 
> in all those machines where the log files are getting created ? And then do a 
> file get from the central HDFS server` ?

I'd install hadoop on those machines, but you don't have to start any nodes
there - you can talk to a cluster running elsewhere using the command-line
tools to put/get data to and from the cluster.

From what I recall, this is actually better than running datanodes locally,
because if you put data on from a machine running a datanode, the blocks
will tend to be written to that local machine first.
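
i.e. the machines producing the logs only need the Hadoop jars and a config
that points at the remote namenode. The upload itself is just a few lines of
Java - rough sketch, hostname/port and paths made up:

{code}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PushLog {
  public static void main(String[] args) throws Exception {
    // connect to the remote HDFS cluster (namenode host/port are made up)
    FileSystem fs = FileSystem.get(
        URI.create("hdfs://namenode.example.com:9000/"), new Configuration());
    // copy today's access log from this web server's local disk into HDFS,
    // under a per-host directory (name made up)
    fs.copyFromLocalFile(new Path("/var/log/httpd/access.log"),
                         new Path("/logs/web01/access.log"));
  }
}
{code}

(The same thing is a one-liner with 'bin/hadoop dfs -put' once the client
config points at the cluster.)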


Tim




Re: Parameterized deserializers?

2008-09-12 Thread Pete Wyckoff

I should mention this is outside the context of SequenceFiles, where we get
the class names from the file itself. Here there is some information inserted
into the JobConf that tells me the class of the records in the input file.


-- pete


On 9/12/08 3:26 PM, "Pete Wyckoff" <[EMAIL PROTECTED]> wrote:

> 
> If I have a generic Serializer/Deserializers that take some runtime
> information to instantiate, how would this work in the current
> serializer/deserializer APIs? And depending on this runtime information, may
> return different Objects although they may all derive from the same class.
> 
> For example, for Thrift, I may have something called a ThriftSerializer that
> is general:
> 
> {code}
> Public class ThriftDeserializer implements
> Deserializer {
>   T deserialize(T);
> }
> {code}
> 
> How would I instantiate this, since the current getDeserializer takes only
> the Class but not configuration object.
> How would I implement createKey in RecordReader
> 
> 
> In other words, I think we need a  {code}Class getClass();  {code} method
> in Deserializer() and a {code}Deserializer getDeserializer(Class,
> Configuration conf); {code} method in Serializer.java.
> 
> Or is there another way to do this?
> 
> IF not, I can open a JIRA for implementing parameterized serializers.
> 
> Thanks, pete
> 
> 
> 



Re: Parameterized deserializers?

2008-09-12 Thread Tom White
If you make your Serialization implement Configurable it will be given
a Configuration object that it can pass to the Deserializer on
construction.
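
A rough sketch of that pattern (class names and the conf key are
hypothetical, and the actual decoding is elided):

{code}
import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.serializer.Deserializer;
import org.apache.hadoop.io.serializer.Serialization;
import org.apache.hadoop.io.serializer.Serializer;

// Because a Configurable Serialization gets setConf() called on it when it is
// instantiated, getDeserializer() can read whatever runtime parameters were
// stashed in the JobConf and build the deserializer accordingly.
public class MySerialization extends Configured implements Serialization<MyRecord> {

  public boolean accept(Class<?> c) {
    return MyRecord.class.isAssignableFrom(c);
  }

  public Deserializer<MyRecord> getDeserializer(Class<MyRecord> c) {
    // hypothetical conf key carrying the runtime info (e.g. the record class name)
    final String recordClass = getConf().get("my.record.class");
    return new Deserializer<MyRecord>() {
      private InputStream in;
      public void open(InputStream in) { this.in = in; }
      public MyRecord deserialize(MyRecord t) throws IOException {
        MyRecord r = (t == null) ? new MyRecord() : t;
        // ... decode the next record from 'in' according to recordClass ...
        return r;
      }
      public void close() throws IOException { in.close(); }
    };
  }

  public Serializer<MyRecord> getSerializer(Class<MyRecord> c) {
    return null; // omitted in this sketch
  }
}

// stand-in for the real (e.g. Thrift-generated) record type
class MyRecord { }
{code}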

Also, this thread may be related:
http://www.nabble.com/Serialization-with-additional-schema-info-td19260579.html

Tom

On Sat, Sep 13, 2008 at 12:38 AM, Pete Wyckoff <[EMAIL PROTECTED]> wrote:
>
> I should mention this is out of the context of SequenceFiles where we get
> the class names in the file itself. Here there is some information inserted
> into the JobConf that tells me the class of the records in the input file.
>
>
> -- pete
>
>
> On 9/12/08 3:26 PM, "Pete Wyckoff" <[EMAIL PROTECTED]> wrote:
>
>>
>> If I have a generic Serializer/Deserializers that take some runtime
>> information to instantiate, how would this work in the current
>> serializer/deserializer APIs? And depending on this runtime information, may
>> return different Objects although they may all derive from the same class.
>>
>> For example, for Thrift, I may have something called a ThriftSerializer that
>> is general:
>>
>> {code}
>> Public class ThriftDeserializer implements
>> Deserializer {
>>   T deserialize(T);
>> }
>> {code}
>>
>> How would I instantiate this, since the current getDeserializer takes only
>> the Class but not configuration object.
>> How would I implement createKey in RecordReader
>>
>>
>> In other words, I think we need a  {code}Class getClass();  {code} method
>> in Deserializer() and a {code}Deserializer getDeserializer(Class,
>> Configuration conf); {code} method in Serializer.java.
>>
>> Or is there another way to do this?
>>
>> IF not, I can open a JIRA for implementing parameterized serializers.
>>
>> Thanks, pete
>>
>>
>>
>
>