Re: Question about Hadoop filesystem
It's in the FAQ: http://wiki.apache.org/hadoop/FAQ#17

Brian

On Jun 4, 2009, at 6:26 PM, Harold Lim wrote:

> How do I remove a datanode? Do I simply "destroy" my datanode and the namenode will automatically detect it? Or is there a more elegant way to do it?
>
> Also, when I remove a datanode, does Hadoop automatically re-replicate the data right away?
>
> Thanks,
> Harold
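For reference, the FAQ entry describes graceful decommissioning rather than simply destroying the node. A minimal sketch follows; the exclude-file path and hostname are assumptions for illustration, while `hadoop dfsadmin -refreshNodes` is the standard command:

```shell
# Sketch of decommissioning a datanode; paths and hostnames are hypothetical.
# 1. In hadoop-site.xml, point the namenode at an exclude file:
#      <property>
#        <name>dfs.hosts.exclude</name>
#        <value>/usr/local/hadoop/conf/excludes</value>
#      </property>
# 2. List the datanode(s) to retire, one hostname per line:
echo "datanode3.example.com" > /tmp/excludes
# 3. Ask the namenode to re-read the file; it then re-replicates the
#    node's blocks and marks it "Decommissioned" when done:
#      hadoop dfsadmin -refreshNodes
cat /tmp/excludes
```

Once the node shows as decommissioned in the namenode web UI, it can be shut down without data loss; the re-replication happens before that point, not after the node disappears.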
Question about Hadoop filesystem
How do I remove a datanode? Do I simply "destroy" my datanode and the namenode will automatically detect it? Or is there a more elegant way to do it?

Also, when I remove a datanode, does Hadoop automatically re-replicate the data right away?

Thanks,
Harold
Re: question about hadoop and amazon ec2 ?
1. They are related in that EC2 can serve as the computation platform on which Hadoop runs. Refer: http://wiki.apache.org/hadoop/AmazonEC2
2. Yes. Refer: http://wiki.apache.org/hadoop/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)
3. Because EC2 can serve as the computation platform for Hadoop (see 1) - it is simply one place to rent the machines Hadoop runs on.

--nitesh

On Sun, Feb 15, 2009 at 2:18 PM, buddha1021 wrote:
> hi:
> What is the relationship between Hadoop and Amazon EC2?
> Can Hadoop run directly on an ordinary PC (not a server)?
> Why do some people say Hadoop runs on Amazon EC2?
> thanks!

--
Nitesh Bhatia
Dhirubhai Ambani Institute of Information & Communication Technology
Gandhinagar, Gujarat

"Life is never perfect. It just depends where you draw the line."

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
question about hadoop and amazon ec2 ?
hi:
What is the relationship between Hadoop and Amazon EC2?
Can Hadoop run directly on an ordinary PC (not a server)?
Why do some people say Hadoop runs on Amazon EC2?
thanks!
--
View this message in context: http://www.nabble.com/question-about-hadoop-and-amazon-ec2---tp22020652p22020652.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Question about Hadoop's Feature(s)
> However, HDFS uses HTTP to serve blocks up - that needs to be locked down too. Would the signing work there?

I am not familiar with HDFS over HTTP. Could it simply sign the stream and include the signature at the end of the HTTP message returned?

On Tue, Sep 30, 2008 at 8:56 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:
> Jason Rutherglen wrote:
>> I implemented an RMI protocol using Hadoop IPC and implemented basic HMAC signing. I believe it is faster than public/private-key signing because it uses a secret key and does not require key provisioning the way PKI would. Perhaps it would be a baseline way to sign the data.
>
> That should work for authenticating messages between (trusted) nodes. Presumably the ipc.key value could be set in the Conf and all would be well.
>
> External job submitters shouldn't be given those keys; they'd need an HTTP(S) front end that could authenticate them however the organisation worked.
>
> Yes, that would be simpler. I am not enough of a security expert to say if it will work, but the keys should be easier to work with. As long as the configuration files are kept secure, your cluster will be locked down.
>
> However, HDFS uses HTTP to serve blocks up - that needs to be locked down too. Would the signing work there?
>
> -steve
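Whatever the wire format ends up being, the secret-key scheme discussed in this thread is just an HMAC over the message bytes under a shared key. As a stand-alone illustration (computed here with openssl, not Hadoop's actual signing format), using the published HMAC-SHA1 test vector from RFC 2202:

```shell
# HMAC-SHA1 of a message under the shared secret "key". Both ends
# compute this tag from the bytes they saw and compare - no PKI
# provisioning required, only a shared secret in the configuration.
echo -n "The quick brown fox jumps over the lazy dog" \
  | openssl dgst -sha1 -hmac "key"
```

With the key "key", the tag is de7c9b85b8b78aa6bc8a7a36f70a90701c9db4d9 (the standard test vector). For a streamed HTTP response, the server could emit such a tag as a trailer after the body, as suggested above.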
Re: Question about Hadoop's Feature(s)
Jason Rutherglen wrote:
> I implemented an RMI protocol using Hadoop IPC and implemented basic HMAC signing. I believe it is faster than public/private-key signing because it uses a secret key and does not require key provisioning the way PKI would. Perhaps it would be a baseline way to sign the data.

That should work for authenticating messages between (trusted) nodes. Presumably the ipc.key value could be set in the Conf and all would be well.

External job submitters shouldn't be given those keys; they'd need an HTTP(S) front end that could authenticate them however the organisation worked.

Yes, that would be simpler. I am not enough of a security expert to say if it will work, but the keys should be easier to work with. As long as the configuration files are kept secure, your cluster will be locked down.

However, HDFS uses HTTP to serve blocks up - that needs to be locked down too. Would the signing work there?

-steve
Re: Question about Hadoop's Feature(s)
I implemented an RMI protocol using Hadoop IPC and implemented basic HMAC signing. I believe it is faster than public/private-key signing because it uses a secret key and does not require key provisioning the way PKI would. Perhaps it would be a baseline way to sign the data.

On Thu, Sep 25, 2008 at 7:47 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:
> Owen O'Malley wrote:
>> On Sep 24, 2008, at 1:50 AM, Trinh Tuan Cuong wrote:
>>> We are developing a project and we intend to use Hadoop to handle the processing of vast amounts of data. But to convince our customers about the use of Hadoop in our project, we must show them the advantages (and maybe the disadvantages) of deploying the project with Hadoop compared to an Oracle database platform.
>>
>> The primary advantage of Hadoop is scalability. On an equivalent hardware budget, Hadoop can handle much, much larger databases. We had a process that was run once a week on Oracle that is now run once an hour on Hadoop. Additionally, Hadoop scales out much, much farther. We can store petabytes of data in a single Hadoop cluster and have jobs that read and generate hundreds of terabytes.
>
> That said, what a database gives you - on the right hardware - is very fast responses, especially if the indices are set up right and the data denormalised when appropriate. There is also really good integration with tools and application servers, with things like Java EE designed to make running code against a database easy.
>
> Not using Oracle means you don't have to work with an Oracle DBA, which, in my experience, can only be a good thing. DBAs and developers never seem to see eye to eye.
>
>> Hadoop only has very primitive security at the moment, although I expect that to change in the next 6 months.
>
> Right now you need to trust everyone else on the network where you run hadoop not to be malicious; the filesystem and job tracker interfaces are insecure. The forthcoming 0.19 release will ask who you are, but the far end trusts you to be who you say you are. In that respect, it's as secure as NFS over UDP.
>
> To secure Hadoop you'd probably need to:
> - sign every IPC request, with a CPU time cost at both ends;
> - require some form of authentication for the HTTP-exported parts of the system, such as digest authentication, or issue lots of HTTPS private keys and use those instead - giving everyone a key management problem as well as extra communications overhead.
>
> What would be easier is to lock down remote access to filesystem/job submission so that only authenticated users are able to upload jobs and data. The cluster would continue to trust everything else on its network, but the system wouldn't trust people to submit work unless they could prove who they were.
Re: Question about Hadoop's Feature(s)
Owen O'Malley wrote:
> On Sep 24, 2008, at 1:50 AM, Trinh Tuan Cuong wrote:
>> We are developing a project and we intend to use Hadoop to handle the processing of vast amounts of data. But to convince our customers about the use of Hadoop in our project, we must show them the advantages (and maybe the disadvantages) of deploying the project with Hadoop compared to an Oracle database platform.
>
> The primary advantage of Hadoop is scalability. On an equivalent hardware budget, Hadoop can handle much, much larger databases. We had a process that was run once a week on Oracle that is now run once an hour on Hadoop. Additionally, Hadoop scales out much, much farther. We can store petabytes of data in a single Hadoop cluster and have jobs that read and generate hundreds of terabytes.

That said, what a database gives you - on the right hardware - is very fast responses, especially if the indices are set up right and the data denormalised when appropriate. There is also really good integration with tools and application servers, with things like Java EE designed to make running code against a database easy.

Not using Oracle means you don't have to work with an Oracle DBA, which, in my experience, can only be a good thing. DBAs and developers never seem to see eye to eye.

> Hadoop only has very primitive security at the moment, although I expect that to change in the next 6 months.

Right now you need to trust everyone else on the network where you run hadoop not to be malicious; the filesystem and job tracker interfaces are insecure. The forthcoming 0.19 release will ask who you are, but the far end trusts you to be who you say you are. In that respect, it's as secure as NFS over UDP.

To secure Hadoop you'd probably need to:
- sign every IPC request, with a CPU time cost at both ends;
- require some form of authentication for the HTTP-exported parts of the system, such as digest authentication, or issue lots of HTTPS private keys and use those instead - giving everyone a key management problem as well as extra communications overhead.

What would be easier is to lock down remote access to filesystem/job submission so that only authenticated users are able to upload jobs and data. The cluster would continue to trust everything else on its network, but the system wouldn't trust people to submit work unless they could prove who they were.
Re: Question about Hadoop's Feature(s)
One of the major advantages of Hadoop over Oracle: it saves you a lot of $$$.

2008/9/25 Trinh Tuan Cuong <[EMAIL PROTECTED]>:
> Dear Mr. Owen O'Malley,
>
> First, I would like to thank you very much for your reply; it was exactly the answer I expected. As I read about the query languages for Hadoop - a combination of Pig (Pig Latin), Hive, HBase, Jaql, and more - I could see that Hadoop has the advantage of an SQL-like query language. The thing I was most curious about is Hadoop's security level, which is hard to find in any of the documents I searched. Like your organization and many others, we believe in the fast growth of Hadoop and intend to use it in our serious projects. Once again, thanks for the reply; now I can tell our clients clearly about Hadoop.
>
> Best Regards.
>
> Tuan Cuong, Trinh.
> [EMAIL PROTECTED]
> Luvina Software Company.
> Website: www.luvina.net
>
> -Original Message-
> From: Owen O'Malley [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, September 24, 2008 11:27 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Question about Hadoop's Feature(s)
>
> On Sep 24, 2008, at 1:50 AM, Trinh Tuan Cuong wrote:
>> We are developing a project and we intend to use Hadoop to handle the processing of vast amounts of data. But to convince our customers about the use of Hadoop in our project, we must show them the advantages (and maybe the disadvantages) of deploying the project with Hadoop compared to an Oracle database platform.
>
> The primary advantage of Hadoop is scalability. On an equivalent hardware budget, Hadoop can handle much, much larger databases. We had a process that was run once a week on Oracle that is now run once an hour on Hadoop. Additionally, Hadoop scales out much, much farther. We can store petabytes of data in a single Hadoop cluster and have jobs that read and generate hundreds of terabytes.
>
> The disadvantage of Hadoop is that it is still relatively young and growing fast, so there are growing pains. Hadoop has recently gotten higher-level, SQL-like query languages (Pig, Hive, and Jaql), but still doesn't have any fancy report generators. Hadoop only has very primitive security at the moment, although I expect that to change in the next 6 months.
>
> -- Owen
RE: Question about Hadoop's Feature(s)
Dear Mr. Owen O'Malley,

First, I would like to thank you very much for your reply; it was exactly the answer I expected. As I read about the query languages for Hadoop - a combination of Pig (Pig Latin), Hive, HBase, Jaql, and more - I could see that Hadoop has the advantage of an SQL-like query language. The thing I was most curious about is Hadoop's security level, which is hard to find in any of the documents I searched. Like your organization and many others, we believe in the fast growth of Hadoop and intend to use it in our serious projects. Once again, thanks for the reply; now I can tell our clients clearly about Hadoop.

Best Regards.

Tuan Cuong, Trinh.
[EMAIL PROTECTED]
Luvina Software Company.
Website: www.luvina.net

-Original Message-
From: Owen O'Malley [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 24, 2008 11:27 PM
To: core-user@hadoop.apache.org
Subject: Re: Question about Hadoop's Feature(s)

On Sep 24, 2008, at 1:50 AM, Trinh Tuan Cuong wrote:
> We are developing a project and we intend to use Hadoop to handle the processing of vast amounts of data. But to convince our customers about the use of Hadoop in our project, we must show them the advantages (and maybe the disadvantages) of deploying the project with Hadoop compared to an Oracle database platform.

The primary advantage of Hadoop is scalability. On an equivalent hardware budget, Hadoop can handle much, much larger databases. We had a process that was run once a week on Oracle that is now run once an hour on Hadoop. Additionally, Hadoop scales out much, much farther. We can store petabytes of data in a single Hadoop cluster and have jobs that read and generate hundreds of terabytes.

The disadvantage of Hadoop is that it is still relatively young and growing fast, so there are growing pains. Hadoop has recently gotten higher-level, SQL-like query languages (Pig, Hive, and Jaql), but still doesn't have any fancy report generators. Hadoop only has very primitive security at the moment, although I expect that to change in the next 6 months.

-- Owen
Re: Question about Hadoop's Feature(s)
On Sep 24, 2008, at 1:50 AM, Trinh Tuan Cuong wrote:
> We are developing a project and we intend to use Hadoop to handle the processing of vast amounts of data. But to convince our customers about the use of Hadoop in our project, we must show them the advantages (and maybe the disadvantages) of deploying the project with Hadoop compared to an Oracle database platform.

The primary advantage of Hadoop is scalability. On an equivalent hardware budget, Hadoop can handle much, much larger databases. We had a process that was run once a week on Oracle that is now run once an hour on Hadoop. Additionally, Hadoop scales out much, much farther. We can store petabytes of data in a single Hadoop cluster and have jobs that read and generate hundreds of terabytes.

The disadvantage of Hadoop is that it is still relatively young and growing fast, so there are growing pains. Hadoop has recently gotten higher-level, SQL-like query languages (Pig, Hive, and Jaql), but still doesn't have any fancy report generators. Hadoop only has very primitive security at the moment, although I expect that to change in the next 6 months.

-- Owen
Question about Hadoop's Feature(s)
Hi,

We are developing a project and we intend to use Hadoop to handle the processing of vast amounts of data. But to convince our customers about the use of Hadoop in our project, we must show them the advantages (and maybe the disadvantages) of deploying the project with Hadoop compared to an Oracle database platform.

So I would like to have a full feature list of Hadoop, describing what features are integrated in the latest version of Hadoop (0.18.1) - especially features related to manipulating databases and facilitating their use, and perhaps some features about security and platform support that are not related to database manipulation.

P.S.: I did Google and Yahoo the feature list of Hadoop for several days, but found no clues - only small feature lists for HDFS, MapReduce, or Hadoop on Demand standalone. What I really want is the complete feature list of the latest version. Thanks in advance for any help or links.

Best Regards,

Trịnh Tuấn Cường
Luvina Software Company
Website: www.luvina.net
Address: 1001 Hoang Quoc Viet Street
Email: [EMAIL PROTECTED], [EMAIL PROTECTED]
Mobile: 097 4574 457
Re: Question about Hadoop
Thank you very much for explaining it to me, Ted. That's a great deal of info! I guess that could be how the Yahoo! WebMap is designed. And for anyone trying to grasp the massiveness of Hadoop computing, http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/ should give a good picture of a practical case. I was for a moment flabbergasted, and instantly fell in love with Hadoop! ;)

On Sat, Jun 14, 2008 at 12:11 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> Usually hadoop programs are not used interactively, since what they excel at is batch operations on very large collections of data.
>
> It is quite reasonable to store resulting data in hadoop and access those results using hadoop. The cleanest way to do that is to have a presentation-layer web server that has all of the UI on it and use HTTP to access the results file from hadoop via the namenode's data access URL. This works well where the results are not particularly voluminous.
>
> For large quantities of data such as the output of a web crawl, it is usually better to copy the output out of hadoop and into a clustered system that supports high-speed querying of the data. This clustered system might be as simple as a redundant memcache or MySQL farm, or as fancy as a sharded and replicated farm of text retrieval engines running under Solr. What works for you will vary by what you need to do.
>
> You should keep in mind that hadoop was designed for a very long MTBF (for a cluster), but not designed for zero-downtime operation. At the very least, you will occasionally want to upgrade the cluster software, and that currently can't be done during normal operations. Combining hadoop (for heavy-duty computations) with a separate persistence layer (for a high-availability web service) is a good hybrid.
>
> On Thu, Jun 12, 2008 at 9:53 PM, Chanchal James <[EMAIL PROTECTED]> wrote:
>> Thank you all for the responses.
>>
>> So in order to run a web-based application, I just need to put the part of the application that needs to make use of distributed computation in HDFS, and have the other web site related files access it via Hadoop streaming?
>>
>> Is that how Hadoop is used?
>>
>> Sorry if the question sounds too silly.
>>
>> Thank you.
>
> --
> ted
Re: Question about Hadoop
Usually hadoop programs are not used interactively, since what they excel at is batch operations on very large collections of data.

It is quite reasonable to store resulting data in hadoop and access those results using hadoop. The cleanest way to do that is to have a presentation-layer web server that has all of the UI on it and use HTTP to access the results file from hadoop via the namenode's data access URL. This works well where the results are not particularly voluminous.

For large quantities of data such as the output of a web crawl, it is usually better to copy the output out of hadoop and into a clustered system that supports high-speed querying of the data. This clustered system might be as simple as a redundant memcache or MySQL farm, or as fancy as a sharded and replicated farm of text retrieval engines running under Solr. What works for you will vary by what you need to do.

You should keep in mind that hadoop was designed for a very long MTBF (for a cluster), but not designed for zero-downtime operation. At the very least, you will occasionally want to upgrade the cluster software, and that currently can't be done during normal operations. Combining hadoop (for heavy-duty computations) with a separate persistence layer (for a high-availability web service) is a good hybrid.

On Thu, Jun 12, 2008 at 9:53 PM, Chanchal James <[EMAIL PROTECTED]> wrote:
> Thank you all for the responses.
>
> So in order to run a web-based application, I just need to put the part of the application that needs to make use of distributed computation in HDFS, and have the other web site related files access it via Hadoop streaming?
>
> Is that how Hadoop is used?
>
> Sorry if the question sounds too silly.
>
> Thank you.
>
> On Thu, Jun 12, 2008 at 7:49 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>> Once it is in HDFS, you already have backups (due to the replicated file system).
>>
>> Your problems with deleting the dfs data directory are likely configuration problems combined with versioning of the data store (done to avoid confusion, but usually causing confusion). Once you get the configuration and operational issues sorted out, you shouldn't lose any data.
>>
>> On Thu, Jun 12, 2008 at 10:15 AM, Chanchal James <[EMAIL PROTECTED]> wrote:
>>> If I keep all data in HDFS, is there any way I can back it up regularly?

-- ted
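The "namenode's data access URL" mentioned in this thread amounts to a plain HTTP fetch from the presentation-layer server, with no Hadoop client libraries involved. A hedged sketch - host, port, and path below are hypothetical, and 50070 was the default namenode HTTP port in this era:

```shell
# A presentation-layer web server fetching a results file out of HDFS
# over HTTP via the namenode's data-access URL (all names hypothetical).
URL="http://namenode.example.com:50070/data/user/hadoop/results/part-00000"
echo "would fetch: $URL"
# e.g.  curl -s "$URL" > part-00000   (requires a live cluster)
```

As Ted notes, this only suits results that are not voluminous; bulk output should be copied out into a serving system instead.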
Re: Question about Hadoop
Thank you all for the responses.

So in order to run a web-based application, I just need to put the part of the application that needs to make use of distributed computation in HDFS, and have the other web site related files access it via Hadoop streaming?

Is that how Hadoop is used?

Sorry if the question sounds too silly.

Thank you.

On Thu, Jun 12, 2008 at 7:49 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> Once it is in HDFS, you already have backups (due to the replicated file system).
>
> Your problems with deleting the dfs data directory are likely configuration problems combined with versioning of the data store (done to avoid confusion, but usually causing confusion). Once you get the configuration and operational issues sorted out, you shouldn't lose any data.
>
> On Thu, Jun 12, 2008 at 10:15 AM, Chanchal James <[EMAIL PROTECTED]> wrote:
>> If I keep all data in HDFS, is there any way I can back it up regularly?
Re: Question about Hadoop
Once it is in HDFS, you already have backups (due to the replicated file system).

Your problems with deleting the dfs data directory are likely configuration problems combined with versioning of the data store (done to avoid confusion, but usually causing confusion). Once you get the configuration and operational issues sorted out, you shouldn't lose any data.

On Thu, Jun 12, 2008 at 10:15 AM, Chanchal James <[EMAIL PROTECTED]> wrote:
> If I keep all data in HDFS, is there any way I can back it up regularly?
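Replication guards against disk and node failure, but not against operator error like deleting the datastore directory, so for a true backup a periodic copy off the cluster is still worth having. A hedged sketch - the cluster addresses and paths are hypothetical, while `hadoop distcp` and `hadoop fs -copyToLocal` are the standard tools:

```shell
# Sketch of a periodic HDFS backup, e.g. run from cron. The hadoop
# commands are commented out because they require a live cluster;
# cluster URLs and paths are hypothetical.
SNAP="data-$(date +%Y%m%d)"   # dated name for this backup snapshot
# Copy to a second cluster:
#   hadoop distcp hdfs://nn1:9000/data hdfs://backup-nn:9000/backups/$SNAP
# Or pull a copy down to local/archival storage:
#   hadoop fs -copyToLocal /data /backups/$SNAP
echo "$SNAP"
```

distcp runs the copy as a map/reduce job, so it scales with the cluster rather than funneling everything through one machine.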
RE: Question about Hadoop
Looks good to me...

-Original Message-
From: Chanchal James [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 12, 2008 11:22 AM
To: core-user@hadoop.apache.org
Subject: Re: Question about Hadoop

Haijun, I have most of the settings at their defaults, but not the tmp dir. I have the tmp dir set to "/usr/local/hadoop/hadoop-datastore/hadoop-${user.name}". Is this a good location?

On Thu, Jun 12, 2008 at 12:59 PM, Haijun Cao <[EMAIL PROTECTED]> wrote:
> "While testing I had to delete the temporary "datastore" folder and reformat the file system a couple of times."
>
> Is it because you leave hadoop.tmp.dir and the other *.dir parameters at their defaults? Try to set hadoop.tmp.dir to a dir not under /tmp.
>
>   <property>
>     <name>hadoop.tmp.dir</name>
>     <value>/tmp/hadoop-${user.name}</value>
>     <description>A base for other temporary directories.</description>
>   </property>
>
> dfs.name.dir is by default under ${hadoop.tmp.dir}/dfs/name:
>
>   <property>
>     <name>dfs.name.dir</name>
>     <value>${hadoop.tmp.dir}/dfs/name</value>
>   </property>
>
> Haijun
>
> -Original Message-
> From: Chanchal James [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, June 12, 2008 10:16 AM
> To: core-user@hadoop.apache.org
> Subject: Re: Question about Hadoop
>
> Thanks Lohit for the info. I have one more question. If I keep all data in HDFS, is there any way I can back it up regularly? While testing I had to delete the temporary "datastore" folder and reformat the file system a couple of times. So while using Hadoop in a real environment, what are the chances of such uncorrectable software-side problems occurring? Can we correct them without a reformat? I cannot afford to lose the data I plan to put in HDFS.
>
> Thank you.
>
> On Thu, Jun 12, 2008 at 12:02 PM, lohit <[EMAIL PROTECTED]> wrote:
>> Ideally what you would want is your data to be on HDFS, and to run your map/reduce jobs on that data. The Hadoop framework splits your data and feeds those splits to each map or reduce task. One problem with image files is that you will not be able to split them. Alternatively, people have done this: they wrap image files within XML and create huge files which have multiple image files in them. Hadoop offers something called streaming, with which you will be able to split the files at an XML boundary and feed them to your map/reduce tasks. Streaming also enables you to use any code, like perl/php/c++.
>>
>> Check info about streaming here: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html
>> And information about parsing XML files in streaming here: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F
>>
>> Thanks,
>> Lohit
>>
>> - Original Message
>> From: Chanchal James <[EMAIL PROTECTED]>
>> To: core-user@hadoop.apache.org
>> Sent: Thursday, June 12, 2008 9:42:46 AM
>> Subject: Question about Hadoop
>>
>> Hi,
>>
>> I have a question about Hadoop. I am a beginner and just testing Hadoop. I would like to know how a PHP application would benefit from this - say, an application that needs to work on a large number of image files. Do I have to store the application in HDFS always, or do I just copy it to HDFS when needed, do the processing, and then copy it back to the local file system? Is that the case with the data files too? Once I have Hadoop running, do I keep all data & application files in HDFS always, and not use local file system storage?
>>
>> Thank you.
Re: Question about Hadoop
Haijun, I have most of the settings at their defaults, but not the tmp dir. I have the tmp dir set to "/usr/local/hadoop/hadoop-datastore/hadoop-${user.name}". Is this a good location?

On Thu, Jun 12, 2008 at 12:59 PM, Haijun Cao <[EMAIL PROTECTED]> wrote:
> "While testing I had to delete the temporary "datastore" folder and reformat the file system a couple of times."
>
> Is it because you leave hadoop.tmp.dir and the other *.dir parameters at their defaults? Try to set hadoop.tmp.dir to a dir not under /tmp.
>
>   <property>
>     <name>hadoop.tmp.dir</name>
>     <value>/tmp/hadoop-${user.name}</value>
>     <description>A base for other temporary directories.</description>
>   </property>
>
> dfs.name.dir is by default under ${hadoop.tmp.dir}/dfs/name:
>
>   <property>
>     <name>dfs.name.dir</name>
>     <value>${hadoop.tmp.dir}/dfs/name</value>
>   </property>
>
> Haijun
>
> -Original Message-
> From: Chanchal James [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, June 12, 2008 10:16 AM
> To: core-user@hadoop.apache.org
> Subject: Re: Question about Hadoop
>
> Thanks Lohit for the info. I have one more question. If I keep all data in HDFS, is there any way I can back it up regularly? While testing I had to delete the temporary "datastore" folder and reformat the file system a couple of times. So while using Hadoop in a real environment, what are the chances of such uncorrectable software-side problems occurring? Can we correct them without a reformat? I cannot afford to lose the data I plan to put in HDFS.
>
> Thank you.
>
> On Thu, Jun 12, 2008 at 12:02 PM, lohit <[EMAIL PROTECTED]> wrote:
>> Ideally what you would want is your data to be on HDFS, and to run your map/reduce jobs on that data. The Hadoop framework splits your data and feeds those splits to each map or reduce task. One problem with image files is that you will not be able to split them. Alternatively, people have done this: they wrap image files within XML and create huge files which have multiple image files in them. Hadoop offers something called streaming, with which you will be able to split the files at an XML boundary and feed them to your map/reduce tasks. Streaming also enables you to use any code, like perl/php/c++.
>>
>> Check info about streaming here: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html
>> And information about parsing XML files in streaming here: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F
>>
>> Thanks,
>> Lohit
>>
>> - Original Message
>> From: Chanchal James <[EMAIL PROTECTED]>
>> To: core-user@hadoop.apache.org
>> Sent: Thursday, June 12, 2008 9:42:46 AM
>> Subject: Question about Hadoop
>>
>> Hi,
>>
>> I have a question about Hadoop. I am a beginner and just testing Hadoop. I would like to know how a PHP application would benefit from this - say, an application that needs to work on a large number of image files. Do I have to store the application in HDFS always, or do I just copy it to HDFS when needed, do the processing, and then copy it back to the local file system? Is that the case with the data files too? Once I have Hadoop running, do I keep all data & application files in HDFS always, and not use local file system storage?
>>
>> Thank you.
RE: Question about Hadoop
"While testing I had to delete the temporary "datastore" folder and reformat the file system a couple of times."

Is it because you leave hadoop.tmp.dir and the other *.dir parameters at their defaults? Try to set hadoop.tmp.dir to a dir not under /tmp.

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>

dfs.name.dir is by default under ${hadoop.tmp.dir}/dfs/name:

  <property>
    <name>dfs.name.dir</name>
    <value>${hadoop.tmp.dir}/dfs/name</value>
  </property>

Haijun

-Original Message-
From: Chanchal James [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 12, 2008 10:16 AM
To: core-user@hadoop.apache.org
Subject: Re: Question about Hadoop

Thanks Lohit for the info. I have one more question. If I keep all data in HDFS, is there any way I can back it up regularly? While testing I had to delete the temporary "datastore" folder and reformat the file system a couple of times. So while using Hadoop in a real environment, what are the chances of such uncorrectable software-side problems occurring? Can we correct them without a reformat? I cannot afford to lose the data I plan to put in HDFS.

Thank you.

On Thu, Jun 12, 2008 at 12:02 PM, lohit <[EMAIL PROTECTED]> wrote:
> Ideally what you would want is your data to be on HDFS, and to run your map/reduce jobs on that data. The Hadoop framework splits your data and feeds those splits to each map or reduce task. One problem with image files is that you will not be able to split them. Alternatively, people have done this: they wrap image files within XML and create huge files which have multiple image files in them. Hadoop offers something called streaming, with which you will be able to split the files at an XML boundary and feed them to your map/reduce tasks. Streaming also enables you to use any code, like perl/php/c++.
>
> Check info about streaming here: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html
> And information about parsing XML files in streaming here: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F
>
> Thanks,
> Lohit
>
> - Original Message
> From: Chanchal James <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Thursday, June 12, 2008 9:42:46 AM
> Subject: Question about Hadoop
>
> Hi,
>
> I have a question about Hadoop. I am a beginner and just testing Hadoop. I would like to know how a PHP application would benefit from this - say, an application that needs to work on a large number of image files. Do I have to store the application in HDFS always, or do I just copy it to HDFS when needed, do the processing, and then copy it back to the local file system? Is that the case with the data files too? Once I have Hadoop running, do I keep all data & application files in HDFS always, and not use local file system storage?
>
> Thank you.
Re: Question about Hadoop
Thanks Lohit for the info. I have one more question. If I keep all data in HDFS, is there any way I can back it up regularly? While testing I had to delete the temporary "datastore" folder and reformat the file system a couple of times. So while using Hadoop in a real environment, what are the chances of such software-side uncorrectable problems occurring? Can we correct them without a reformat? I cannot afford to lose the data I plan to put in HDFS. Thank you.

On Thu, Jun 12, 2008 at 12:02 PM, lohit <[EMAIL PROTECTED]> wrote:
> Ideally what you would want is your data to be on HDFS, and to run your
> map/reduce jobs on that data. The Hadoop framework splits your data and feeds
> those splits to each map or reduce task. One problem with image files is
> that you will not be able to split them. Alternatively, people have done
> this: they wrap image files within XML and create huge files which contain
> multiple image files. Hadoop offers something called streaming, using
> which you will be able to split the files at the XML boundary and feed them
> to your map/reduce tasks. Streaming also enables you to use any code, like
> perl/php/c++.
> Check info about streaming here:
> http://hadoop.apache.org/core/docs/r0.17.0/streaming.html
> And information about parsing XML files in streaming here:
> http://hadoop.apache.org/core/docs/r0.17.0/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F
>
> Thanks,
> Lohit
>
> ----- Original Message -----
> From: Chanchal James <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Thursday, June 12, 2008 9:42:46 AM
> Subject: Question about Hadoop
>
> Hi,
>
> I have a question about Hadoop. I am a beginner and just testing Hadoop.
> Would like to know how a PHP application would benefit from this, say an
> application that needs to work on a large number of image files. Do I have to
> store the application in HDFS always, or do I just copy it to HDFS when
> needed, do the processing, and then copy it back to the local file system?
> Is that the case with the data files too? Once I have Hadoop running, do I
> keep all data & application files in HDFS always, and not use local file
> system storage?
>
> Thank you.
Re: Question about Hadoop
Ideally what you would want is your data to be on HDFS, and to run your map/reduce jobs on that data. The Hadoop framework splits your data and feeds those splits to each map or reduce task. One problem with image files is that you will not be able to split them. Alternatively, people have done this: they wrap image files within XML and create huge files which contain multiple image files. Hadoop offers something called streaming, using which you will be able to split the files at the XML boundary and feed them to your map/reduce tasks. Streaming also enables you to use any code, like perl/php/c++.

Check info about streaming here:
http://hadoop.apache.org/core/docs/r0.17.0/streaming.html
And information about parsing XML files in streaming here:
http://hadoop.apache.org/core/docs/r0.17.0/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F

Thanks,
Lohit

----- Original Message -----
From: Chanchal James <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Thursday, June 12, 2008 9:42:46 AM
Subject: Question about Hadoop

Hi,

I have a question about Hadoop. I am a beginner and just testing Hadoop. Would like to know how a PHP application would benefit from this, say an application that needs to work on a large number of image files. Do I have to store the application in HDFS always, or do I just copy it to HDFS when needed, do the processing, and then copy it back to the local file system? Is that the case with the data files too? Once I have Hadoop running, do I keep all data & application files in HDFS always, and not use local file system storage?

Thank you.
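A minimal sketch of lohit's wrap-images-in-XML suggestion: with streaming's StreamXmlRecordReader each map task receives whole XML-wrapped records on stdin, and the mapper can be a plain script. The tag names (<image>, <name>, <data>) and the size-counting logic below are illustrative assumptions, not from the thread:

```python
#!/usr/bin/env python
# Illustrative streaming mapper. Each stdin record is one
# <image>...</image> element, as StreamXmlRecordReader would deliver
# with begin=<image>, end=</image>. Emits "name<TAB>payload_size".
import re
import sys

def map_records(lines):
    """Yield (name, payload_size) for each XML-wrapped image record."""
    text = "".join(lines)
    for m in re.finditer(r"<image>(.*?)</image>", text, re.DOTALL):
        record = m.group(1)
        name_m = re.search(r"<name>(.*?)</name>", record)
        data_m = re.search(r"<data>(.*?)</data>", record, re.DOTALL)
        if name_m and data_m:
            yield name_m.group(1), len(data_m.group(1))

if __name__ == "__main__":
    for name, size in map_records(sys.stdin):
        print("%s\t%d" % (name, size))
```

Such a mapper would be wired in with the streaming jar's -inputreader option (roughly `-inputreader "StreamXmlRecordReader,begin=<image>,end=</image>" -mapper mapper.py`); see the streaming documentation linked above for the exact invocation on your version.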
Question about Hadoop
Hi,

I have a question about Hadoop. I am a beginner and just testing Hadoop. Would like to know how a PHP application would benefit from this, say an application that needs to work on a large number of image files. Do I have to store the application in HDFS always, or do I just copy it to HDFS when needed, do the processing, and then copy it back to the local file system? Is that the case with the data files too? Once I have Hadoop running, do I keep all data & application files in HDFS always, and not use local file system storage?

Thank you.
Re: question about hadoop 0.17 upgrade
wrote:
> Upgrading 0.16.3 to 0.17, an error appears when starting dfs and the jobtracker. How can I fix it? Thanks!
> I have used the "start-dfs.sh -upgrade" command to upgrade the filesystem. Below is the error log:

2008-05-26 09:14:33,463 INFO org.apache.hadoop.mapred.JobTracker: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting JobTracker
STARTUP_MSG:   host = test180.sqa/192.168.207.180
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.17.0
STARTUP_MSG:   build = http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 656523; compiled by 'hadoopqa' on Thu May 15 07:22:55 UTC 2008
************************************************************/
2008-05-26 09:14:33,567 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=JobTracker, port=9001
2008-05-26 09:14:33,610 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2008-05-26 09:14:33,611 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 9001: starting
2008-05-26 09:14:33,611 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 9001: starting
2008-05-26 09:14:33,611 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 9001: starting
2008-05-26 09:14:33,613 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 9001: starting
2008-05-26 09:14:33,613 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 9001: starting
2008-05-26 09:14:33,664 INFO org.mortbay.util.Credential: Checking Resource aliases
2008-05-26 09:14:33,733 INFO
org.mortbay.http.HttpServer: Version Jetty/5.1.4
2008-05-26 09:14:33,734 INFO org.mortbay.util.Container: Started HttpContext[/static,/static]
2008-05-26 09:14:33,734 INFO org.mortbay.util.Container: Started HttpContext[/logs,/logs]
2008-05-26 09:14:33,962 INFO org.mortbay.util.Container: Started [EMAIL PROTECTED]
2008-05-26 09:14:33,998 INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
2008-05-26 09:14:34,000 INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50030
2008-05-26 09:14:34,000 INFO org.mortbay.util.Container: Started [EMAIL PROTECTED]
2008-05-26 09:14:34,002 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
2008-05-26 09:14:34,003 INFO org.apache.hadoop.mapred.JobTracker: JobTracker up at: 9001
2008-05-26 09:14:34,003 INFO org.apache.hadoop.mapred.JobTracker: JobTracker webserver: 50030
2008-05-26 09:14:34,096 INFO org.apache.hadoop.mapred.JobTracker: problem cleaning system directory: /home/hadoop/HadoopInstall/tmp/mapred/system
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.SafeModeException: Cannot delete /home/hadoop/HadoopInstall/tmp/mapred/system. Name node is in safe mode.
The ratio of reported blocks 0. has not reached the threshold 0.9990. Safe mode will be turned off automatically.
        at org.apache.hadoop.dfs.FSNamesystem.deleteInternal(FSNamesystem.java:1519)
        at org.apache.hadoop.dfs.FSNamesystem.delete(FSNamesystem.java:1498)
        at org.apache.hadoop.dfs.NameNode.delete(NameNode.java:383)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
        at org.apache.hadoop.ipc.Client.call(Client.java:557)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
        at org.apache.hadoop.dfs.$Proxy4.delete(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at org.apache.hadoop.dfs.$Proxy4.delete(Unknown Source)
        at org.apache.hadoop.dfs.DFSClient.delete(DFSClient.java:535)
        at org.apache.hadoop.dfs.DistributedFileSyste
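The SafeModeException in the log is expected right after an upgrade: the NameNode stays in safe mode until the reported-block ratio reaches the 0.9990 threshold, and the JobTracker cannot clean its system directory until then. A sketch of how this can be inspected from the shell, using the dfsadmin commands of that Hadoop generation (run as the HDFS user; the finalize step is only for after you are sure you will not roll back):

```shell
# Check whether the NameNode is still in safe mode
hadoop dfsadmin -safemode get

# Block until the NameNode leaves safe mode on its own
hadoop dfsadmin -safemode wait

# Once the upgraded filesystem is verified, finalize the upgrade
hadoop dfsadmin -finalizeUpgrade

# Safe mode can also be left manually, but only if you are certain
# the datanodes simply have not reported in yet
hadoop dfsadmin -safemode leave
```

If the ratio stays at 0 indefinitely, the usual cause is that the datanodes never started or cannot reach the NameNode, so their logs are the next place to look.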
question about hadoop 0.17 upgrade
Upgrading 0.16.3 to 0.17, an error appears when starting dfs and the jobtracker. How can I fix it? Thanks!
I have used the "start-dfs.sh -upgrade" command to upgrade the filesystem. Below is the error log:

2008-05-26 09:14:33,463 INFO org.apache.hadoop.mapred.JobTracker: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting JobTracker
STARTUP_MSG:   host = test180.sqa/192.168.207.180
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.17.0
STARTUP_MSG:   build = http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 656523; compiled by 'hadoopqa' on Thu May 15 07:22:55 UTC 2008
************************************************************/
2008-05-26 09:14:33,567 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=JobTracker, port=9001
2008-05-26 09:14:33,610 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2008-05-26 09:14:33,611 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 9001: starting
2008-05-26 09:14:33,611 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 9001: starting
2008-05-26 09:14:33,611 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 9001: starting
2008-05-26 09:14:33,613 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 9001: starting
2008-05-26 09:14:33,613 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 9001: starting
2008-05-26 09:14:33,664 INFO org.mortbay.util.Credential: Checking Resource aliases
2008-05-26 09:14:33,733 INFO
org.mortbay.http.HttpServer: Version Jetty/5.1.4
2008-05-26 09:14:33,734 INFO org.mortbay.util.Container: Started HttpContext[/static,/static]
2008-05-26 09:14:33,734 INFO org.mortbay.util.Container: Started HttpContext[/logs,/logs]
2008-05-26 09:14:33,962 INFO org.mortbay.util.Container: Started [EMAIL PROTECTED]
2008-05-26 09:14:33,998 INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
2008-05-26 09:14:34,000 INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50030
2008-05-26 09:14:34,000 INFO org.mortbay.util.Container: Started [EMAIL PROTECTED]
2008-05-26 09:14:34,002 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
2008-05-26 09:14:34,003 INFO org.apache.hadoop.mapred.JobTracker: JobTracker up at: 9001
2008-05-26 09:14:34,003 INFO org.apache.hadoop.mapred.JobTracker: JobTracker webserver: 50030
2008-05-26 09:14:34,096 INFO org.apache.hadoop.mapred.JobTracker: problem cleaning system directory: /home/hadoop/HadoopInstall/tmp/mapred/system
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.SafeModeException: Cannot delete /home/hadoop/HadoopInstall/tmp/mapred/system. Name node is in safe mode.
The ratio of reported blocks 0. has not reached the threshold 0.9990. Safe mode will be turned off automatically.
        at org.apache.hadoop.dfs.FSNamesystem.deleteInternal(FSNamesystem.java:1519)
        at org.apache.hadoop.dfs.FSNamesystem.delete(FSNamesystem.java:1498)
        at org.apache.hadoop.dfs.NameNode.delete(NameNode.java:383)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
        at org.apache.hadoop.ipc.Client.call(Client.java:557)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
        at org.apache.hadoop.dfs.$Proxy4.delete(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at org.apache.hadoop.dfs.$Proxy4.delete(Unknown Source)
        at o
Re: One Simple Question About Hadoop DFS
On Sun, 23 Mar 2008, Chaman Singh Verma wrote:
> Hello,
>
> I am exploring Hadoop and MapReduce and I have one very simple question.
>
> I have a 500GB dataset on my local disk and I have written both Map-Reduce
> functions. Now how should I start?
>
> 1. I copy the data from the local disk to DFS. I have configured DFS with 100
> machines. I hope that it will split the file across the 100 nodes (with some
> replication).
>
Yes. You need to copy the data from your local disk to the DFS. It will split the files based on the dfs block size (dfs.block.size). The default block size is 64MB, and hence there would be 8000 blocks.
> 2. For MapReduce should I specify 100 nodes for SetMaxMapTask()? If I specify
> less than 100, will the blocks migrate? If the blocks don't migrate, then
> why is this function provided to the users? Why is the number of tasks not
> taken from the startup script?
>
Again, here the max number of maps is bounded by the dfs block size. Hence in the default case you would have 8000 maps (unless you have your own input format).
> 3. If I specify more than 100, will load balancing be done automatically,
> or does the user have to specify that also?
>
In short, it's the dfs block size along with the input format that controls the number of maps. The number of maps given to the framework is used as a hint. Sometimes it doesn't matter what value is passed.

Amar
> Perhaps these are very simple questions, but I think that MapReduce
> simplifies lots of things (compared to MPI-based programming), so
> beginners like me have a difficult time understanding the model.
>
> csv
>
> ---------------------------------
> Never miss a thing. Make Yahoo your homepage.
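Amar's 8000-block figure follows directly from the default block size; a quick sanity check of the arithmetic (binary units assumed, i.e. 1 GB = 1024 MB, as HDFS block sizes use):

```python
# Number of HDFS blocks needed for a 500 GB file at the default
# 64 MB block size: ceiling of file size over block size.
import math

def num_blocks(file_size_bytes, block_size_bytes):
    """Blocks needed to store a file of the given size."""
    return math.ceil(file_size_bytes / block_size_bytes)

GB = 1024 ** 3
MB = 1024 ** 2

blocks = num_blocks(500 * GB, 64 * MB)
print(blocks)  # -> 8000, matching the estimate in the reply
```

With default map scheduling this is also the (approximate) number of map tasks, which is why the reply says the value passed to SetMaxMapTask() is only a hint.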
One Simple Question About Hadoop DFS
Hello,

I am exploring Hadoop and MapReduce and I have one very simple question.

I have a 500GB dataset on my local disk and I have written both Map-Reduce functions. Now how should I start?

1. I copy the data from the local disk to DFS. I have configured DFS with 100 machines. I hope that it will split the file across the 100 nodes (with some replication).

2. For MapReduce should I specify 100 nodes for SetMaxMapTask()? If I specify less than 100, will the blocks migrate? If the blocks don't migrate, then why is this function provided to the users? Why is the number of tasks not taken from the startup script?

3. If I specify more than 100, will load balancing be done automatically, or does the user have to specify that also?

Perhaps these are very simple questions, but I think that MapReduce simplifies lots of things (compared to MPI-based programming), so beginners like me have a difficult time understanding the model.

csv

---------------------------------
Never miss a thing. Make Yahoo your homepage.