Re: Question about Hadoop filesystem
It's in the FAQ: http://wiki.apache.org/hadoop/FAQ#17

Brian

On Jun 4, 2009, at 6:26 PM, Harold Lim wrote:

> How do I remove a datanode? Do I simply "destroy" my datanode and the namenode will automatically detect it? Or is there a more elegant way to do it?
>
> Also, when I remove a datanode, does Hadoop automatically re-replicate the data right away?
>
> Thanks,
> Harold
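For reference, the FAQ entry describes graceful decommissioning rather than simply destroying the node. A minimal sketch follows; the exclude-file path and hostname are assumptions for illustration, while `hadoop dfsadmin -refreshNodes` is the standard command:

```shell
# Sketch of decommissioning a datanode; paths and hostnames are hypothetical.
# 1. In hadoop-site.xml, point the namenode at an exclude file:
#      <property>
#        <name>dfs.hosts.exclude</name>
#        <value>/usr/local/hadoop/conf/excludes</value>
#      </property>
# 2. List the datanode(s) to retire, one hostname per line:
echo "datanode3.example.com" > /tmp/excludes
# 3. Ask the namenode to re-read the file; it then re-replicates the
#    node's blocks and marks it "Decommissioned" when done:
#      hadoop dfsadmin -refreshNodes
cat /tmp/excludes
```

Once the node shows as decommissioned in the namenode web UI, it can be shut down without data loss; the re-replication happens before that point, not after the node disappears.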
Question about Hadoop filesystem
How do I remove a datanode? Do I simply "destroy" my datanode and the namenode will automatically detect it? Or is there a more elegant way to do it?

Also, when I remove a datanode, does Hadoop automatically re-replicate the data right away?

Thanks,
Harold
Re: question about hadoop and amazon ec2 ?
1. They are related in that EC2 can serve as the computation platform on which Hadoop runs. Refer: http://wiki.apache.org/hadoop/AmazonEC2
2. Yes. Refer: http://wiki.apache.org/hadoop/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)
3. Because EC2 can serve as the computation platform for Hadoop (see 1) - it is simply one place to rent the machines Hadoop runs on.

--nitesh

On Sun, Feb 15, 2009 at 2:18 PM, buddha1021 wrote:
> hi:
> What is the relationship between Hadoop and Amazon EC2?
> Can Hadoop run directly on an ordinary PC (not a server)?
> Why do some people say Hadoop runs on Amazon EC2?
> thanks!

--
Nitesh Bhatia
Dhirubhai Ambani Institute of Information & Communication Technology
Gandhinagar, Gujarat

"Life is never perfect. It just depends where you draw the line."

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
question about hadoop and amazon ec2 ?
hi:
What is the relationship between Hadoop and Amazon EC2?
Can Hadoop run directly on an ordinary PC (not a server)?
Why do some people say Hadoop runs on Amazon EC2?
thanks!
--
View this message in context: http://www.nabble.com/question-about-hadoop-and-amazon-ec2---tp22020652p22020652.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Question about Hadoop's Feature(s)
> However, HDFS uses HTTP to serve blocks up - that needs to be locked down too. Would the signing work there?

I am not familiar with HDFS over HTTP. Could it simply sign the stream and include the signature at the end of the HTTP message returned?

On Tue, Sep 30, 2008 at 8:56 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:
> Jason Rutherglen wrote:
>> I implemented an RMI protocol using Hadoop IPC and implemented basic HMAC signing. I believe it is faster than public/private-key signing because it uses a secret key and does not require key provisioning the way PKI would. Perhaps it would be a baseline way to sign the data.
>
> That should work for authenticating messages between (trusted) nodes. Presumably the ipc.key value could be set in the Conf and all would be well.
>
> External job submitters shouldn't be given those keys; they'd need an HTTP(S) front end that could authenticate them however the organisation worked.
>
> Yes, that would be simpler. I am not enough of a security expert to say if it will work, but the keys should be easier to work with. As long as the configuration files are kept secure, your cluster will be locked down.
>
> However, HDFS uses HTTP to serve blocks up - that needs to be locked down too. Would the signing work there?
>
> -steve
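Whatever the wire format ends up being, the secret-key scheme discussed in this thread is just an HMAC over the message bytes under a shared key. As a stand-alone illustration (computed here with openssl, not Hadoop's actual signing format), using the published HMAC-SHA1 test vector from RFC 2202:

```shell
# HMAC-SHA1 of a message under the shared secret "key". Both ends
# compute this tag from the bytes they saw and compare - no PKI
# provisioning required, only a shared secret in the configuration.
echo -n "The quick brown fox jumps over the lazy dog" \
  | openssl dgst -sha1 -hmac "key"
```

With the key "key", the tag is de7c9b85b8b78aa6bc8a7a36f70a90701c9db4d9 (the standard test vector). For a streamed HTTP response, the server could emit such a tag as a trailer after the body, as suggested above.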
Re: Question about Hadoop's Feature(s)
Jason Rutherglen wrote:
> I implemented an RMI protocol using Hadoop IPC and implemented basic HMAC signing. I believe it is faster than public/private-key signing because it uses a secret key and does not require key provisioning the way PKI would. Perhaps it would be a baseline way to sign the data.

That should work for authenticating messages between (trusted) nodes. Presumably the ipc.key value could be set in the Conf and all would be well.

External job submitters shouldn't be given those keys; they'd need an HTTP(S) front end that could authenticate them however the organisation worked.

Yes, that would be simpler. I am not enough of a security expert to say if it will work, but the keys should be easier to work with. As long as the configuration files are kept secure, your cluster will be locked down.

However, HDFS uses HTTP to serve blocks up - that needs to be locked down too. Would the signing work there?

-steve
Re: Question about Hadoop's Feature(s)
I implemented an RMI protocol using Hadoop IPC and implemented basic HMAC signing. I believe it is faster than public/private-key signing because it uses a secret key and does not require key provisioning the way PKI would. Perhaps it would be a baseline way to sign the data.

On Thu, Sep 25, 2008 at 7:47 AM, Steve Loughran <[EMAIL PROTECTED]> wrote:
> Owen O'Malley wrote:
>> On Sep 24, 2008, at 1:50 AM, Trinh Tuan Cuong wrote:
>>> We are developing a project and we intend to use Hadoop to handle the processing of vast amounts of data. But to convince our customers about the use of Hadoop in our project, we must show them the advantages (and maybe the disadvantages) of deploying the project with Hadoop compared to an Oracle database platform.
>>
>> The primary advantage of Hadoop is scalability. On an equivalent hardware budget, Hadoop can handle much, much larger databases. We had a process that was run once a week on Oracle that is now run once an hour on Hadoop. Additionally, Hadoop scales out much, much farther. We can store petabytes of data in a single Hadoop cluster and have jobs that read and generate hundreds of terabytes.
>
> That said, what a database gives you - on the right hardware - is very fast responses, especially if the indices are set up right and the data denormalised when appropriate. There is also really good integration with tools and application servers, with things like Java EE designed to make running code against a database easy.
>
> Not using Oracle means you don't have to work with an Oracle DBA, which, in my experience, can only be a good thing. DBAs and developers never seem to see eye to eye.
>
>> Hadoop only has very primitive security at the moment, although I expect that to change in the next 6 months.
>
> Right now you need to trust everyone else on the network where you run hadoop not to be malicious; the filesystem and job tracker interfaces are insecure. The forthcoming 0.19 release will ask who you are, but the far end trusts you to be who you say you are. In that respect, it's as secure as NFS over UDP.
>
> To secure Hadoop you'd probably need to:
> - sign every IPC request, with a CPU time cost at both ends;
> - require some form of authentication for the HTTP-exported parts of the system, such as digest authentication, or issue lots of HTTPS private keys and use those instead - giving everyone a key management problem as well as extra communications overhead.
>
> What would be easier is to lock down remote access to filesystem/job submission so that only authenticated users are able to upload jobs and data. The cluster would continue to trust everything else on its network, but the system wouldn't trust people to submit work unless they could prove who they were.
Re: Question about Hadoop's Feature(s)
Owen O'Malley wrote:
> On Sep 24, 2008, at 1:50 AM, Trinh Tuan Cuong wrote:
>> We are developing a project and we intend to use Hadoop to handle the processing of vast amounts of data. But to convince our customers about the use of Hadoop in our project, we must show them the advantages (and maybe the disadvantages) of deploying the project with Hadoop compared to an Oracle database platform.
>
> The primary advantage of Hadoop is scalability. On an equivalent hardware budget, Hadoop can handle much, much larger databases. We had a process that was run once a week on Oracle that is now run once an hour on Hadoop. Additionally, Hadoop scales out much, much farther. We can store petabytes of data in a single Hadoop cluster and have jobs that read and generate hundreds of terabytes.

That said, what a database gives you - on the right hardware - is very fast responses, especially if the indices are set up right and the data denormalised when appropriate. There is also really good integration with tools and application servers, with things like Java EE designed to make running code against a database easy.

Not using Oracle means you don't have to work with an Oracle DBA, which, in my experience, can only be a good thing. DBAs and developers never seem to see eye to eye.

> Hadoop only has very primitive security at the moment, although I expect that to change in the next 6 months.

Right now you need to trust everyone else on the network where you run hadoop not to be malicious; the filesystem and job tracker interfaces are insecure. The forthcoming 0.19 release will ask who you are, but the far end trusts you to be who you say you are. In that respect, it's as secure as NFS over UDP.

To secure Hadoop you'd probably need to:
- sign every IPC request, with a CPU time cost at both ends;
- require some form of authentication for the HTTP-exported parts of the system, such as digest authentication, or issue lots of HTTPS private keys and use those instead - giving everyone a key management problem as well as extra communications overhead.

What would be easier is to lock down remote access to filesystem/job submission so that only authenticated users are able to upload jobs and data. The cluster would continue to trust everything else on its network, but the system wouldn't trust people to submit work unless they could prove who they were.
Re: Question about Hadoop's Feature(s)
One of the major advantages of Hadoop over Oracle: it saves you a lot of $$$.

2008/9/25 Trinh Tuan Cuong <[EMAIL PROTECTED]>:
> Dear Mr. Owen O'Malley,
>
> First, I would like to thank you very much for your reply; it was exactly the answer I expected. As I read about the query languages for Hadoop - a combination of Pig (Pig Latin), Hive, HBase, Jaql, and more - I could see that Hadoop has the advantage of an SQL-like query language. The thing I was most curious about is Hadoop's security level, which is hard to find in any of the documents I searched. Like your organization and many others, we believe in the fast growth of Hadoop and intend to use it in our serious projects. Once again, thanks for the reply; now I can tell our clients clearly about Hadoop.
>
> Best Regards.
>
> Tuan Cuong, Trinh.
> [EMAIL PROTECTED]
> Luvina Software Company.
> Website: www.luvina.net
>
> -Original Message-
> From: Owen O'Malley [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, September 24, 2008 11:27 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Question about Hadoop's Feature(s)
>
> On Sep 24, 2008, at 1:50 AM, Trinh Tuan Cuong wrote:
>> We are developing a project and we intend to use Hadoop to handle the processing of vast amounts of data. But to convince our customers about the use of Hadoop in our project, we must show them the advantages (and maybe the disadvantages) of deploying the project with Hadoop compared to an Oracle database platform.
>
> The primary advantage of Hadoop is scalability. On an equivalent hardware budget, Hadoop can handle much, much larger databases. We had a process that was run once a week on Oracle that is now run once an hour on Hadoop. Additionally, Hadoop scales out much, much farther. We can store petabytes of data in a single Hadoop cluster and have jobs that read and generate hundreds of terabytes.
>
> The disadvantage of Hadoop is that it is still relatively young and growing fast, so there are growing pains. Hadoop has recently gotten higher-level, SQL-like query languages (Pig, Hive, and Jaql), but still doesn't have any fancy report generators. Hadoop only has very primitive security at the moment, although I expect that to change in the next 6 months.
>
> -- Owen
RE: Question about Hadoop's Feature(s)
Dear Mr. Owen O'Malley,

First, I would like to thank you very much for your reply; it was exactly the answer I expected. As I read about the query languages for Hadoop - a combination of Pig (Pig Latin), Hive, HBase, Jaql, and more - I could see that Hadoop has the advantage of an SQL-like query language. The thing I was most curious about is Hadoop's security level, which is hard to find in any of the documents I searched. Like your organization and many others, we believe in the fast growth of Hadoop and intend to use it in our serious projects. Once again, thanks for the reply; now I can tell our clients clearly about Hadoop.

Best Regards.

Tuan Cuong, Trinh.
[EMAIL PROTECTED]
Luvina Software Company.
Website: www.luvina.net

-Original Message-
From: Owen O'Malley [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 24, 2008 11:27 PM
To: core-user@hadoop.apache.org
Subject: Re: Question about Hadoop's Feature(s)

On Sep 24, 2008, at 1:50 AM, Trinh Tuan Cuong wrote:
> We are developing a project and we intend to use Hadoop to handle the processing of vast amounts of data. But to convince our customers about the use of Hadoop in our project, we must show them the advantages (and maybe the disadvantages) of deploying the project with Hadoop compared to an Oracle database platform.

The primary advantage of Hadoop is scalability. On an equivalent hardware budget, Hadoop can handle much, much larger databases. We had a process that was run once a week on Oracle that is now run once an hour on Hadoop. Additionally, Hadoop scales out much, much farther. We can store petabytes of data in a single Hadoop cluster and have jobs that read and generate hundreds of terabytes.

The disadvantage of Hadoop is that it is still relatively young and growing fast, so there are growing pains. Hadoop has recently gotten higher-level, SQL-like query languages (Pig, Hive, and Jaql), but still doesn't have any fancy report generators. Hadoop only has very primitive security at the moment, although I expect that to change in the next 6 months.

-- Owen
Re: Question about Hadoop's Feature(s)
On Sep 24, 2008, at 1:50 AM, Trinh Tuan Cuong wrote:
> We are developing a project and we intend to use Hadoop to handle the processing of vast amounts of data. But to convince our customers about the use of Hadoop in our project, we must show them the advantages (and maybe the disadvantages) of deploying the project with Hadoop compared to an Oracle database platform.

The primary advantage of Hadoop is scalability. On an equivalent hardware budget, Hadoop can handle much, much larger databases. We had a process that was run once a week on Oracle that is now run once an hour on Hadoop. Additionally, Hadoop scales out much, much farther. We can store petabytes of data in a single Hadoop cluster and have jobs that read and generate hundreds of terabytes.

The disadvantage of Hadoop is that it is still relatively young and growing fast, so there are growing pains. Hadoop has recently gotten higher-level, SQL-like query languages (Pig, Hive, and Jaql), but still doesn't have any fancy report generators. Hadoop only has very primitive security at the moment, although I expect that to change in the next 6 months.

-- Owen
Question about Hadoop's Feature(s)
Hi,

We are developing a project and we intend to use Hadoop to handle the processing of vast amounts of data. But to convince our customers about the use of Hadoop in our project, we must show them the advantages (and maybe the disadvantages) of deploying the project with Hadoop compared to an Oracle database platform.

So I would like to have a full feature list of Hadoop, describing what features are integrated in the latest version of Hadoop (0.18.1) - especially features related to manipulating databases and facilitating their use, and perhaps some features about security and platform support that are not related to database manipulation.

P.S.: I did Google and Yahoo the feature list of Hadoop for several days, but found no clues - only small feature lists for HDFS, MapReduce, or Hadoop on Demand standalone. What I really want is the complete feature list of the latest version. Thanks in advance for any help or links.

Best Regards,

Trịnh Tuấn Cường
Luvina Software Company
Website: www.luvina.net
Address: 1001 Hoang Quoc Viet Street
Email: [EMAIL PROTECTED], [EMAIL PROTECTED]
Mobile: 097 4574 457
Re: Question about Hadoop
Thank you very much for explaining it to me, Ted. That's a great deal of info! I guess that could be how the Yahoo! WebMap is designed. And for anyone trying to grasp the massiveness of Hadoop computing, http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/ should give a good picture of a practical case. I was for a moment flabbergasted, and instantly fell in love with Hadoop! ;)

On Sat, Jun 14, 2008 at 12:11 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> Usually hadoop programs are not used interactively, since what they excel at is batch operations on very large collections of data.
>
> It is quite reasonable to store resulting data in hadoop and access those results using hadoop. The cleanest way to do that is to have a presentation-layer web server that has all of the UI on it and use HTTP to access the results file from hadoop via the namenode's data access URL. This works well where the results are not particularly voluminous.
>
> For large quantities of data such as the output of a web crawl, it is usually better to copy the output out of hadoop and into a clustered system that supports high-speed querying of the data. This clustered system might be as simple as a redundant memcache or MySQL farm, or as fancy as a sharded and replicated farm of text retrieval engines running under Solr. What works for you will vary by what you need to do.
>
> You should keep in mind that hadoop was designed for a very long MTBF (for a cluster), but not designed for zero-downtime operation. At the very least, you will occasionally want to upgrade the cluster software, and that currently can't be done during normal operations. Combining hadoop (for heavy-duty computations) with a separate persistence layer (for a high-availability web service) is a good hybrid.
>
> On Thu, Jun 12, 2008 at 9:53 PM, Chanchal James <[EMAIL PROTECTED]> wrote:
>> Thank you all for the responses.
>>
>> So in order to run a web-based application, I just need to put the part of the application that needs to make use of distributed computation in HDFS, and have the other web site related files access it via Hadoop streaming?
>>
>> Is that how Hadoop is used?
>>
>> Sorry if the question sounds too silly.
>>
>> Thank you.
>
> --
> ted
Re: Question about Hadoop
Usually hadoop programs are not used interactively, since what they excel at is batch operations on very large collections of data.

It is quite reasonable to store resulting data in hadoop and access those results using hadoop. The cleanest way to do that is to have a presentation-layer web server that has all of the UI on it and use HTTP to access the results file from hadoop via the namenode's data access URL. This works well where the results are not particularly voluminous.

For large quantities of data such as the output of a web crawl, it is usually better to copy the output out of hadoop and into a clustered system that supports high-speed querying of the data. This clustered system might be as simple as a redundant memcache or MySQL farm, or as fancy as a sharded and replicated farm of text retrieval engines running under Solr. What works for you will vary by what you need to do.

You should keep in mind that hadoop was designed for a very long MTBF (for a cluster), but not designed for zero-downtime operation. At the very least, you will occasionally want to upgrade the cluster software, and that currently can't be done during normal operations. Combining hadoop (for heavy-duty computations) with a separate persistence layer (for a high-availability web service) is a good hybrid.

On Thu, Jun 12, 2008 at 9:53 PM, Chanchal James <[EMAIL PROTECTED]> wrote:
> Thank you all for the responses.
>
> So in order to run a web-based application, I just need to put the part of the application that needs to make use of distributed computation in HDFS, and have the other web site related files access it via Hadoop streaming?
>
> Is that how Hadoop is used?
>
> Sorry if the question sounds too silly.
>
> Thank you.
>
> On Thu, Jun 12, 2008 at 7:49 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>> Once it is in HDFS, you already have backups (due to the replicated file system).
>>
>> Your problems with deleting the dfs data directory are likely configuration problems combined with versioning of the data store (done to avoid confusion, but usually causing confusion). Once you get the configuration and operational issues sorted out, you shouldn't lose any data.
>>
>> On Thu, Jun 12, 2008 at 10:15 AM, Chanchal James <[EMAIL PROTECTED]> wrote:
>>> If I keep all data in HDFS, is there any way I can back it up regularly?

-- ted
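The "namenode's data access URL" mentioned in this thread amounts to a plain HTTP fetch from the presentation-layer server, with no Hadoop client libraries involved. A hedged sketch - host, port, and path below are hypothetical, and 50070 was the default namenode HTTP port in this era:

```shell
# A presentation-layer web server fetching a results file out of HDFS
# over HTTP via the namenode's data-access URL (all names hypothetical).
URL="http://namenode.example.com:50070/data/user/hadoop/results/part-00000"
echo "would fetch: $URL"
# e.g.  curl -s "$URL" > part-00000   (requires a live cluster)
```

As Ted notes, this only suits results that are not voluminous; bulk output should be copied out into a serving system instead.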
Re: Question about Hadoop
Thank you all for the responses.

So in order to run a web-based application, I just need to put the part of the application that needs to make use of distributed computation in HDFS, and have the other web site related files access it via Hadoop streaming?

Is that how Hadoop is used?

Sorry if the question sounds too silly.

Thank you.

On Thu, Jun 12, 2008 at 7:49 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> Once it is in HDFS, you already have backups (due to the replicated file system).
>
> Your problems with deleting the dfs data directory are likely configuration problems combined with versioning of the data store (done to avoid confusion, but usually causing confusion). Once you get the configuration and operational issues sorted out, you shouldn't lose any data.
>
> On Thu, Jun 12, 2008 at 10:15 AM, Chanchal James <[EMAIL PROTECTED]> wrote:
>> If I keep all data in HDFS, is there any way I can back it up regularly?
Re: Question about Hadoop
Once it is in HDFS, you already have backups (due to the replicated file system).

Your problems with deleting the dfs data directory are likely configuration problems combined with versioning of the data store (done to avoid confusion, but usually causing confusion). Once you get the configuration and operational issues sorted out, you shouldn't lose any data.

On Thu, Jun 12, 2008 at 10:15 AM, Chanchal James <[EMAIL PROTECTED]> wrote:
> If I keep all data in HDFS, is there any way I can back it up regularly?
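Replication guards against disk and node failure, but not against operator error like deleting the datastore directory, so for a true backup a periodic copy off the cluster is still worth having. A hedged sketch - the cluster addresses and paths are hypothetical, while `hadoop distcp` and `hadoop fs -copyToLocal` are the standard tools:

```shell
# Sketch of a periodic HDFS backup, e.g. run from cron. The hadoop
# commands are commented out because they require a live cluster;
# cluster URLs and paths are hypothetical.
SNAP="data-$(date +%Y%m%d)"   # dated name for this backup snapshot
# Copy to a second cluster:
#   hadoop distcp hdfs://nn1:9000/data hdfs://backup-nn:9000/backups/$SNAP
# Or pull a copy down to local/archival storage:
#   hadoop fs -copyToLocal /data /backups/$SNAP
echo "$SNAP"
```

distcp runs the copy as a map/reduce job, so it scales with the cluster rather than funneling everything through one machine.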
RE: Question about Hadoop
Looks good to me...

-Original Message-
From: Chanchal James [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 12, 2008 11:22 AM
To: core-user@hadoop.apache.org
Subject: Re: Question about Hadoop

Haijun, I have most of the settings at their defaults, but not the tmp dir. I have the tmp dir set to "/usr/local/hadoop/hadoop-datastore/hadoop-${user.name}". Is this a good location?

On Thu, Jun 12, 2008 at 12:59 PM, Haijun Cao <[EMAIL PROTECTED]> wrote:
> "While testing I had to delete the temporary "datastore" folder and reformat the file system a couple of times."
>
> Is it because you leave hadoop.tmp.dir and the other *.dir parameters at their defaults? Try to set hadoop.tmp.dir to a dir not under /tmp.
>
>   <property>
>     <name>hadoop.tmp.dir</name>
>     <value>/tmp/hadoop-${user.name}</value>
>     <description>A base for other temporary directories.</description>
>   </property>
>
> dfs.name.dir is by default under ${hadoop.tmp.dir}/dfs/name:
>
>   <property>
>     <name>dfs.name.dir</name>
>     <value>${hadoop.tmp.dir}/dfs/name</value>
>   </property>
>
> Haijun
>
> -Original Message-
> From: Chanchal James [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, June 12, 2008 10:16 AM
> To: core-user@hadoop.apache.org
> Subject: Re: Question about Hadoop
>
> Thanks Lohit for the info. I have one more question. If I keep all data in HDFS, is there any way I can back it up regularly? While testing I had to delete the temporary "datastore" folder and reformat the file system a couple of times. So while using Hadoop in a real environment, what are the chances of such uncorrectable software-side problems occurring? Can we correct them without a reformat? I cannot afford to lose the data I plan to put in HDFS.
>
> Thank you.
>
> On Thu, Jun 12, 2008 at 12:02 PM, lohit <[EMAIL PROTECTED]> wrote:
>> Ideally what you would want is your data to be on HDFS, and to run your map/reduce jobs on that data. The Hadoop framework splits your data and feeds those splits to each map or reduce task. One problem with image files is that you will not be able to split them. Alternatively, people have done this: they wrap image files within XML and create huge files which have multiple image files in them. Hadoop offers something called streaming, with which you will be able to split the files at an XML boundary and feed them to your map/reduce tasks. Streaming also enables you to use any code, like perl/php/c++.
>>
>> Check info about streaming here: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html
>> And information about parsing XML files in streaming here: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F
>>
>> Thanks,
>> Lohit
>>
>> - Original Message
>> From: Chanchal James <[EMAIL PROTECTED]>
>> To: core-user@hadoop.apache.org
>> Sent: Thursday, June 12, 2008 9:42:46 AM
>> Subject: Question about Hadoop
>>
>> Hi,
>>
>> I have a question about Hadoop. I am a beginner and just testing Hadoop. I would like to know how a PHP application would benefit from this - say, an application that needs to work on a large number of image files. Do I have to store the application in HDFS always, or do I just copy it to HDFS when needed, do the processing, and then copy it back to the local file system? Is that the case with the data files too? Once I have Hadoop running, do I keep all data & application files in HDFS always, and not use local file system storage?
>>
>> Thank you.
Re: Question about Hadoop
Haijun, I have most of the settings at their defaults, but not the tmp dir. I have the tmp dir set to "/usr/local/hadoop/hadoop-datastore/hadoop-${user.name}". Is this a good location?

On Thu, Jun 12, 2008 at 12:59 PM, Haijun Cao <[EMAIL PROTECTED]> wrote:
> "While testing I had to delete the temporary "datastore" folder and reformat the file system a couple of times."
>
> Is it because you leave hadoop.tmp.dir and the other *.dir parameters at their defaults? Try to set hadoop.tmp.dir to a dir not under /tmp.
>
>   <property>
>     <name>hadoop.tmp.dir</name>
>     <value>/tmp/hadoop-${user.name}</value>
>     <description>A base for other temporary directories.</description>
>   </property>
>
> dfs.name.dir is by default under ${hadoop.tmp.dir}/dfs/name:
>
>   <property>
>     <name>dfs.name.dir</name>
>     <value>${hadoop.tmp.dir}/dfs/name</value>
>   </property>
>
> Haijun
>
> -Original Message-
> From: Chanchal James [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, June 12, 2008 10:16 AM
> To: core-user@hadoop.apache.org
> Subject: Re: Question about Hadoop
>
> Thanks Lohit for the info. I have one more question. If I keep all data in HDFS, is there any way I can back it up regularly? While testing I had to delete the temporary "datastore" folder and reformat the file system a couple of times. So while using Hadoop in a real environment, what are the chances of such uncorrectable software-side problems occurring? Can we correct them without a reformat? I cannot afford to lose the data I plan to put in HDFS.
>
> Thank you.
>
> On Thu, Jun 12, 2008 at 12:02 PM, lohit <[EMAIL PROTECTED]> wrote:
>> Ideally what you would want is your data to be on HDFS, and to run your map/reduce jobs on that data. The Hadoop framework splits your data and feeds those splits to each map or reduce task. One problem with image files is that you will not be able to split them. Alternatively, people have done this: they wrap image files within XML and create huge files which have multiple image files in them. Hadoop offers something called streaming, with which you will be able to split the files at an XML boundary and feed them to your map/reduce tasks. Streaming also enables you to use any code, like perl/php/c++.
>>
>> Check info about streaming here: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html
>> And information about parsing XML files in streaming here: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F
>>
>> Thanks,
>> Lohit
>>
>> - Original Message
>> From: Chanchal James <[EMAIL PROTECTED]>
>> To: core-user@hadoop.apache.org
>> Sent: Thursday, June 12, 2008 9:42:46 AM
>> Subject: Question about Hadoop
>>
>> Hi,
>>
>> I have a question about Hadoop. I am a beginner and just testing Hadoop. I would like to know how a PHP application would benefit from this - say, an application that needs to work on a large number of image files. Do I have to store the application in HDFS always, or do I just copy it to HDFS when needed, do the processing, and then copy it back to the local file system? Is that the case with the data files too? Once I have Hadoop running, do I keep all data & application files in HDFS always, and not use local file system storage?
>>
>> Thank you.
RE: Question about Hadoop
"While testing I had to delete the temporary "datastore" folder and reformat the file system a couple of times."

Is it because you leave hadoop.tmp.dir and the other *.dir parameters at their defaults? Try to set hadoop.tmp.dir to a dir not under /tmp.

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>

dfs.name.dir is by default under ${hadoop.tmp.dir}/dfs/name:

  <property>
    <name>dfs.name.dir</name>
    <value>${hadoop.tmp.dir}/dfs/name</value>
  </property>

Haijun

-Original Message-
From: Chanchal James [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 12, 2008 10:16 AM
To: core-user@hadoop.apache.org
Subject: Re: Question about Hadoop

Thanks Lohit for the info. I have one more question. If I keep all data in HDFS, is there any way I can back it up regularly? While testing I had to delete the temporary "datastore" folder and reformat the file system a couple of times. So while using Hadoop in a real environment, what are the chances of such uncorrectable software-side problems occurring? Can we correct them without a reformat? I cannot afford to lose the data I plan to put in HDFS.

Thank you.

On Thu, Jun 12, 2008 at 12:02 PM, lohit <[EMAIL PROTECTED]> wrote:
> Ideally what you would want is your data to be on HDFS, and to run your map/reduce jobs on that data. The Hadoop framework splits your data and feeds those splits to each map or reduce task. One problem with image files is that you will not be able to split them. Alternatively, people have done this: they wrap image files within XML and create huge files which have multiple image files in them. Hadoop offers something called streaming, with which you will be able to split the files at an XML boundary and feed them to your map/reduce tasks. Streaming also enables you to use any code, like perl/php/c++.
>
> Check info about streaming here: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html
> And information about parsing XML files in streaming here: http://hadoop.apache.org/core/docs/r0.17.0/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F
>
> Thanks,
> Lohit
>
> - Original Message
> From: Chanchal James <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Thursday, June 12, 2008 9:42:46 AM
> Subject: Question about Hadoop
>
> Hi,
>
> I have a question about Hadoop. I am a beginner and just testing Hadoop. I would like to know how a PHP application would benefit from this - say, an application that needs to work on a large number of image files. Do I have to store the application in HDFS always, or do I just copy it to HDFS when needed, do the processing, and then copy it back to the local file system? Is that the case with the data files too? Once I have Hadoop running, do I keep all data & application files in HDFS always, and not use local file system storage?
>
> Thank you.
Re: Question about Hadoop
Thanks Lohit for the info. I have one more question. If I keep all data in HDFS, is there any way I can back it up regularly? While testing I had to delete the temporary "datastore" folder and reformat the file system a couple of times. So while using Hadoop in a real environment, what are the chances of such software-side uncorrectable problems occurring? Can we correct them without a reformat? I cannot afford to lose the data I plan to put in HDFS. Thank you.

On Thu, Jun 12, 2008 at 12:02 PM, lohit <[EMAIL PROTECTED]> wrote:
> Ideally what you would want is your data to be on HDFS, and to run your
> map/reduce jobs on that data. The Hadoop framework splits your data and feeds
> those splits to each map or reduce task. One problem with image files is
> that you will not be able to split them. Alternatively, people have done
> this: they wrap image files within XML and create huge files which contain
> multiple image files. Hadoop offers something called streaming, using
> which you will be able to split the files at the XML boundary and feed them
> to your map/reduce tasks. Streaming also enables you to use any code, like
> perl/php/c++.
> Check info about streaming here:
> http://hadoop.apache.org/core/docs/r0.17.0/streaming.html
> And information about parsing XML files in streaming here:
> http://hadoop.apache.org/core/docs/r0.17.0/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F
>
> Thanks,
> Lohit
>
> ----- Original Message -----
> From: Chanchal James <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Thursday, June 12, 2008 9:42:46 AM
> Subject: Question about Hadoop
>
> Hi,
>
> I have a question about Hadoop. I am a beginner and just testing Hadoop.
> Would like to know how a PHP application would benefit from this, say an
> application that needs to work on a large number of image files. Do I have to
> store the application in HDFS always, or do I just copy it to HDFS when
> needed, do the processing, and then copy it back to the local file system?
> Is that the case with the data files too? Once I have Hadoop running, do I
> keep all data & application files in HDFS always, and not use local file
> system storage?
>
> Thank you.
Re: Question about Hadoop
Ideally what you would want is your data to be on HDFS, and to run your map/reduce jobs on that data. The Hadoop framework splits your data and feeds those splits to each map or reduce task. One problem with image files is that you will not be able to split them. Alternatively, people have done this: they wrap image files within XML and create huge files which contain multiple image files. Hadoop offers something called streaming, using which you will be able to split the files at the XML boundary and feed them to your map/reduce tasks. Streaming also enables you to use any code, like perl/php/c++.

Check info about streaming here:
http://hadoop.apache.org/core/docs/r0.17.0/streaming.html
And information about parsing XML files in streaming here:
http://hadoop.apache.org/core/docs/r0.17.0/streaming.html#How+do+I+parse+XML+documents+using+streaming%3F

Thanks,
Lohit

----- Original Message -----
From: Chanchal James <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Thursday, June 12, 2008 9:42:46 AM
Subject: Question about Hadoop

Hi,

I have a question about Hadoop. I am a beginner and just testing Hadoop. Would like to know how a PHP application would benefit from this, say an application that needs to work on a large number of image files. Do I have to store the application in HDFS always, or do I just copy it to HDFS when needed, do the processing, and then copy it back to the local file system? Is that the case with the data files too? Once I have Hadoop running, do I keep all data & application files in HDFS always, and not use local file system storage?

Thank you.
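A minimal sketch of lohit's wrap-images-in-XML suggestion: with streaming's StreamXmlRecordReader each map task receives whole XML-wrapped records on stdin, and the mapper can be a plain script. The tag names (<image>, <name>, <data>) and the size-counting logic below are illustrative assumptions, not from the thread:

```python
#!/usr/bin/env python
# Illustrative streaming mapper. Each stdin record is one
# <image>...</image> element, as StreamXmlRecordReader would deliver
# with begin=<image>, end=</image>. Emits "name<TAB>payload_size".
import re
import sys

def map_records(lines):
    """Yield (name, payload_size) for each XML-wrapped image record."""
    text = "".join(lines)
    for m in re.finditer(r"<image>(.*?)</image>", text, re.DOTALL):
        record = m.group(1)
        name_m = re.search(r"<name>(.*?)</name>", record)
        data_m = re.search(r"<data>(.*?)</data>", record, re.DOTALL)
        if name_m and data_m:
            yield name_m.group(1), len(data_m.group(1))

if __name__ == "__main__":
    for name, size in map_records(sys.stdin):
        print("%s\t%d" % (name, size))
```

Such a mapper would be wired in with the streaming jar's -inputreader option (roughly `-inputreader "StreamXmlRecordReader,begin=<image>,end=</image>" -mapper mapper.py`); see the streaming documentation linked above for the exact invocation on your version.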
Question about Hadoop
Hi,

I have a question about Hadoop. I am a beginner and just testing Hadoop. Would like to know how a PHP application would benefit from this, say an application that needs to work on a large number of image files. Do I have to store the application in HDFS always, or do I just copy it to HDFS when needed, do the processing, and then copy it back to the local file system? Is that the case with the data files too? Once I have Hadoop running, do I keep all data & application files in HDFS always, and not use local file system storage?

Thank you.
Re: question about hadoop 0.17 upgrade
wrote:
> Upgrading 0.16.3 to 0.17, an error appears when starting dfs and the jobtracker. How can I fix it? Thanks!
> I have used the "start-dfs.sh -upgrade" command to upgrade the filesystem. Below is the error log:

2008-05-26 09:14:33,463 INFO org.apache.hadoop.mapred.JobTracker: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting JobTracker
STARTUP_MSG:   host = test180.sqa/192.168.207.180
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.17.0
STARTUP_MSG:   build = http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 656523; compiled by 'hadoopqa' on Thu May 15 07:22:55 UTC 2008
************************************************************/
2008-05-26 09:14:33,567 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=JobTracker, port=9001
2008-05-26 09:14:33,610 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2008-05-26 09:14:33,611 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 9001: starting
2008-05-26 09:14:33,611 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 9001: starting
2008-05-26 09:14:33,611 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 9001: starting
2008-05-26 09:14:33,613 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 9001: starting
2008-05-26 09:14:33,613 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 9001: starting
2008-05-26 09:14:33,664 INFO org.mortbay.util.Credential: Checking Resource aliases
2008-05-26 09:14:33,733 INFO
org.mortbay.http.HttpServer: Version Jetty/5.1.4
2008-05-26 09:14:33,734 INFO org.mortbay.util.Container: Started HttpContext[/static,/static]
2008-05-26 09:14:33,734 INFO org.mortbay.util.Container: Started HttpContext[/logs,/logs]
2008-05-26 09:14:33,962 INFO org.mortbay.util.Container: Started [EMAIL PROTECTED]
2008-05-26 09:14:33,998 INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
2008-05-26 09:14:34,000 INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50030
2008-05-26 09:14:34,000 INFO org.mortbay.util.Container: Started [EMAIL PROTECTED]
2008-05-26 09:14:34,002 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
2008-05-26 09:14:34,003 INFO org.apache.hadoop.mapred.JobTracker: JobTracker up at: 9001
2008-05-26 09:14:34,003 INFO org.apache.hadoop.mapred.JobTracker: JobTracker webserver: 50030
2008-05-26 09:14:34,096 INFO org.apache.hadoop.mapred.JobTracker: problem cleaning system directory: /home/hadoop/HadoopInstall/tmp/mapred/system
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.SafeModeException: Cannot delete /home/hadoop/HadoopInstall/tmp/mapred/system. Name node is in safe mode.
The ratio of reported blocks 0. has not reached the threshold 0.9990. Safe mode will be turned off automatically.
        at org.apache.hadoop.dfs.FSNamesystem.deleteInternal(FSNamesystem.java:1519)
        at org.apache.hadoop.dfs.FSNamesystem.delete(FSNamesystem.java:1498)
        at org.apache.hadoop.dfs.NameNode.delete(NameNode.java:383)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
        at org.apache.hadoop.ipc.Client.call(Client.java:557)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
        at org.apache.hadoop.dfs.$Proxy4.delete(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at org.apache.hadoop.dfs.$Proxy4.delete(Unknown Source)
        at org.apache.hadoop.dfs.DFSClient.delete(DFSClient.java:535)
        at org.apache.hadoop.dfs.DistributedFileSyste
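The SafeModeException in the log is expected right after an upgrade: the NameNode stays in safe mode until the reported-block ratio reaches the 0.9990 threshold, and the JobTracker cannot clean its system directory until then. A sketch of how this can be inspected from the shell, using the dfsadmin commands of that Hadoop generation (run as the HDFS user; the finalize step is only for after you are sure you will not roll back):

```shell
# Check whether the NameNode is still in safe mode
hadoop dfsadmin -safemode get

# Block until the NameNode leaves safe mode on its own
hadoop dfsadmin -safemode wait

# Once the upgraded filesystem is verified, finalize the upgrade
hadoop dfsadmin -finalizeUpgrade

# Safe mode can also be left manually, but only if you are certain
# the datanodes simply have not reported in yet
hadoop dfsadmin -safemode leave
```

If the ratio stays at 0 indefinitely, the usual cause is that the datanodes never started or cannot reach the NameNode, so their logs are the next place to look.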
question about hadoop 0.17 upgrade
Upgrading 0.16.3 to 0.17, an error appears when starting dfs and the jobtracker. How can I fix it? Thanks!
I have used the "start-dfs.sh -upgrade" command to upgrade the filesystem. Below is the error log:

2008-05-26 09:14:33,463 INFO org.apache.hadoop.mapred.JobTracker: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting JobTracker
STARTUP_MSG:   host = test180.sqa/192.168.207.180
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.17.0
STARTUP_MSG:   build = http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 656523; compiled by 'hadoopqa' on Thu May 15 07:22:55 UTC 2008
************************************************************/
2008-05-26 09:14:33,567 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=JobTracker, port=9001
2008-05-26 09:14:33,610 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2008-05-26 09:14:33,611 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 9001: starting
2008-05-26 09:14:33,611 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 9001: starting
2008-05-26 09:14:33,611 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 5 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 6 on 9001: starting
2008-05-26 09:14:33,612 INFO org.apache.hadoop.ipc.Server: IPC Server handler 7 on 9001: starting
2008-05-26 09:14:33,613 INFO org.apache.hadoop.ipc.Server: IPC Server handler 8 on 9001: starting
2008-05-26 09:14:33,613 INFO org.apache.hadoop.ipc.Server: IPC Server handler 9 on 9001: starting
2008-05-26 09:14:33,664 INFO org.mortbay.util.Credential: Checking Resource aliases
2008-05-26 09:14:33,733 INFO
org.mortbay.http.HttpServer: Version Jetty/5.1.4
2008-05-26 09:14:33,734 INFO org.mortbay.util.Container: Started HttpContext[/static,/static]
2008-05-26 09:14:33,734 INFO org.mortbay.util.Container: Started HttpContext[/logs,/logs]
2008-05-26 09:14:33,962 INFO org.mortbay.util.Container: Started [EMAIL PROTECTED]
2008-05-26 09:14:33,998 INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
2008-05-26 09:14:34,000 INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50030
2008-05-26 09:14:34,000 INFO org.mortbay.util.Container: Started [EMAIL PROTECTED]
2008-05-26 09:14:34,002 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
2008-05-26 09:14:34,003 INFO org.apache.hadoop.mapred.JobTracker: JobTracker up at: 9001
2008-05-26 09:14:34,003 INFO org.apache.hadoop.mapred.JobTracker: JobTracker webserver: 50030
2008-05-26 09:14:34,096 INFO org.apache.hadoop.mapred.JobTracker: problem cleaning system directory: /home/hadoop/HadoopInstall/tmp/mapred/system
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.SafeModeException: Cannot delete /home/hadoop/HadoopInstall/tmp/mapred/system. Name node is in safe mode.
The ratio of reported blocks 0. has not reached the threshold 0.9990. Safe mode will be turned off automatically.
        at org.apache.hadoop.dfs.FSNamesystem.deleteInternal(FSNamesystem.java:1519)
        at org.apache.hadoop.dfs.FSNamesystem.delete(FSNamesystem.java:1498)
        at org.apache.hadoop.dfs.NameNode.delete(NameNode.java:383)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
        at org.apache.hadoop.ipc.Client.call(Client.java:557)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
        at org.apache.hadoop.dfs.$Proxy4.delete(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at org.apache.hadoop.dfs.$Proxy4.delete(Unknown Source)
        at o
Re: One Simple Question About Hadoop DFS
On Sun, 23 Mar 2008, Chaman Singh Verma wrote:
> Hello,
>
> I am exploring Hadoop and MapReduce and I have one very simple question.
>
> I have a 500GB dataset on my local disk and I have written both Map-Reduce
> functions. Now how should I start?
>
> 1. I copy the data from the local disk to DFS. I have configured DFS with 100
> machines. I hope that it will split the file across the 100 nodes (with some
> replication).
>
Yes. You need to copy the data from your local disk to the DFS. It will split the files based on the dfs block size (dfs.block.size). The default block size is 64MB, and hence there would be 8000 blocks.
> 2. For MapReduce should I specify 100 nodes for SetMaxMapTask()? If I specify
> less than 100, will the blocks migrate? If the blocks don't migrate, then
> why is this function provided to the users? Why is the number of tasks not
> taken from the startup script?
>
Again, here the max number of maps is bounded by the dfs block size. Hence in the default case you would have 8000 maps (unless you have your own input format).
> 3. If I specify more than 100, will load balancing be done automatically,
> or does the user have to specify that also?
>
In short, it's the dfs block size along with the input format that controls the number of maps. The number of maps given to the framework is used as a hint. Sometimes it doesn't matter what value is passed.

Amar
> Perhaps these are very simple questions, but I think that MapReduce
> simplifies lots of things (compared to MPI-based programming), so
> beginners like me have a difficult time understanding the model.
>
> csv
>
> ---------------------------------
> Never miss a thing. Make Yahoo your homepage.
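Amar's 8000-block figure follows directly from the default block size; a quick sanity check of the arithmetic (binary units assumed, i.e. 1 GB = 1024 MB, as HDFS block sizes use):

```python
# Number of HDFS blocks needed for a 500 GB file at the default
# 64 MB block size: ceiling of file size over block size.
import math

def num_blocks(file_size_bytes, block_size_bytes):
    """Blocks needed to store a file of the given size."""
    return math.ceil(file_size_bytes / block_size_bytes)

GB = 1024 ** 3
MB = 1024 ** 2

blocks = num_blocks(500 * GB, 64 * MB)
print(blocks)  # -> 8000, matching the estimate in the reply
```

With default map scheduling this is also the (approximate) number of map tasks, which is why the reply says the value passed to SetMaxMapTask() is only a hint.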
One Simple Question About Hadoop DFS
Hello,

I am exploring Hadoop and MapReduce and I have one very simple question.

I have a 500GB dataset on my local disk and I have written both Map-Reduce functions. Now how should I start?

1. I copy the data from the local disk to DFS. I have configured DFS with 100 machines. I hope that it will split the file across the 100 nodes (with some replication).

2. For MapReduce should I specify 100 nodes for SetMaxMapTask()? If I specify less than 100, will the blocks migrate? If the blocks don't migrate, then why is this function provided to the users? Why is the number of tasks not taken from the startup script?

3. If I specify more than 100, will load balancing be done automatically, or does the user have to specify that also?

Perhaps these are very simple questions, but I think that MapReduce simplifies lots of things (compared to MPI-based programming), so beginners like me have a difficult time understanding the model.

csv

---------------------------------
Never miss a thing. Make Yahoo your homepage.