Deprecated ... damaged?
Hi everyone, Using Hadoop-0.20.2, I'm trying to use MultiFileInputFormat, which is supposed to put each file from the input directory in a SEPARATE split, so the number of maps is equal to the number of input files. Yet what I get is that each split contains multiple input file paths, hence # of maps < # of input files. Is it because MultiFileInputFormat is deprecated? In my myMultiFileInputFormat implementation I have only the following:

    public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf job, Reporter reporter) {
        return new myRecordReader((MultiFileSplit) split);
    }

Yet in myRecordReader, for example, one split has the following:

    /tmp/input/file1:0+300
    /tmp/input/file2:0+199

instead of each file being in its own split. Why? Any clues? Thank you, Maha
Re: Hive import question
Exactly what I was looking for. Thanks

On 12/14/10 8:53 PM, 김영우 wrote: Hi Mark, You can use an 'external table' in Hive: http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL Hive external tables do not move or delete files. - Youngwoo

2010/12/15 Mark static.void@gmail.com: When I load a file from HDFS into Hive I notice that the original file has been removed. Is there any way to prevent this? If not, how can I go back and dump it as a file again? Thanks
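A minimal sketch of Youngwoo's suggestion (the table name, columns, and path are made up for illustration): an external table leaves the data where it is, so creating or dropping the table never moves or deletes the underlying files.

    -- hypothetical table; LOCATION points Hive at data that stays in place
    CREATE EXTERNAL TABLE page_views (view_time INT, url STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/user/mark/page_views';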
Hive Partitioning
Can someone explain what partitioning is and why it would be used... an example? Thanks
Re: Hive Partitioning
Hi Mark, I think you will get more and better responses for this question on the Hive mailing lists (http://hive.apache.org/mailing_lists.html). Regards, Hari

On Wed, Dec 15, 2010 at 8:52 PM, Mark static.void@gmail.com wrote: Can someone explain what partitioning is and why it would be used... an example? Thanks
Re: Hadoop Certification Programme
On 09/12/10 03:40, Matthew John wrote: Hi all, Is there any valid Hadoop Certification available? Something which adds credibility to your Hadoop expertise.

Well, there's always providing enough patches to the code to get commit rights :)
Re: Hadoop/Elastic MR on AWS
On 10/12/10 06:14, Amandeep Khurana wrote: Mark, Using EMR makes it very easy to start a cluster and add/reduce capacity as and when required. There are certain optimizations that make EMR an attractive choice compared to building out your own cluster. Using EMR also ensures you are using a production-quality, stable system backed by the EMR engineers. You can always use bootstrap actions to put your own tweaked version of Hadoop in there if you want to do that. Also, you don't have to tear down your cluster after every job. You can set the alive option when you start your cluster and it will stay there even after your Hadoop job completes. If you face any issues with EMR, send me a mail offline and I'll be happy to help.

How different is your distro from the Apache version?
Re: Question from a Desperate Java Newbie
On 10/12/10 09:08, Edward Choi wrote: I was wrong. It wasn't because of the read-once free policy. I tried again with Java first and this time it didn't work. I looked up Google and found the HttpClient you mentioned. It is the one provided by Apache, right? I guess I will have to try that one now. Thanks!

HttpClient is good. HtmlUnit has a very good client that can simulate things like a full web browser with cookies, but that may be overkill. NYT's read-once policy uses cookies to verify that you are there for the first day not logged in; for later days you get 302'd unless you delete the cookie, so stateful clients are bad. What you may have been hit by is whatever robot trap they have - if you generate too much load and don't follow the robots.txt rules, they may detect this and push back.
Re: Hadoop Certification Programme
Hey, commit rights won't give you a nice looking certificate, would it? ;)

On Wed, Dec 15, 2010 at 09:12, Steve Loughran ste...@apache.org wrote: On 09/12/10 03:40, Matthew John wrote: Hi all, Is there any valid Hadoop Certification available? Something which adds credibility to your Hadoop expertise. Well, there's always providing enough patches to the code to get commit rights :)
Re: Hadoop Certification Programme
But it would give you the right creds for people that you'd want to work for :) James

On 2010-12-15, at 10:26 AM, Konstantin Boudnik wrote: Hey, commit rights won't give you a nice looking certificate, would it? ;) On Wed, Dec 15, 2010 at 09:12, Steve Loughran ste...@apache.org wrote: On 09/12/10 03:40, Matthew John wrote: Hi all, Is there any valid Hadoop Certification available? Something which adds credibility to your Hadoop expertise. Well, there's always providing enough patches to the code to get commit rights :)
Re: Hadoop/Elastic MR on AWS
On 09/12/10 18:57, Aaron Eng wrote:
Pros:
- Easier to build out and tear down clusters vs. using physical machines in a lab
- Easier to scale up and scale down a cluster as needed
Cons:
- Reliability. In my experience I've had machines die, had machines fail to start up, had network outages between Amazon instances, etc. These problems have occurred at a far more significant rate than in any physical lab I have ever administered.
- Money. You get charged for problems with their system. Need to add storage space to a node? That means renting space from EBS, which you then need to actually spend time formatting to ext3 so you can use it with Hadoop. So every time you want to use storage, you're paying Amazon to format it, because you can't tell EBS that you want an ext3 volume.
- Visibility. Amazon loves to report that all their services are working properly on their website; meanwhile, the reality is that they only report issues if they are extremely major. Just yesterday they reported increased latency on their us-east-1 region. In reality, increased latency meant 50% of my Amazon API calls were timing out, I could not create new instances, and for about 2 hours I could not destroy the instances I had already spun up. How's that for ya? Paying them for machines that they won't let me terminate...

That's the harsh reality of all VMs: you need to monitor and stamp on things that misbehave. The nice thing is, it's easy to do this; just get HTTP status pages and kill any VM that misbehaves. This is not a fault of EC2: any VM infrastructure has this feature. You can't control where your VMs come up, you are penalised by other CPU-heavy machines on the same server, and Amazon throttles the smaller machines a bit. But you
-don't pay for cluster time you don't need
-don't pay for ingress/egress for data you generate in the vendor's infrastructure (just storage)
-can be very agile with cluster size.
I have a talk on this topic for the curious, discussing a UI that is a bit more agile, but even there we deploy agents to every node to keep an eye on the state of the cluster. http://www.slideshare.net/steve_l/farming-hadoop-inthecloud http://blip.tv/file/3809976 Hadoop is designed to work well in a large-scale static cluster of fixed machines, with the reactions to failure -clients spin on server failure, servers blacklist bad clients- being the right ones to leave ops in control. In a virtual world you want the clients to see (somehow) if the master nodes have moved, and you want the servers to kill misbehaving VMs to save money and then create new ones. -Steve
Re: Hadoop Certification Programme
On 15/12/10 17:26, Konstantin Boudnik wrote: Hey, commit rights won't give you a nice looking certificate, would it? ;)

Depends on what Hudson says about the quality of your patches. I mean, if every commit breaks the build, it soon becomes public.
Hadoop File system performance counters
Hi, What do the following two file system counters associated with a job (and printed at the end of a job's execution) represent: FILE_BYTES_READ and FILE_BYTES_WRITTEN? How are they different from HDFS_BYTES_READ and HDFS_BYTES_WRITTEN? Thanks, Abhishek
Re: Hadoop File system performance counters
They represent the amount of data written to the physical disk on the slaves as intermediate files, before or during the shuffle phase, whereas the HDFS bytes are the files written back into HDFS containing the data you wish to see. J

On 2010-12-15, at 10:37 AM, abhishek sharma wrote: Hi, What do the following two file system counters associated with a job (and printed at the end of a job's execution) represent: FILE_BYTES_READ and FILE_BYTES_WRITTEN? How are they different from HDFS_BYTES_READ and HDFS_BYTES_WRITTEN? Thanks, Abhishek
Re: Hadoop Certification Programme
On Wed, Dec 15, 2010 at 09:35, Steve Loughran ste...@apache.org wrote: On 15/12/10 17:26, Konstantin Boudnik wrote: Hey, commit rights won't give you a nice looking certificate, would it? ;) Depends on what Hudson says about the quality of your patches. I mean, if every commit breaks the build, it soon becomes public.

Right, the key words of my post were 'nice looking'.
Inclusion of MR-1938 in CDH3b4
If you would like the MR-1938 patch (see link below), "Ability for having user's classes take precedence over the system classes for tasks' classpath", to be included in the CDH3b4 release, please put in a vote on https://issues.cloudera.org/browse/DISTRO-64. The details about the fix are here: https://issues.apache.org/jira/browse/MAPREDUCE-1938 Roger
Re: Inclusion of MR-1938 in CDH3b4
Hey Roger, Thanks for the input. We're glad to see the community expressing their priorities on our JIRA. I noticed you also sent this to cdh-user, which is the more appropriate list. CDH-specific discussion should be kept off the ASF lists like common-user, which is meant for discussion about the upstream project. -Todd

On Wed, Dec 15, 2010 at 10:43 AM, Roger Smith rogersmith1...@gmail.com wrote: If you would like the MR-1938 patch (see link below), "Ability for having user's classes take precedence over the system classes for tasks' classpath", to be included in the CDH3b4 release, please put in a vote on https://issues.cloudera.org/browse/DISTRO-64. The details about the fix are here: https://issues.apache.org/jira/browse/MAPREDUCE-1938 Roger -- Todd Lipcon Software Engineer, Cloudera
Re: Inclusion of MR-1938 in CDH3b4
Hi Roger, Please use Cloudera's mailing list for communications regarding Cloudera distributions. Thanks, mahadev

On 12/15/10 10:43 AM, Roger Smith rogersmith1...@gmail.com wrote: If you would like the MR-1938 patch (see link below), "Ability for having user's classes take precedence over the system classes for tasks' classpath", to be included in the CDH3b4 release, please put in a vote on https://issues.cloudera.org/browse/DISTRO-64. The details about the fix are here: https://issues.apache.org/jira/browse/MAPREDUCE-1938 Roger
Re: Inclusion of MR-1938 in CDH3b4
Got it. On Wed, Dec 15, 2010 at 10:47 AM, Todd Lipcon t...@cloudera.com wrote: Hey Roger, Thanks for the input. We're glad to see the community expressing their priorities on our JIRA. I noticed you also sent this to cdh-user, which is the more appropriate list. CDH-specific discussion should be kept off the ASF lists like common-user, which is meant for discussion about the upstream project. -Todd On Wed, Dec 15, 2010 at 10:43 AM, Roger Smith rogersmith1...@gmail.com wrote: If you would like the MR-1938 patch (see link below), "Ability for having user's classes take precedence over the system classes for tasks' classpath", to be included in the CDH3b4 release, please put in a vote on https://issues.cloudera.org/browse/DISTRO-64. The details about the fix are here: https://issues.apache.org/jira/browse/MAPREDUCE-1938 Roger -- Todd Lipcon Software Engineer, Cloudera
Re: Inclusion of MR-1938 in CDH3b4
Apologies. On Wed, Dec 15, 2010 at 10:48 AM, Mahadev Konar maha...@yahoo-inc.com wrote: Hi Roger, Please use Cloudera's mailing list for communications regarding Cloudera distributions. Thanks, mahadev On 12/15/10 10:43 AM, Roger Smith rogersmith1...@gmail.com wrote: If you would like the MR-1938 patch (see link below), "Ability for having user's classes take precedence over the system classes for tasks' classpath", to be included in the CDH3b4 release, please put in a vote on https://issues.cloudera.org/browse/DISTRO-64. The details about the fix are here: https://issues.apache.org/jira/browse/MAPREDUCE-1938 Roger
Re: Deprecated ... damaged?
Actually, I just realized that numSplits can't be set definitively. Even if I write numSplits = 5, it's just a hint. Then how come MultiFileInputFormat claims to use MultiFileSplit to contain one file per split?? Or is that also just a hint? Maha

On Dec 15, 2010, at 2:13 AM, maha wrote: Hi everyone, Using Hadoop-0.20.2, I'm trying to use MultiFileInputFormat, which is supposed to put each file from the input directory in a SEPARATE split, so the number of maps is equal to the number of input files. Yet what I get is that each split contains multiple input file paths, hence # of maps < # of input files. Is it because MultiFileInputFormat is deprecated? In my myMultiFileInputFormat implementation I have only the following:

    public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf job, Reporter reporter) {
        return new myRecordReader((MultiFileSplit) split);
    }

Yet in myRecordReader, for example, one split has the following:

    /tmp/input/file1:0+300
    /tmp/input/file2:0+199

instead of each file being in its own split. Why? Any clues? Thank you, Maha
Re: Deprecated ... damaged?
On Dec 15, 2010, at 2:13 AM, maha wrote: Hi everyone, Using Hadoop-0.20.2, I'm trying to use MultiFileInputFormat which is supposed to put each file from the input directory in a SEPARATE split. Is there some reason you don't just use normal InputFormat with an extremely high min.split.size?
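For illustration, a minimal sketch of that approach against the old mapred API in 0.20.x (MyJob is a made-up driver class):

    JobConf conf = new JobConf(MyJob.class);
    // With the minimum split size larger than any input file,
    // FileInputFormat computes one split per file instead of
    // splitting at block boundaries.
    conf.setLong("mapred.min.split.size", Long.MAX_VALUE);

Note this keeps each file whole and, unlike MultiFileInputFormat's packing behavior, yields one map per file, which is what the original question wanted.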
Re: Hadoop Certification Programme
On Dec 15, 2010, at 9:26 AM, Konstantin Boudnik wrote: Hey, commit rights won't give you a nice looking certificate, would it? ;) Isn't that what Photoshop is for?
Re: How do I log from my map/reduce application?
W. P., How are you running your Reducer? Is everything running in standalone mode (all mappers/reducers in the same process as the launching application)? Or are you running this in pseudo-distributed mode or on a remote cluster? Depending on the application's configuration, log4j configuration could be read from one of many different places. Furthermore, where are you expecting your output? If you're running in pseudo-distributed (or fully distributed) mode, mapper/reducer tasks will not emit output back to the console of the launching application. That only happens in local mode. In the distributed flavors, you'll see a different file for each task attempt containing its log output, on the machine where the task executed. These files can be accessed through the web UI at http://jobtracker:50030/ -- click on the job, then the task, then the task attempt, then syslog in the right-most column. - Aaron

On Mon, Dec 13, 2010 at 10:05 AM, W.P. McNeill bill...@gmail.com wrote: I would like to use Hadoop's Log4j infrastructure to do logging from my map/reduce application. I think I've got everything set up correctly, but I am still unable to specify the logging level I want. By default Hadoop is set up to log at level INFO. The first line of its log4j.properties file looks like this:

    hadoop.root.logger=INFO,console

I have an application whose reducer looks like this:

    package com.me;
    public class MyReducer... extends Reducer... {
        private static Logger logger = Logger.getLogger(MyReducer.class.getName());
        ...
        protected void reduce(...) {
            logger.debug("My message");
            ...
        }
    }

I've added the following line to the Hadoop log4j.properties file:

    log4j.logger.com.me.MyReducer=DEBUG

I expect the Hadoop system to log at level INFO, but my application to log at level DEBUG, so that I see "My message" in the logs for the reducer task. However, my application does not produce any log4j output. If I change the line in my reducer to read logger.info("My message") the message does get logged, so somehow I'm failing to specify that log level for this class. I've also tried changing the log4j line for my app to read log4j.logger.com.me.MyReducer=DEBUG,console and get the same result. I've been through the Hadoop and log4j documentation and I can't figure out what I'm doing wrong. Any suggestions? Thanks.
Re: How do I log from my map/reduce application?
I'm running on a cluster. I'm trying to write to the log files on the cluster machines, the ones that are visible through the jobtracker web interface. The log4j file I gave excerpts from is a central one for the cluster.

On Wed, Dec 15, 2010 at 1:38 PM, Aaron Kimball akimbal...@gmail.com wrote: W. P., How are you running your Reducer? Is everything running in standalone mode (all mappers/reducers in the same process as the launching application)? Or are you running this in pseudo-distributed mode or on a remote cluster? Depending on the application's configuration, log4j configuration could be read from one of many different places. Furthermore, where are you expecting your output? If you're running in pseudo-distributed (or fully distributed) mode, mapper/reducer tasks will not emit output back to the console of the launching application. That only happens in local mode. In the distributed flavors, you'll see a different file for each task attempt containing its log output, on the machine where the task executed. These files can be accessed through the web UI at http://jobtracker:50030/ -- click on the job, then the task, then the task attempt, then syslog in the right-most column. - Aaron

On Mon, Dec 13, 2010 at 10:05 AM, W.P. McNeill bill...@gmail.com wrote: I would like to use Hadoop's Log4j infrastructure to do logging from my map/reduce application. I think I've got everything set up correctly, but I am still unable to specify the logging level I want. By default Hadoop is set up to log at level INFO. The first line of its log4j.properties file looks like this:

    hadoop.root.logger=INFO,console

I have an application whose reducer looks like this:

    package com.me;
    public class MyReducer... extends Reducer... {
        private static Logger logger = Logger.getLogger(MyReducer.class.getName());
        ...
        protected void reduce(...) {
            logger.debug("My message");
            ...
        }
    }

I've added the following line to the Hadoop log4j.properties file:

    log4j.logger.com.me.MyReducer=DEBUG

I expect the Hadoop system to log at level INFO, but my application to log at level DEBUG, so that I see "My message" in the logs for the reducer task. However, my application does not produce any log4j output. If I change the line in my reducer to read logger.info("My message") the message does get logged, so somehow I'm failing to specify that log level for this class. I've also tried changing the log4j line for my app to read log4j.logger.com.me.MyReducer=DEBUG,console and get the same result. I've been through the Hadoop and log4j documentation and I can't figure out what I'm doing wrong. Any suggestions? Thanks.
Re: How do I log from my map/reduce application?
How is the central log4j file made available to the tasks? After you make your changes to the configuration file, does it help if you restart the tasktrackers? You could also try setting the log level programmatically in your void setup(Context) method:

    @Override
    protected void setup(Context context) {
        logger.setLevel(Level.DEBUG);
    }

- Aaron

On Wed, Dec 15, 2010 at 2:23 PM, W.P. McNeill bill...@gmail.com wrote: I'm running on a cluster. I'm trying to write to the log files on the cluster machines, the ones that are visible through the jobtracker web interface. The log4j file I gave excerpts from is a central one for the cluster. On Wed, Dec 15, 2010 at 1:38 PM, Aaron Kimball akimbal...@gmail.com wrote: W. P., How are you running your Reducer? Is everything running in standalone mode (all mappers/reducers in the same process as the launching application)? Or are you running this in pseudo-distributed mode or on a remote cluster? Depending on the application's configuration, log4j configuration could be read from one of many different places. Furthermore, where are you expecting your output? If you're running in pseudo-distributed (or fully distributed) mode, mapper/reducer tasks will not emit output back to the console of the launching application. That only happens in local mode. In the distributed flavors, you'll see a different file for each task attempt containing its log output, on the machine where the task executed. These files can be accessed through the web UI at http://jobtracker:50030/ -- click on the job, then the task, then the task attempt, then syslog in the right-most column. - Aaron

On Mon, Dec 13, 2010 at 10:05 AM, W.P. McNeill bill...@gmail.com wrote: I would like to use Hadoop's Log4j infrastructure to do logging from my map/reduce application. I think I've got everything set up correctly, but I am still unable to specify the logging level I want. By default Hadoop is set up to log at level INFO. The first line of its log4j.properties file looks like this:

    hadoop.root.logger=INFO,console

I have an application whose reducer looks like this:

    package com.me;
    public class MyReducer... extends Reducer... {
        private static Logger logger = Logger.getLogger(MyReducer.class.getName());
        ...
        protected void reduce(...) {
            logger.debug("My message");
            ...
        }
    }

I've added the following line to the Hadoop log4j.properties file:

    log4j.logger.com.me.MyReducer=DEBUG

I expect the Hadoop system to log at level INFO, but my application to log at level DEBUG, so that I see "My message" in the logs for the reducer task. However, my application does not produce any log4j output. If I change the line in my reducer to read logger.info("My message") the message does get logged, so somehow I'm failing to specify that log level for this class. I've also tried changing the log4j line for my app to read log4j.logger.com.me.MyReducer=DEBUG,console and get the same result. I've been through the Hadoop and log4j documentation and I can't figure out what I'm doing wrong. Any suggestions? Thanks.
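A self-contained sketch of that programmatic fallback (the class and key/value types are illustrative, not W.P.'s actual job):

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;

    public class MyReducer extends Reducer<Text, Text, Text, Text> {
        private static final Logger logger = Logger.getLogger(MyReducer.class);

        @Override
        protected void setup(Context context) {
            // Force the level from inside the task, regardless of which
            // log4j.properties the tasktracker's JVM actually loaded.
            logger.setLevel(Level.DEBUG);
        }

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) {
            logger.debug("My message"); // should now appear in the task attempt's syslog
        }
    }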
Re: Deprecated ... damaged?
Hi Allen, and thanks for responding. Your answer actually gave me another clue: I set numSplits = numFiles*100; in myInputFormat and it worked :D ... Do you think there are side effects to doing that? Thank you, Maha

On Dec 15, 2010, at 12:16 PM, Allen Wittenauer wrote: On Dec 15, 2010, at 2:13 AM, maha wrote: Hi everyone, Using Hadoop-0.20.2, I'm trying to use MultiFileInputFormat, which is supposed to put each file from the input directory in a SEPARATE split. Is there some reason you don't just use normal InputFormat with an extremely high min.split.size?
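For what it's worth, the multiplier trick works because it pushes the average split size below the file sizes, so the packing heuristic puts at most one file in each split. A more direct route (a sketch against the 0.20 mapred API, untested) is to override getSplits and build one MultiFileSplit per file yourself:

    import java.io.IOException;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class MyMultiFileInputFormat extends MultiFileInputFormat<LongWritable, Text> {
        @Override
        public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
            // Ignore the numSplits hint entirely: one MultiFileSplit per input file.
            FileStatus[] files = listStatus(job);
            InputSplit[] splits = new InputSplit[files.length];
            for (int i = 0; i < files.length; i++) {
                splits[i] = new MultiFileSplit(job,
                    new Path[] { files[i].getPath() },
                    new long[] { files[i].getLen() });
            }
            return splits;
        }

        @Override
        public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf job,
                Reporter reporter) throws IOException {
            return new myRecordReader((MultiFileSplit) split); // myRecordReader: Maha's reader
        }
    }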
Is it possible to change from Iterable<VALUEIN> to ResettableIterator<VALUEIN> in Reducer?
Hi all, I just want to know: is it possible to allow an iterator to be repeatedly reused? Shen
Hadoop upgrade [Do we need to have same value for dfs.name.dir ] while upgrading
Hi, I am trying to upgrade Hadoop. As part of this I have set two environment variables, NEW_HADOOP_INSTALL and OLD_HADOOP_INSTALL. After this I have executed the following command:

    % $NEW_HADOOP_INSTALL/bin/start-dfs.sh -upgrade

But the namenode did not start, as it was throwing an inconsistent-state exception because dfs.name.dir is not present. My question is: while upgrading, do we need to keep the same old configuration values like dfs.name.dir, etc.? Or do I need to format the namenode first and then start upgrading? Please let me know. Thanks, sandeep
Re: Hadoop upgrade [Do we need to have same value for dfs.name.dir ] while upgrading
sandeep wrote: Hi, I am trying to upgrade Hadoop. As part of this I have set two environment variables, NEW_HADOOP_INSTALL and OLD_HADOOP_INSTALL. After this I have executed the following command: % $NEW_HADOOP_INSTALL/bin/start-dfs.sh -upgrade But the namenode did not start, as it was throwing an inconsistent-state exception because dfs.name.dir is not present. My question is: while upgrading, do we need to keep the same old configuration values like dfs.name.dir, etc.? Or do I need to format the namenode first and then start upgrading? Please let me know. Thanks, sandeep

Sandeep, this error occurs due to a namespace issue in the new Hadoop. Did you copy dfs.name.dir and fs.checkpoint.dir to the new Hadoop directory? A namenode format would cause you to lose all previous data. Best Regards, Adarsh Sharma
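In other words, the new installation's configuration must point at the very same metadata directories the old one used; a minimal sketch (the paths are illustrative):

    <property>
      <name>dfs.name.dir</name>
      <value>/var/hadoop/dfs/name</value> <!-- hypothetical path; must match the old install -->
    </property>
    <property>
      <name>fs.checkpoint.dir</name>
      <value>/var/hadoop/dfs/namesecondary</value> <!-- hypothetical path; must match the old install -->
    </property>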
Re: Question from a Desperate Java Newbie
I totally obey the robots.txt since I am only fetching RSS feeds :-) I implemented my crawler with HttpClient and it is working fine. I often get messages about "Cookie rejected", but am able to fetch news articles anyway. I guess the default java.net client is the stateful client you mentioned. Thanks for the tip!! Ed

On Dec 16, 2010, at 2:18 AM, Steve Loughran ste...@apache.org wrote: On 10/12/10 09:08, Edward Choi wrote: I was wrong. It wasn't because of the read-once free policy. I tried again with Java first and this time it didn't work. I looked up Google and found the HttpClient you mentioned. It is the one provided by Apache, right? I guess I will have to try that one now. Thanks! HttpClient is good. HtmlUnit has a very good client that can simulate things like a full web browser with cookies, but that may be overkill. NYT's read-once policy uses cookies to verify that you are there for the first day not logged in; for later days you get 302'd unless you delete the cookie, so stateful clients are bad. What you may have been hit by is whatever robot trap they have - if you generate too much load and don't follow the robots.txt rules, they may detect this and push back.
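For reference, the basic HttpClient usage in question looks roughly like this (a minimal sketch using the 4.x API with a made-up feed URL):

    import org.apache.http.HttpResponse;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.DefaultHttpClient;
    import org.apache.http.util.EntityUtils;

    public class FeedFetcher {
        public static void main(String[] args) throws Exception {
            HttpClient client = new DefaultHttpClient();             // keeps cookie state per instance
            HttpGet get = new HttpGet("http://example.com/rss.xml"); // hypothetical feed URL
            HttpResponse response = client.execute(get);
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }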
Re: how to run jobs every 30 minutes?
That clears the confusion. Thanks. There are just too many tools for Hadoop :-)

2010/12/14 Alejandro Abdelnur t...@cloudera.com Ed, Actually Oozie is quite different from Cascading.
* Cascading allows you to write 'queries' using a Java API and they get translated into MR jobs.
* Oozie allows you to compose sequences of MR/Pig/Hive/Java/SSH jobs in a DAG (workflow jobs) and has timer+data dependency triggers (coordinator jobs).
Regards. Alejandro

On Tue, Dec 14, 2010 at 1:26 PM, edward choi mp2...@gmail.com wrote: Thanks for the tip. I took a look at it. Looks similar to Cascading I guess...? Anyway thanks for the info!! Ed 2010/12/8 Alejandro Abdelnur t...@cloudera.com Or, if you want to do it in a reliable way you could use an Oozie coordinator job. On Wed, Dec 8, 2010 at 1:53 PM, edward choi mp2...@gmail.com wrote: My mistake. Come to think about it, you are right, I can just make an infinite loop inside the Hadoop application. Thanks for the reply. 2010/12/7 Harsh J qwertyman...@gmail.com Hi, On Tue, Dec 7, 2010 at 2:25 PM, edward choi mp2...@gmail.com wrote: Hi, I'm planning to crawl a certain web site every 30 minutes. How would I get it done in Hadoop? In pure Java, I used the Thread.sleep() method, but I guess this won't work in Hadoop. Why wouldn't it? You need to manage your post-job logic mostly, but sleep and resubmission should work just fine. Or if it could work, could anyone show me an example? Ed. -- Harsh J www.harshj.com
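Harsh's sleep-and-resubmit suggestion from the bottom of the thread amounts to something like this (a sketch against the old mapred API; createCrawlJob is a made-up helper):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class CrawlDriver {
        public static void main(String[] args) throws Exception {
            while (true) {
                JobConf conf = createCrawlJob(); // hypothetical: configures one crawl pass
                JobClient.runJob(conf);          // blocks until the job completes
                Thread.sleep(30 * 60 * 1000L);   // then wait half an hour before the next run
            }
        }

        // Hypothetical factory; the real setup depends on the crawler.
        private static JobConf createCrawlJob() {
            JobConf conf = new JobConf(CrawlDriver.class);
            conf.setJobName("rss-crawl");
            // ... input/output paths, mapper class, etc. ...
            return conf;
        }
    }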
Re: how to run jobs every 30 minutes?
This one doesn't seem so complex for even a newbie like myself. Thanks!!! 2010/12/14 Ted Dunning tdunn...@maprtech.com Or even simpler, try Azkaban: http://sna-projects.com/azkaban/ On Mon, Dec 13, 2010 at 9:26 PM, edward choi mp2...@gmail.com wrote: Thanks for the tip. I took a look at it. Looks similar to Cascading I guess...? Anyway thanks for the info!! Ed 2010/12/8 Alejandro Abdelnur t...@cloudera.com Or, if you want to do it in a reliable way you could use an Oozie coordinator job. On Wed, Dec 8, 2010 at 1:53 PM, edward choi mp2...@gmail.com wrote: My mistake. Come to think about it, you are right, I can just make an infinite loop inside the Hadoop application. Thanks for the reply. 2010/12/7 Harsh J qwertyman...@gmail.com Hi, On Tue, Dec 7, 2010 at 2:25 PM, edward choi mp2...@gmail.com wrote: Hi, I'm planning to crawl a certain web site every 30 minutes. How would I get it done in Hadoop? In pure Java, I used Thread.sleep() method, but I guess this won't work in Hadoop. Why wouldn't it? You need to manage your post-job logic mostly, but sleep and resubmission should work just fine. Or if it could work, could anyone show me an example? Ed. -- Harsh J www.harshj.com
RE: Hadoop upgrade [Do we need to have same value for dfs.name.dir ] while upgrading
Thanks Adarsh. I have done the following: for NEW_HADOOP_INSTALL (the new Hadoop version installation) I have set the same values for dfs.name.dir and fs.checkpoint.dir that I had configured in OLD_HADOOP_INSTALL (the old Hadoop version installation). Now it is working. Thanks, sandeep

-Original Message- From: Adarsh Sharma [mailto:adarsh.sha...@orkash.com] Sent: Thursday, December 16, 2010 11:42 AM To: common-user@hadoop.apache.org Subject: Re: Hadoop upgrade [Do we need to have same value for dfs.name.dir ] while upgrading

sandeep wrote: Hi, I am trying to upgrade Hadoop. As part of this I have set two environment variables, NEW_HADOOP_INSTALL and OLD_HADOOP_INSTALL. After this I have executed the following command: % $NEW_HADOOP_INSTALL/bin/start-dfs.sh -upgrade But the namenode did not start, as it was throwing an inconsistent-state exception because dfs.name.dir is not present. My question is: while upgrading, do we need to keep the same old configuration values like dfs.name.dir, etc.? Or do I need to format the namenode first and then start upgrading? Please let me know. Thanks, sandeep

Sandeep, this error occurs due to a namespace issue in the new Hadoop. Did you copy dfs.name.dir and fs.checkpoint.dir to the new Hadoop directory? A namenode format would cause you to lose all previous data. Best Regards, Adarsh Sharma
Re: how to run jobs every 30 minutes?
The first recommendation (gluing all my command line apps) is what I am currently using. The other ones you mentioned are just out of my league right now, since I am quite new to the Java world, not to mention JRuby, Groovy, Jython, etc. But when I get comfortable with the environment and start to look for more options I'll refer to your message. Thanks for the advanced info :-)

2010/12/15 Chris K Wensel ch...@wensel.net I see it this way. You can glue a bunch of discrete command line apps together that may or may not have dependencies between one another in a new syntax. Which is darn nice if you already have a bunch of discrete ready-to-run command line apps sitting around that need to be strung together, that can't be used as libraries and instantiated through their APIs. Or, you can string all your work together through the APIs with a Turing-complete language and run them all from a single command line interface (and hand that to cron, or some other tool). In this case you can use Java, or easier languages like JRuby, Groovy, Jython, Clojure, etc., which were designed for this purpose. (They don't run on the cluster; they only run Hadoop client side.) Think Ant vs Gradle (or any other build tool that uses a scripting language and not a configuration file) if you want a concrete example. Cascading itself is a query API (and query planner). But it also exposes to the user the ability to run discrete 'processes' in dependency order for you. Either Cascading (Hadoop) Flows or Riffle-annotated process objects. They all can be intermingled and managed from the same dependency scheduler. Cascading has one, and Riffle has one. So you can run Flow - Mahout - Pig - Mahout - Flow - shell - whattheheckever from the same application. Cascading also has the ability to only run 'stale' processes. Think 'make' file. When re-running a job where only one file of many has changed, this is a big win. I personally like parameterizing my applications via the command line and letting my CLI options drive the workflows. For example, my testing, integration, and production environments are much different, so it's very easy to drive specific runs of the jobs by changing a CLI arg. (args4j makes this darn simple.) If I am chaining multiple CLI apps into a bigger production app, parameterizing that, I suspect, will be error prone, especially if the input/output data points (jdbc vs file) are different in different contexts. You can find Riffle here, https://github.com/cwensel/riffle (it's Apache Licensed, contributions welcomed) ckw

On Dec 14, 2010, at 1:30 AM, Alejandro Abdelnur wrote: Ed, Actually Oozie is quite different from Cascading.
* Cascading allows you to write 'queries' using a Java API and they get translated into MR jobs.
* Oozie allows you to compose sequences of MR/Pig/Hive/Java/SSH jobs in a DAG (workflow jobs) and has timer+data dependency triggers (coordinator jobs).
Regards. Alejandro

On Tue, Dec 14, 2010 at 1:26 PM, edward choi mp2...@gmail.com wrote: Thanks for the tip. I took a look at it. Looks similar to Cascading I guess...? Anyway thanks for the info!! Ed 2010/12/8 Alejandro Abdelnur t...@cloudera.com Or, if you want to do it in a reliable way you could use an Oozie coordinator job. On Wed, Dec 8, 2010 at 1:53 PM, edward choi mp2...@gmail.com wrote: My mistake. Come to think about it, you are right, I can just make an infinite loop inside the Hadoop application. Thanks for the reply.
2010/12/7 Harsh J qwertyman...@gmail.com Hi, On Tue, Dec 7, 2010 at 2:25 PM, edward choi mp2...@gmail.com wrote: Hi, I'm planning to crawl a certain web site every 30 minutes. How would I get it done in Hadoop? In pure Java, I used Thread.sleep() method, but I guess this won't work in Hadoop. Why wouldn't it? You need to manage your post-job logic mostly, but sleep and resubmission should work just fine. Or if it could work, could anyone show me an example? Ed. -- Harsh J www.harshj.com -- Chris K Wensel ch...@concurrentinc.com http://www.concurrentinc.com -- Concurrent, Inc. offers mentoring, support, and licensing for Cascading
How to Speed Up Decommissioning progress of a datanode.
Hi, Does anyone know how to speed up datanode decommissioning, and what are all the configurations related to decommissioning? How can I speed up data transfer from the datanode being decommissioned? Thanks & Regards, Sravan kumar.
Re: How to Speed Up Decommissioning progress of a datanode.
sravankumar wrote: Hi, Does anyone know how to speed up datanode decommissioning, and what are all the configurations related to decommissioning? How can I speed up data transfer from the datanode being decommissioned? Thanks & Regards, Sravan kumar.

Check the attachment --Adarsh

Balancing Data among Datanodes:
HDFS will not move blocks to new nodes automatically. However, newly created files will likely have their blocks placed on the new nodes. There are several ways to rebalance the cluster manually.
- Select a subset of files that take up a good percentage of your disk space; copy them to new locations in HDFS; remove the old copies of the files; rename the new copies to their original names.
- A simpler way, with no interruption of service, is to turn up the replication of files, wait for transfers to stabilize, and then turn the replication back down.
- Yet another way to rebalance blocks is to turn off the datanode that is full, wait until its blocks are replicated, and then bring it back again. The over-replicated blocks will be randomly removed from different nodes, so you really get them rebalanced, not just removed from the current node.
- Finally, you can use the bin/start-balancer.sh command to run a balancing process to move blocks around the cluster automatically:

    bash-3.2$ bin/start-balancer.sh
    or
    $ bin/hadoop balancer -threshold 10

    starting balancer, logging to /home/hadoop/project/hadoop-0.20.2/bin/../logs/hadoop-hadoop-balancer-ws-test.out
    Time Stamp    Iteration#    Bytes Already Moved    Bytes Left To Move    Bytes Being Moved
    The cluster is balanced. Exiting...
    Balancing took 350.0 milliseconds

A cluster is balanced iff there is no under-capacity or over-capacity datanode in the cluster. An under-capacity datanode is one whose %used space is less than avg_%used_space - threshold; an over-capacity datanode is one whose %used space is greater than avg_%used_space + threshold. The threshold is user-configurable; a default value could be 20% of used space.
Re: How to Speed Up Decommissioning progress of a datanode.
You can use metasave to check the bottleneck of the decommission speed. If the bottleneck is the speed of namenode dispatch, you can tune dfs.max-repl-streams to a larger number (the default is 2). If many block replication tasks time out and move from the pending-replication queue back to the needs-replication queue, you can tune dfs.replication.pending.timeout.sec to a smaller number to make block replication more aggressive. Pay attention!! Please check your Hadoop version; if block transfer has no speed limit, the bandwidth may be stuffed full. Thanks, Best regards, Baggio

2010/12/16 sravankumar sravanku...@huawei.com Hi, Does anyone know how to speed up datanode decommissioning, and what are all the configurations related to decommissioning? How can I speed up data transfer from the datanode being decommissioned? Thanks & Regards, Sravan kumar.
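For reference, those two knobs would go into the namenode's configuration; a minimal sketch with made-up values (examples, not recommendations):

    <property>
      <name>dfs.max-repl-streams</name>
      <value>8</value> <!-- hypothetical value; the default is 2 -->
    </property>
    <property>
      <name>dfs.replication.pending.timeout.sec</name>
      <value>60</value> <!-- hypothetical value; lower retries pending replications sooner -->
    </property>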