Deprecated ... damaged?

2010-12-15 Thread maha
Hi everyone, Using Hadoop-0.20.2, I'm trying to use MultiFileInputFormat, which is supposed to put each file from the input directory in a SEPARATE split, so that the number of maps equals the number of input files. Yet what I get is that each split contains the paths of multiple input files,

Re: Hive import question

2010-12-15 Thread Mark
Exactly what I was looking for. Thanks On 12/14/10 8:53 PM, 김영우 wrote: Hi Mark, You can use 'External table' in Hive. http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL Hive external table does not move or delete files. - Youngwoo

Hive Partitioning

2010-12-15 Thread Mark
Can someone explain what partitioning is and why it would be used, with an example? Thanks

Re: Hive Partitioning

2010-12-15 Thread Hari Sreekumar
Hi Mark, I think you will get more and better responses for this question in the hive mailing lists. (http://hive.apache.org/mailing_lists.html) Regards, Hari On Wed, Dec 15, 2010 at 8:52 PM, Mark static.void@gmail.com wrote: Can someone explain what partitioning is and why it would

Re: Hadoop Certification Programme

2010-12-15 Thread Steve Loughran
On 09/12/10 03:40, Matthew John wrote: Hi all. Is there any valid Hadoop Certification available? Something which adds credibility to your Hadoop expertise. Well, there's always providing enough patches to the code to get commit rights :)

Re: Hadoop/Elastic MR on AWS

2010-12-15 Thread Steve Loughran
On 10/12/10 06:14, Amandeep Khurana wrote: Mark, Using EMR makes it very easy to start a cluster and add/reduce capacity as and when required. There are certain optimizations that make EMR an attractive choice compared to building out your own cluster. Using EMR also ensures you are using a

Re: Question from a Desperate Java Newbie

2010-12-15 Thread Steve Loughran
On 10/12/10 09:08, Edward Choi wrote: I was wrong. It wasn't because of the read once free policy. I tried again with Java first and this time it didn't work. I looked on Google and found the HttpClient you mentioned. It is the one provided by Apache, right? I guess I will have to try

Re: Hadoop Certification Programme

2010-12-15 Thread Konstantin Boudnik
Hey, commit rights won't give you a nice looking certificate, would it? ;) On Wed, Dec 15, 2010 at 09:12, Steve Loughran ste...@apache.org wrote: On 09/12/10 03:40, Matthew John wrote: Hi all. Is there any valid Hadoop Certification available? Something which adds credibility to your

Re: Hadoop Certification Programme

2010-12-15 Thread James Seigel
But it would give you the right creds for people that you’d want to work for :) James On 2010-12-15, at 10:26 AM, Konstantin Boudnik wrote: Hey, commit rights won't give you a nice looking certificate, would it? ;) On Wed, Dec 15, 2010 at 09:12, Steve Loughran ste...@apache.org wrote: On

Re: Hadoop/Elastic MR on AWS

2010-12-15 Thread Steve Loughran
On 09/12/10 18:57, Aaron Eng wrote: Pros: - Easier to build out and tear down clusters vs. using physical machines in a lab - Easier to scale up and scale down a cluster as needed Cons: - Reliability. In my experience I've had machines die, had machines fail to start up, had network outages

Re: Hadoop Certification Programme

2010-12-15 Thread Steve Loughran
On 15/12/10 17:26, Konstantin Boudnik wrote: Hey, commit rights won't give you a nice looking certificate, would it? ;) Depends on what Hudson says about the quality of your patches. I mean, if every commit breaks the build, it soon becomes public

Hadoop File system performance counters

2010-12-15 Thread abhishek sharma
Hi, What do the following two File System counters associated with a job (and printed at the end of a job's execution) represent? FILE_BYTES_READ and FILE_BYTES_WRITTEN How are they different from HDFS_BYTES_READ and HDFS_BYTES_WRITTEN? Thanks, Abhishek

Re: Hadoop File system performance counters

2010-12-15 Thread James Seigel
They represent the amount of data written to the physical disk on the slaves as intermediate files, before or during the shuffle phase, whereas the HDFS bytes cover the files written back into HDFS containing the data you wish to see. J On 2010-12-15, at 10:37 AM, abhishek sharma wrote: Hi, What do
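
For reference, a minimal sketch of pulling those counters out of a finished job with the old 0.20 "mapred" API (the class name, job name, and input/output paths are placeholders, the job is left as the default identity job, and "FileSystemCounters" is the counter group name as it appears in 0.20):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.*;

    public class CounterDump {
      public static void main(String[] args) throws Exception {
        // FILE_BYTES_* cover local-disk I/O such as spilled/merged map output;
        // HDFS_BYTES_* cover reads and writes against HDFS itself.
        JobConf conf = new JobConf(CounterDump.class);
        conf.setJobName("counter-dump");
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        RunningJob job = JobClient.runJob(conf);   // blocks until the job completes
        Counters c = job.getCounters();
        System.out.println("FILE_BYTES_READ = "
            + c.findCounter("FileSystemCounters", "FILE_BYTES_READ").getValue());
        System.out.println("HDFS_BYTES_READ = "
            + c.findCounter("FileSystemCounters", "HDFS_BYTES_READ").getValue());
      }
    }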

Re: Hadoop Certification Programme

2010-12-15 Thread Konstantin Boudnik
On Wed, Dec 15, 2010 at 09:35, Steve Loughran ste...@apache.org wrote: On 15/12/10 17:26, Konstantin Boudnik wrote: Hey, commit rights won't give you a nice looking certificate, would it? ;) Depends on what Hudson says about the quality of your patches. I mean, if every commit breaks the

Inclusion of MR-1938 in CDH3b4

2010-12-15 Thread Roger Smith
If you would like the MR-1938 patch (see link below), "Ability for having user's classes take precedence over the system classes for tasks' classpath", to be included in the CDH3b4 release, please put in a vote on https://issues.cloudera.org/browse/DISTRO-64. The details about the fix are here:

Re: Inclusion of MR-1938 in CDH3b4

2010-12-15 Thread Todd Lipcon
Hey Roger, Thanks for the input. We're glad to see the community expressing their priorities on our JIRA. I noticed you also sent this to cdh-user, which is the more appropriate list. CDH-specific discussion should be kept off the ASF lists like common-user, which is meant for discussion about

Re: Inclusion of MR-1938 in CDH3b4

2010-12-15 Thread Mahadev Konar
Hi Roger, Please use Cloudera's mailing list for communications regarding Cloudera distributions. Thanks mahadev On 12/15/10 10:43 AM, Roger Smith rogersmith1...@gmail.com wrote: If you would like the MR-1938 patch (see link below), Ability for having user's classes take precedence over the

Re: Inclusion of MR-1938 in CDH3b4

2010-12-15 Thread Roger Smith
Got it. On Wed, Dec 15, 2010 at 10:47 AM, Todd Lipcon t...@cloudera.com wrote: Hey Roger, Thanks for the input. We're glad to see the community expressing their priorities on our JIRA. I noticed you also sent this to cdh-user, which is the more appropriate list. CDH-specific discussion

Re: Inclusion of MR-1938 in CDH3b4

2010-12-15 Thread Roger Smith
Apologies. On Wed, Dec 15, 2010 at 10:48 AM, Mahadev Konar maha...@yahoo-inc.com wrote: Hi Roger, Please use Cloudera's mailing list for communications regarding Cloudera distributions. Thanks mahadev On 12/15/10 10:43 AM, Roger Smith rogersmith1...@gmail.com wrote: If you would like

Re: Deprecated ... damaged?

2010-12-15 Thread maha
Actually, I just realized that numSplits can't really be forced. Even if I write numSplits = 5, it's just a hint. Then how come MultiFileInputFormat claims to use MultiFileSplit to contain one file per split? Or is that also just a hint? Maha On Dec 15, 2010, at 2:13 AM, maha wrote: Hi

Re: Deprecated ... damaged?

2010-12-15 Thread Allen Wittenauer
On Dec 15, 2010, at 2:13 AM, maha wrote: Hi everyone, Using Hadoop-0.20.2, I'm trying to use MultiFileInputFormat which is supposed to put each file from the input directory in a SEPARATE split. Is there some reason you don't just use normal InputFormat with an extremely high

Re: Hadoop Certification Programme

2010-12-15 Thread Allen Wittenauer
On Dec 15, 2010, at 9:26 AM, Konstantin Boudnik wrote: Hey, commit rights won't give you a nice looking certificate, would it? ;) Isn't that what Photoshop is for?

Re: How do I log from my map/reduce application?

2010-12-15 Thread Aaron Kimball
W. P., How are you running your Reducer? Is everything running in standalone mode (all mappers/reducers in the same process as the launching application)? Or are you running this in pseudo-distributed mode or on a remote cluster? Depending on the application's configuration, log4j configuration

Re: How do I log from my map/reduce application?

2010-12-15 Thread W.P. McNeill
I'm running on a cluster. I'm trying to write to the log files on the cluster machines, the ones that are visible through the jobtracker web interface. The log4j file I gave excerpts from is a central one for the cluster. On Wed, Dec 15, 2010 at 1:38 PM, Aaron Kimball akimbal...@gmail.com

Re: How do I log from my map/reduce application?

2010-12-15 Thread Aaron Kimball
How is the central log4j file made available to the tasks? After you make your changes to the configuration file, does it help if you restart the task trackers? You could also try setting the log level programmatically in your void setup(Context) method: @Override protected void setup(Context
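
A minimal sketch of that approach, assuming the new "mapreduce" API and log4j 1.2 (the reducer class, its types, and the chosen level are placeholders):

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;

    public class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
      private static final Logger LOG = Logger.getLogger(MyReducer.class);

      @Override
      protected void setup(Context context) {
        // Raise this class's log level inside the task JVM, independent of the
        // cluster-wide log4j configuration file.
        LOG.setLevel(Level.DEBUG);
        LOG.debug("reducer setup complete");
      }
    }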

Re: Deprecated ... damaged?

2010-12-15 Thread maha
Hi Allen and thanks for responding .. Your answer actually gave me another clue: I set numSplits = numFiles*100; in myInputFormat and it worked :D ... Do you think there are side effects of doing that? Thank you, Maha On Dec 15, 2010, at 12:16 PM, Allen Wittenauer wrote:
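
For comparison, a rough sketch of getting exactly one file per split by overriding getSplits() directly instead of inflating the numSplits hint (Hadoop 0.20.2 old "mapred" API; the class name is made up and the record reader is left out):

    import java.io.IOException;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class OneFilePerSplitInputFormat extends MultiFileInputFormat<LongWritable, Text> {

      @Override
      public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
        // Ignore the numSplits hint and emit one MultiFileSplit per input file.
        FileStatus[] files = listStatus(job);
        InputSplit[] splits = new InputSplit[files.length];
        for (int i = 0; i < files.length; i++) {
          Path[] path = { files[i].getPath() };
          long[] length = { files[i].getLen() };
          splits[i] = new MultiFileSplit(job, path, length); // exactly one file per split
        }
        return splits;
      }

      @Override
      public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf job,
          Reporter reporter) throws IOException {
        // MultiFileInputFormat leaves the record reader abstract; plug your own in here.
        throw new UnsupportedOperationException("record reader omitted from this sketch");
      }
    }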

Is it possible to change from Iterable<VALUEIN> to ResettableIterator<VALUEIN> in Reducer?

2010-12-15 Thread ChingShen
Hi all, I just want to know: is it possible to allow an iterator to be reused repeatedly? Shen
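
As far as I know the reducer's Iterable<VALUEIN> is single-pass in the standard API; a common workaround, sketched below under the assumption that the values for one key fit in memory, is to deep-copy them into a list and iterate that list as often as needed:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class RepeatedPassReducer extends Reducer<Text, Text, Text, IntWritable> {
      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        // The framework reuses the same Text object on each iteration,
        // so deep-copy each value before caching it.
        List<Text> cached = new ArrayList<Text>();
        for (Text v : values) {
          cached.add(new Text(v));
        }
        // First pass over the cached list ...
        int count = cached.size();
        // ... and as many further passes as needed.
        context.write(key, new IntWritable(count));
      }
    }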

Hadoop upgrade [Do we need to have the same value for dfs.name.dir] while upgrading

2010-12-15 Thread sandeep
Hi, I am trying to upgrade Hadoop. As part of this I have set two environment variables, NEW_HADOOP_INSTALL and OLD_HADOOP_INSTALL. After this I executed the following command: % NEW_HADOOP_INSTALL/bin/start-dfs -upgrade But the namenode did not start, as it was throwing

Re: Hadoop upgrade [Do we need to have the same value for dfs.name.dir] while upgrading

2010-12-15 Thread Adarsh Sharma
sandeep wrote: Hi, I am trying to upgrade Hadoop. As part of this I have set two environment variables, NEW_HADOOP_INSTALL and OLD_HADOOP_INSTALL. After this I executed the following command: % NEW_HADOOP_INSTALL/bin/start-dfs -upgrade But the namenode did not start, as it

Re: Question from a Desperate Java Newbie

2010-12-15 Thread edward choi
I totally obey the robots.txt since I am only fetching RSS feeds :-) I implemented my crawler with HttpClient and it is working fine. I often get messages about Cookie rejected, but am able to fetch news articles anyway. I guess the default java.net client is the stateful client you mentioned.
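
For anyone curious, a bare-bones fetch with Apache Commons HttpClient 3.x might look like the sketch below (the feed URL is a placeholder, and cookie handling is left at the defaults, which is what produces those 'Cookie rejected' warnings):

    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.methods.GetMethod;

    public class FeedFetcher {
      public static void main(String[] args) throws Exception {
        HttpClient client = new HttpClient();
        GetMethod get = new GetMethod("http://example.com/feed.rss"); // placeholder URL
        try {
          int status = client.executeMethod(get);        // performs the GET
          if (status == 200) {
            String body = get.getResponseBodyAsString();  // the raw RSS XML
            System.out.println(body);
          }
        } finally {
          get.releaseConnection();                        // return the connection to the manager
        }
      }
    }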

Re: how to run jobs every 30 minutes?

2010-12-15 Thread edward choi
That clears the confusion. Thanks. There are just too many tools for Hadoop :-) 2010/12/14 Alejandro Abdelnur t...@cloudera.com Ed, Actually Oozie is quite different from Cascading. * Cascading allows you to write 'queries' using a Java API and they get translated into MR jobs. * Oozie

Re: how to run jobs every 30 minutes?

2010-12-15 Thread edward choi
This one doesn't seem so complex for even a newbie like myself. Thanks!!! 2010/12/14 Ted Dunning tdunn...@maprtech.com Or even simpler, try Azkaban: http://sna-projects.com/azkaban/ On Mon, Dec 13, 2010 at 9:26 PM, edward choi mp2...@gmail.com wrote: Thanks for the tip. I took a look at

RE: Hadoop upgrade [Do we need to have the same value for dfs.name.dir] while upgrading

2010-12-15 Thread sandeep
Thanks Adarsh. I have done the following: for NEW_HADOOP_INSTALL (the new Hadoop version installation) I set the same values for dfs.name.dir and fs.checkpoint that I had configured in OLD_HADOOP_INSTALL (the old Hadoop version installation). Now it is working. Thanks sandeep

Re: how to run jobs every 30 minutes?

2010-12-15 Thread edward choi
The first recommendation (gluing all my command line apps) is what I am currently using. The other ones you mentioned are just out of my league right now, since I am quite new to Java world, not to mention JRuby, Groovy, Jython, etc. But when I get comfortable with the environment and start to

How to Speed Up Decommissioning progress of a datanode.

2010-12-15 Thread sravankumar
Hi, Does anyone know how to speed up datanode decommissioning, and what are all the configurations related to decommissioning? How can data transfer from the datanode being decommissioned be sped up? Thanks Regards, Sravan kumar.

Re: How to Speed Up Decommissioning progress of a datanode.

2010-12-15 Thread Adarsh Sharma
sravankumar wrote: Hi, Does anyone know how to speed up datanode decommissioning, and what are all the configurations related to decommissioning? How can data transfer from the datanode being decommissioned be sped up? Thanks Regards, Sravan kumar.

Re: How to Speed Up Decommissioning progress of a datanode.

2010-12-15 Thread baggio liu
You can use metasave to check the bottleneck of decommission speed. If the bottleneck is the speed of namenode dispatch, you can tune dfs.max-repl-streams to a larger number (default 2). If there are many timed-out block replication tasks going from the pending-replication queue back to the needed-replication queue, you can