Deprecated ... damaged?

2010-12-15 Thread maha
Hi everyone,

  Using Hadoop-0.20.2, I'm trying to use MultiFileInputFormat, which is supposed 
to put each file from the input directory in a SEPARATE split, so the number of 
Maps is equal to the number of input files. Yet, what I get is that each split 
contains multiple paths of input files, hence # of maps is < # of input files. 
Is it because MultiFileInputFormat is deprecated?

  In my implemented myMultiFileInputFormat I have only the following:

public RecordReader<LongWritable, Text> getRecordReader(InputSplit split,
    JobConf job, Reporter reporter) throws IOException {
  return new myRecordReader((MultiFileSplit) split);
}

Yet, in myRecordReader, for example one split has the following:

   /tmp/input/file1:0+300
   /tmp/input/file2:0+199

  instead of each file in its own split.

Why? Any clues?

  Thank you,
  Maha
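
A minimal sketch of one way to get one file per split with the 0.20 mapred API:
override getSplits() so it emits a single-file MultiFileSplit per input file.
The class name here is made up, and myRecordReader stands in for the poster's
own record reader.

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MultiFileInputFormat;
import org.apache.hadoop.mapred.MultiFileSplit;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class OneFilePerSplitInputFormat extends MultiFileInputFormat<LongWritable, Text> {

  @Override
  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    // listStatus() expands the configured input directories into concrete files.
    FileStatus[] files = listStatus(job);
    InputSplit[] splits = new InputSplit[files.length];
    for (int i = 0; i < files.length; i++) {
      // One MultiFileSplit per file, covering the whole file.
      splits[i] = new MultiFileSplit(job,
          new Path[] { files[i].getPath() },
          new long[] { files[i].getLen() });
    }
    return splits;
  }

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(InputSplit split,
      JobConf job, Reporter reporter) throws IOException {
    return new myRecordReader((MultiFileSplit) split);
  }
}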
  

Re: Hive import question

2010-12-15 Thread Mark

Exactly what I was looking for. Thanks

On 12/14/10 8:53 PM, 김영우 wrote:

Hi Mark,

You can use 'External table' in Hive.
http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL

Hive external table does not move or delete files.

- Youngwoo

2010/12/15 Mark static.void@gmail.com


When I load a file from HDFS into Hive, I notice that the original file has
been removed. Is there any way to prevent this? If not, how can I go back
and dump it as a file again? Thanks



Hive Partitioning

2010-12-15 Thread Mark
Can someone explain what partitioning is and why it would be used, ideally 
with an example? Thanks


Re: Hive Partitioning

2010-12-15 Thread Hari Sreekumar
Hi Mark,

I think you will get more and better responses for this question in the
hive mailing lists. (http://hive.apache.org/mailing_lists.html)

Regards,
Hari

On Wed, Dec 15, 2010 at 8:52 PM, Mark static.void@gmail.com wrote:

 Can someone explain what partitioning is and why it would be used..
 example? Thanks



Re: Hadoop Certification Programme

2010-12-15 Thread Steve Loughran

On 09/12/10 03:40, Matthew John wrote:

Hi all,

Is there any valid Hadoop Certification available ? Something which adds
credibility to your Hadoop expertise.



Well, there's always providing enough patches to the code to get commit 
rights :)


Re: Hadoop/Elastic MR on AWS

2010-12-15 Thread Steve Loughran

On 10/12/10 06:14, Amandeep Khurana wrote:

Mark,

Using EMR makes it very easy to start a cluster and add/reduce capacity as
and when required. There are certain optimizations that make EMR an
attractive choice as compared to building your own cluster out. Using EMR
also ensures you are using a production quality, stable system backed by the
EMR engineers. You can always use bootstrap actions to put your own tweaked
version of Hadoop in there if you want to do that.

Also, you don't have to tear down your cluster after every job. You can set
the alive option when you start your cluster and it will stay there even
after your Hadoop job completes.

If you face any issues with EMR, send me a mail offline and I'll be happy to
help.



How different is your distro from the apache version?


Re: Question from a Desperate Java Newbie

2010-12-15 Thread Steve Loughran
On 10/12/10 09:08, Edward Choi wrote:
 I was wrong. It wasn't because of the read once free policy. I tried again 
 with Java first again and this time it didn't work.
 I looked up google and found the Http Client you mentioned. It is the one 
 provided by apache, right? I guess I will have to try that one now. Thanks!
 

httpclient is good, HtmlUnit has a very good client that can simulate
things like a full web browser with cookies, but that may be overkill.

NYT's read-once policy uses cookies to verify that you are there for the
first day not logged in; for later days you get 302'd unless you delete
the cookie, so stateful clients are bad.

What you may have been hit by is whatever robot trap they have - if you
generate too much load and don't follow the robots.txt rules, they may
detect this and push back.



Re: Hadoop Certification Programme

2010-12-15 Thread Konstantin Boudnik
Hey, commit rights won't give you a nice looking certificate, would it? ;)

On Wed, Dec 15, 2010 at 09:12, Steve Loughran ste...@apache.org wrote:
 On 09/12/10 03:40, Matthew John wrote:

 Hi all,.

 Is there any valid Hadoop Certification available ? Something which adds
 credibility to your Hadoop expertise.


 Well, there's always providing enough patches to the code to get commit
 rights :)



Re: Hadoop Certification Programme

2010-12-15 Thread James Seigel
But it would give you the right creds for people that you’d want to work for :)

James


On 2010-12-15, at 10:26 AM, Konstantin Boudnik wrote:

 Hey, commit rights won't give you a nice looking certificate, would it? ;)
 
 On Wed, Dec 15, 2010 at 09:12, Steve Loughran ste...@apache.org wrote:
 On 09/12/10 03:40, Matthew John wrote:
 
 Hi all,.
 
 Is there any valid Hadoop Certification available ? Something which adds
 credibility to your Hadoop expertise.
 
 
 Well, there's always providing enough patches to the code to get commit
 rights :)
 



Re: Hadoop/Elastic MR on AWS

2010-12-15 Thread Steve Loughran

On 09/12/10 18:57, Aaron Eng wrote:

Pros:
- Easier to build out and tear down clusters vs. using physical machines in
a lab
- Easier to scale up and scale down a cluster as needed

Cons:
- Reliability.  In my experience I've had machines die, had machines fail to
start up, had network outages between Amazon instances, etc.  These problems
have occurred at a far more significant rate than any physical lab I have
ever administered.
- Money. You get charged for problems with their system.  Need to add
storage space to a node?  That means renting space from EBS which you then
need to actually spend time formatting to ext3 so you can use it with
Hadoop.  So every time you want to use storage, you're paying Amazon to
format it because you can't tell EBS that you want an ext3 volume.
- Visibility.  Amazon loves to report that all their services are working
properly on their website, meanwhile, the reality is that they only report
issues if they are extremely major.  Just yesterday they reported increased
latency on their us-east-1 region.  In reality, increased latency meant that
50% of my Amazon API calls were timing out, I could not create new
instances, and for about 2 hours I could not destroy the instances I had
already spun up.  How's that for ya?  Paying them for machines that they
won't let me terminate...



That's the harsh reality of all VMs. You need to monitor and stamp on 
things that misbehave. The nice thing is: it's easy to do this - just get 
HTTP status pages and kill any VM that misbehaves.


This is not a fault of EC2: any VM infra has this feature. You can't 
control where your VMs come up, you are penalised by other cpu-heavy 
machines on the same server, and Amazon throttles the smaller machines a bit.


But you
 -don't pay for cluster time you don't need
 -don't pay for ingress/egress for data you generate in the vendor's 
infrastructure (just storage)
 -can be very agile with cluster size.

I have a talk on this topic for the curious, discussing a UI that is a 
bit more agile, but even there we deploy agents to every node to keep an 
eye on the state of the cluster.


http://www.slideshare.net/steve_l/farming-hadoop-inthecloud
http://blip.tv/file/3809976

Hadoop is designed to work well in a large-scale static cluster of fixed 
machines, where the standard reactions to failure - clients spin and retry 
when a server fails, servers blacklist misbehaving clients - are the right 
ones to leave ops in control. In a virtual world you want the clients to see 
(somehow) if the master nodes have moved, you want the servers to kill the 
misbehaving VMs to save money, and then create new ones.


-Steve


Re: Hadoop Certification Programme

2010-12-15 Thread Steve Loughran

On 15/12/10 17:26, Konstantin Boudnik wrote:

Hey, commit rights won't give you a nice looking certificate, would it? ;)



Depends on what Hudson says about the quality of your patches. I mean, 
if every commit breaks the build, it soon becomes public.


Hadoop File system performance counters

2010-12-15 Thread abhishek sharma
Hi,

What do the following two File System counters associated with a job
(and printed at the end of a job's execution) represent?

FILE_BYTES_READ and FILE_BYTES_WRITTEN

How are they different from the HDFS_BYTES_READ and HDFS_BYTES_WRITTEN?

Thanks,
Abhishek


Re: Hadoop File system performance counters

2010-12-15 Thread James Seigel
They represent the amount of data written to the physical disk on the slaves as 
intermediate files before or during the shuffle phase, whereas the HDFS bytes are 
for the files written back into HDFS containing the data you wish to see.

J

On 2010-12-15, at 10:37 AM, abhishek sharma wrote:

 Hi,
 
 What do the following two File Sytem counters associated with a job
 (and printed at the end of a job's execution) represent?
 
 FILE_BYTES_READ and FILE_BYTES_WRITTEN
 
 How are they different from the HDFS_BYTES_READ and HDFS_BYTES_WRITTEN?
 
 Thanks,
 Abhishek
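
As an illustration, a small sketch (old mapred API; the "FileSystemCounters"
group and counter names are assumed from the 0.20-era job output, and the class
name is made up) of reading those counters from a finished job:

import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.RunningJob;

public class PrintFsCounters {
  // Prints local-disk vs HDFS byte counters for a completed job.
  public static void print(RunningJob job) throws java.io.IOException {
    Counters counters = job.getCounters();
    long fileRead  = counters.findCounter("FileSystemCounters", "FILE_BYTES_READ").getCounter();
    long fileWrite = counters.findCounter("FileSystemCounters", "FILE_BYTES_WRITTEN").getCounter();
    long hdfsRead  = counters.findCounter("FileSystemCounters", "HDFS_BYTES_READ").getCounter();
    long hdfsWrite = counters.findCounter("FileSystemCounters", "HDFS_BYTES_WRITTEN").getCounter();
    System.out.println("local disk read/written: " + fileRead + "/" + fileWrite);
    System.out.println("HDFS read/written:       " + hdfsRead + "/" + hdfsWrite);
  }
}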



Re: Hadoop Certification Programme

2010-12-15 Thread Konstantin Boudnik
On Wed, Dec 15, 2010 at 09:35, Steve Loughran ste...@apache.org wrote:
 On 15/12/10 17:26, Konstantin Boudnik wrote:

 Hey, commit rights won't give you a nice looking certificate, would it? ;)


 Depends on what hudson says about the quality of your patches. I mean, if
 every commit breaks the build, it soon becomes public

Right, the key words of my post were 'nice looking'.


Inclusion of MR-1938 in CDH3b4

2010-12-15 Thread Roger Smith
If you would like the MR-1938 patch (see link below), "Ability for having user's
classes take precedence over the system classes for tasks' classpath", to
be included in the CDH3b4 release, please put in a vote on
https://issues.cloudera.org/browse/DISTRO-64.

The details about the fix are here:
https://issues.apache.org/jira/browse/MAPREDUCE-1938

Roger


Re: Inclusion of MR-1938 in CDH3b4

2010-12-15 Thread Todd Lipcon
Hey Roger,

Thanks for the input. We're glad to see the community expressing their
priorities on our JIRA.

I noticed you also sent this to cdh-user, which is the more
appropriate list. CDH-specific discussion should be kept off the ASF
lists like common-user, which is meant for discussion about the
upstream project.

-Todd

On Wed, Dec 15, 2010 at 10:43 AM, Roger Smith rogersmith1...@gmail.com wrote:
 If you would like MR-1938 patch (see link below), Ability for having user's
 classes take precedence over the system classes for tasks' classpath, to
 be included in CDH3b4 release, please put in a vote on
 https://issues.cloudera.org/browse/DISTRO-64.

 The details about the fix are here:
 https://issues.apache.org/jira/browse/MAPREDUCE-1938

 Roger




-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Inclusion of MR-1938 in CDH3b4

2010-12-15 Thread Mahadev Konar
Hi Roger, 
 Please use Cloudera's mailing list for communications regarding Cloudera
distributions.

Thanks
mahadev


On 12/15/10 10:43 AM, Roger Smith rogersmith1...@gmail.com wrote:

 If you would like MR-1938 patch (see link below), Ability for having user's
 classes take precedence over the system classes for tasks' classpath, to
 be included in CDH3b4 release, please put in a vote on
 https://issues.cloudera.org/browse/DISTRO-64.
 
 The details about the fix are here:
 https://issues.apache.org/jira/browse/MAPREDUCE-1938
 
 Roger
 



Re: Inclusion of MR-1938 in CDH3b4

2010-12-15 Thread Roger Smith
Got it.

On Wed, Dec 15, 2010 at 10:47 AM, Todd Lipcon t...@cloudera.com wrote:

 Hey Roger,

 Thanks for the input. We're glad to see the community expressing their
 priorities on our JIRA.

 I noticed you also sent this to cdh-user, which is the more
 appropriate list. CDH-specific discussion should be kept off the ASF
 lists like common-user, which is meant for discussion about the
 upstream project.

 -Todd

 On Wed, Dec 15, 2010 at 10:43 AM, Roger Smith rogersmith1...@gmail.com
 wrote:
  If you would like MR-1938 patch (see link below), Ability for having
 user's
  classes take precedence over the system classes for tasks' classpath, to
  be included in CDH3b4 release, please put in a vote on
  https://issues.cloudera.org/browse/DISTRO-64.
 
  The details about the fix are here:
  https://issues.apache.org/jira/browse/MAPREDUCE-1938
 
  Roger
 



 --
 Todd Lipcon
 Software Engineer, Cloudera



Re: Inclusion of MR-1938 in CDH3b4

2010-12-15 Thread Roger Smith
Apologies.

On Wed, Dec 15, 2010 at 10:48 AM, Mahadev Konar maha...@yahoo-inc.comwrote:

 Hi Roger,
  Please use cloudera¹s mailing list for communications regarding cloudera
 distributions.

 Thanks
 mahadev


 On 12/15/10 10:43 AM, Roger Smith rogersmith1...@gmail.com wrote:

  If you would like MR-1938 patch (see link below), Ability for having
 user's
  classes take precedence over the system classes for tasks' classpath, to
  be included in CDH3b4 release, please put in a vote on
  https://issues.cloudera.org/browse/DISTRO-64.
 
  The details about the fix are here:
  https://issues.apache.org/jira/browse/MAPREDUCE-1938
 
  Roger
 




Re: Deprecated ... damaged?

2010-12-15 Thread maha
Actually, I just realized that numSplits can't really be forced; even 
if I write numSplits = 5, it's just a hint. 

Then how come MultiFileInputFormat claims to use MultiFileSplit to contain one 
file per split? Or is that also just a hint?

Maha

On Dec 15, 2010, at 2:13 AM, maha wrote:

 Hi everyone,
 
  Using Hadoop-0.20.2, I'm trying to use MultiFileInputFormat which is 
 supposed to put each file from the input directory in a SEPARATE split. So 
 the number of Maps is equal to the number of input files. Yet, what I get is 
 that each split contains multiple paths of input files, hence # of maps is  
 # of input files. Is it because MultiFileInputFormat is deprecated?
 
  In my implemented myMultiFileInputFormat I have only the following:
 
 public RecordReaderLongWritable, Text getRecordReader(InputSplit split, 
 JobConf job, Reporter reporter){
   return (new myRecordReader((MultiFileSplit) split));
   }
 
 Yet, in myRecordReader, for example one split has the following;
 
   /tmp/input/file1:0+300
/tmp/input/file2:0+199  
 
  instead of each line in its own split.
 
Why? Any clues?
 
  Thank you,
  Maha



Re: Deprecated ... damaged?

2010-12-15 Thread Allen Wittenauer

On Dec 15, 2010, at 2:13 AM, maha wrote:

 Hi everyone,
 
  Using Hadoop-0.20.2, I'm trying to use MultiFileInputFormat which is 
 supposed to put each file from the input directory in a SEPARATE split.


Is there some reason you don't just use normal InputFormat with an 
extremely high min.split.size?
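
For concreteness, a small sketch of that suggestion, assuming the 0.20 mapred
API and the "mapred.min.split.size" property name of that era. Because
FileInputFormat never lets a split span files, an over-sized minimum split size
gives at most one split per file:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

public class WholeFileSplits {
  // Returns a JobConf where each input file becomes (at most) one map split.
  public static JobConf configure(Class<?> jobClass, String inputDir) {
    JobConf conf = new JobConf(jobClass);
    conf.setInputFormat(TextInputFormat.class);
    // Minimum split size larger than any file, so files are never broken up.
    conf.setLong("mapred.min.split.size", Long.MAX_VALUE);
    FileInputFormat.setInputPaths(conf, new Path(inputDir));
    return conf;
  }
}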



Re: Hadoop Certification Programme

2010-12-15 Thread Allen Wittenauer

On Dec 15, 2010, at 9:26 AM, Konstantin Boudnik wrote:

 Hey, commit rights won't give you a nice looking certificate, would it? ;)


Isn't that what Photoshop is for?



Re: How do I log from my map/reduce application?

2010-12-15 Thread Aaron Kimball
W. P.,

How are you running your Reducer? Is everything running in standalone mode
(all mappers/reducers in the same process as the launching application)? Or
are you running this in pseudo-distributed mode or on a remote cluster?

Depending on the application's configuration, log4j configuration could be
read from one of many different places.

Furthermore, where are you expecting your output? If you're running in
pseudo-distributed (or fully distributed) mode, mapper / reducer tasks will
not emit output back to the console of the launching application.  That only
happens in local mode. In the distributed flavors, you'll see a different
file for each task attempt containing its log output, on the machine where
the task executed. These files can be accessed through the web UI at
http://jobtracker:50030/ -- click on the job, then the task, then the task
attempt, then syslog in the right-most column.

- Aaron

On Mon, Dec 13, 2010 at 10:05 AM, W.P. McNeill bill...@gmail.com wrote:

 I would like to use Hadoop's Log4j infrastructure to do logging from my
 map/reduce application.  I think I've got everything set up correctly, but
 I
 am still unable to specify the logging level I want.

 By default Hadoop is set up to log at level INFO.  The first line of its
 log4j.properties file looks like this:

 hadoop.root.logger=INFO,console


 I have an application whose reducer looks like this:

 package com.me;

  public class MyReducer... extends Reducer... {
    private static Logger logger =
        Logger.getLogger(MyReducer.class.getName());

    ...
    protected void reduce(...) {
      logger.debug("My message");
      ...
    }
  }


 I've added the following line to the Hadoop log4j.properties file:

 log4j.logger.com.me.MyReducer=DEBUG


  I expect the Hadoop system to log at level INFO, but my application to log
  at level DEBUG, so that I see "My message" in the logs for the reducer
  task. However, my application does not produce any log4j output.  If I change
  the line in my reducer to read logger.info("My message") the message does get
  logged, so somehow I'm failing to specify that log level for this class.

 I've also tried changing the log4j line for my app to
 read log4j.logger.com.me.MyReducer=DEBUG,console and get the same result.

 I've been through the Hadoop and log4j documentation and I can't figure out
 what I'm doing wrong.  Any suggestions?

 Thanks.



Re: How do I log from my map/reduce application?

2010-12-15 Thread W.P. McNeill
I'm running on a cluster.  I'm trying to write to the log files on the
cluster machines, the ones that are visible through the jobtracker web
interface.

The log4j file I gave excerpts from is a central one for the cluster.

On Wed, Dec 15, 2010 at 1:38 PM, Aaron Kimball akimbal...@gmail.com wrote:

 W. P.,

 How are you running your Reducer? Is everything running in standalone mode
 (all mappers/reducers in the same process as the launching application)? Or
 are you running this in pseudo-distributed mode or on a remote cluster?

 Depending on the application's configuration, log4j configuration could be
 read from one of many different places.

 Furthermore, where are you expecting your output? If you're running in
 pseudo-distributed (or fully distributed) mode, mapper / reducer tasks will
 not emit output back to the console of the launching application.  That
 only
 happens in local mode. In the distributed flavors, you'll see a different
 file for each task attempt containing its log output, on the machine where
 the task executed. These files can be accessed through the web UI at
 http://jobtracker:50030/ -- click on the job, then the task, then the task
 attempt, then syslog in the right-most column.

 - Aaron

 On Mon, Dec 13, 2010 at 10:05 AM, W.P. McNeill bill...@gmail.com wrote:

  I would like to use Hadoop's Log4j infrastructure to do logging from my
  map/reduce application.  I think I've got everything set up correctly,
 but
  I
  am still unable to specify the logging level I want.
 
  By default Hadoop is set up to log at level INFO.  The first line of its
  log4j.properties file looks like this:
 
  hadoop.root.logger=INFO,console
 
 
  I have an application whose reducer looks like this:
 
  package com.me;
 
  public class MyReducer... extends Reducer... {
private static Logger logger =
  Logger.getLogger(MyReducer.class.getName());
 
...
protected void reduce(...) {
logger.debug(My message);
...
}
  }
 
 
  I've added the following line to the Hadoop log4j.properties file:
 
  log4j.logger.com.me.MyReducer=DEBUG
 
 
  I expect the Hadoop system to log at level INFO, but my application to
 log
  at level DEBUG, so that I see My message in the logs for the reducer
  task.
   However, my application does not produce any log4j output.  If I change
  the
  line in my reducer to read logger.info(My message) the message does
 get
  logged, so somehow I'm failing to specify that log level for this class.
 
  I've also tried changing the log4j line for my app to
  read log4j.logger.com.me.MyReducer=DEBUG,console and get the same result.
 
  I've been through the Hadoop and log4j documentation and I can't figure
 out
  what I'm doing wrong.  Any suggestions?
 
  Thanks.
 



Re: How do I log from my map/reduce application?

2010-12-15 Thread Aaron Kimball
How is the central log4j file made available to the tasks? After you make
your changes to the configuration file, does it help if you restart the task
trackers?

You could also try setting the log level programmatically in your void
setup(Context) method:

@Override
protected void setup(Context context) {
  logger.setLevel(Level.DEBUG);
}

- Aaron

On Wed, Dec 15, 2010 at 2:23 PM, W.P. McNeill bill...@gmail.com wrote:

 I'm running on a cluster.  I'm trying to write to the log files on the
 cluster machines, the ones that are visible through the jobtracker web
 interface.

 The log4j file I gave excerpts from is a central one for the cluster.

 On Wed, Dec 15, 2010 at 1:38 PM, Aaron Kimball akimbal...@gmail.com
 wrote:

  W. P.,
 
  How are you running your Reducer? Is everything running in standalone
 mode
  (all mappers/reducers in the same process as the launching application)?
 Or
  are you running this in pseudo-distributed mode or on a remote cluster?
 
  Depending on the application's configuration, log4j configuration could
 be
  read from one of many different places.
 
  Furthermore, where are you expecting your output? If you're running in
  pseudo-distributed (or fully distributed) mode, mapper / reducer tasks
 will
  not emit output back to the console of the launching application.  That
  only
  happens in local mode. In the distributed flavors, you'll see a different
  file for each task attempt containing its log output, on the machine
 where
  the task executed. These files can be accessed through the web UI at
  http://jobtracker:50030/ -- click on the job, then the task, then the
 task
  attempt, then syslog in the right-most column.
 
  - Aaron
 
  On Mon, Dec 13, 2010 at 10:05 AM, W.P. McNeill bill...@gmail.com
 wrote:
 
   I would like to use Hadoop's Log4j infrastructure to do logging from my
   map/reduce application.  I think I've got everything set up correctly,
  but
   I
   am still unable to specify the logging level I want.
  
   By default Hadoop is set up to log at level INFO.  The first line of
 its
   log4j.properties file looks like this:
  
   hadoop.root.logger=INFO,console
  
  
   I have an application whose reducer looks like this:
  
   package com.me;
  
   public class MyReducer... extends Reducer... {
 private static Logger logger =
   Logger.getLogger(MyReducer.class.getName());
  
 ...
 protected void reduce(...) {
 logger.debug(My message);
 ...
 }
   }
  
  
   I've added the following line to the Hadoop log4j.properties file:
  
   log4j.logger.com.me.MyReducer=DEBUG
  
  
   I expect the Hadoop system to log at level INFO, but my application to
  log
   at level DEBUG, so that I see My message in the logs for the reducer
   task.
However, my application does not produce any log4j output.  If I
 change
   the
   line in my reducer to read logger.info(My message) the message does
  get
   logged, so somehow I'm failing to specify that log level for this
 class.
  
   I've also tried changing the log4j line for my app to
   read log4j.logger.com.me.MyReducer=DEBUG,console and get the same
 result.
  
   I've been through the Hadoop and log4j documentation and I can't figure
  out
   what I'm doing wrong.  Any suggestions?
  
   Thanks.
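
Putting the two suggestions above together, a minimal sketch of a reducer that
forces DEBUG for its own logger from setup() (new mapreduce API assumed; the
class and package names are placeholders, not the poster's actual code):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class DebugLoggingReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
  private static final Logger LOG = Logger.getLogger(DebugLoggingReducer.class);

  @Override
  protected void setup(Context context) {
    // Overrides whatever level the task JVM's log4j.properties configured.
    LOG.setLevel(Level.DEBUG);
  }

  @Override
  protected void reduce(Text key, Iterable<LongWritable> values, Context context)
      throws IOException, InterruptedException {
    LOG.debug("reducing key " + key);  // appears in the task attempt's syslog file
    long sum = 0;
    for (LongWritable value : values) {
      sum += value.get();
    }
    context.write(key, new LongWritable(sum));
  }
}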
  
 



Re: Deprecated ... damaged?

2010-12-15 Thread maha
Hi Allen and thanks for responding ..

   Your answer actually gave me another clue: I set numSplits = numFiles*100; 
in myInputFormat and it worked :D ... Do you think there are side effects of 
doing that?

   Thank you,

   Maha

On Dec 15, 2010, at 12:16 PM, Allen Wittenauer wrote:

 
 On Dec 15, 2010, at 2:13 AM, maha wrote:
 
 Hi everyone,
 
 Using Hadoop-0.20.2, I'm trying to use MultiFileInputFormat which is 
 supposed to put each file from the input directory in a SEPARATE split.
 
 
   Is there some reason you don't just use normal InputFormat with an 
 extremely high min.split.size?
 



Is it possible to change from Iterable<VALUEIN> to ResettableIterator<VALUEIN> in Reducer?

2010-12-15 Thread ChingShen
Hi all,

I just want to know is it possible to allow an iterator to be repeatedly
reused?

Shen
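
There is no resettable iterator in the standard Reducer API; a common
workaround (sketched below, under the assumption that one key's values fit in
memory) is to buffer the values yourself, remembering that Hadoop reuses the
same Writable instance between calls:

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;

public final class ValueBuffer {
  // Copies the values of one key into a list that can be iterated repeatedly.
  public static List<Text> buffer(Iterable<Text> values) {
    List<Text> cached = new ArrayList<Text>();
    for (Text value : values) {
      cached.add(new Text(value));  // deep copy, since the framework reuses the object
    }
    return cached;
  }
}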


Hadoop upgrade [Do we need to have same value for dfs.name.dir ] while upgrading

2010-12-15 Thread sandeep
 

 

HI ,

I am trying to upgrade hadoop. As part of this I have set two environment
variables, NEW_HADOOP_INSTALL and OLD_HADOOP_INSTALL.

After this I executed the following command:
% NEW_HADOOP_INSTALL/bin/start-dfs -upgrade

But the namenode did not start, as it was throwing an Inconsistent state
exception because dfs.name.dir is not present.

Here my question is: while upgrading, do we need to have the same old
configuration, like dfs.name.dir etc.?

Or do I need to format the namenode first and then start upgrading?

Please let me know

Thanks

sandeep

 



 



Re: Hadoop upgrade [Do we need to have same value for dfs.name.dir ] while upgrading

2010-12-15 Thread Adarsh Sharma

sandeep wrote:
 

 


HI ,

 


I am trying to upgrade hadoop ,as part of this i have set Two environment
variables NEW_HADOOP_INSTALL and OLD_HADOOP_INSTALL .

 


After this i have executed the following command %
NEW_HADOOP_INSTALL/bin/start-dfs -upgrade

 


But namenode didnot started as it was throwing Inconsistent state exception
as the dfs.name.dir is not present

 


Here My question is while upgrading do we need to have the same old
configurations like dfs.name.dir..etc

 


Or Do i need to format that namenode first and then start upgrading?

 


Please let me know

 


Thanks

sandeep

 




 



  
Sandeep, this error occurs due to a namespace issue in Hadoop. Did you 
copy dfs.name.dir and fs.checkpoint.dir to the new Hadoop directory?


A namenode format would cause you to lose all previous data.


Best Regards

Adarsh Sharma


Re: Question from a Desperate Java Newbie

2010-12-15 Thread edward choi
I totally obey the robots.txt since I am only fetching RSS feeds :-)
I implemented my crawler with HttpClient and it is working fine.
I often get messages about "Cookie rejected", but am able to fetch news
articles anyway.

I guess the default java.net client is the stateful client you mentioned.
Thanks for the tip!!

Ed

On Dec 16, 2010 at 2:18 AM, Steve Loughran ste...@apache.org wrote:

 On 10/12/10 09:08, Edward Choi wrote:
  I was wrong. It wasn't because of the read once free policy. I tried
 again with Java first again and this time it didn't work.
  I looked up google and found the Http Client you mentioned. It is the one
 provided by apache, right? I guess I will have to try that one now. Thanks!
 

 httpclient is good, HtmlUnit has a very good client that can simulate
 things like a full web browser with cookies, but that may be overkill.

 NYT's read once policy uses cookies to verify that you are there for the
 first day not logged in, for later days you get 302'd unless you delete
 the cookie, so stateful clients are bad.

 What you may have been hit by is whatever robot trap they have -if you
 generate too much load and don't follow the robots.txt rules they may
 detect this and push back
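
For anyone following along, a minimal sketch of a fetch with Apache
HttpClient 4.x (the class name and URL handling here are illustrative, not
taken from the thread):

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.util.EntityUtils;

public class FeedFetcher {
  // Fetches one feed URL and returns the response body as a string.
  public static String fetch(String url) throws Exception {
    HttpClient client = new DefaultHttpClient();
    try {
      HttpResponse response = client.execute(new HttpGet(url));
      return EntityUtils.toString(response.getEntity());
    } finally {
      client.getConnectionManager().shutdown();  // release the underlying connections
    }
  }
}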




Re: how to run jobs every 30 minutes?

2010-12-15 Thread edward choi
That clears the confusion. Thanks.
There are just too many tools for Hadoop :-)

2010/12/14 Alejandro Abdelnur t...@cloudera.com

 Ed,

 Actually Oozie is quite different from Cascading.

 * Cascading allows you to write 'queries' using a Java API and they get
 translated into MR jobs.
 * Oozie allows you compose sequences of MR/Pig/Hive/Java/SSH jobs in a DAG
 (workflow jobs) and has timer+data dependency triggers (coordinator jobs).

 Regards.

 Alejandro

 On Tue, Dec 14, 2010 at 1:26 PM, edward choi mp2...@gmail.com wrote:

  Thanks for the tip. I took a look at it.
  Looks similar to Cascading I guess...?
  Anyway thanks for the info!!
 
  Ed
 
  2010/12/8 Alejandro Abdelnur t...@cloudera.com
 
   Or, if you want to do it in a reliable way you could use an Oozie
   coordinator job.
  
   On Wed, Dec 8, 2010 at 1:53 PM, edward choi mp2...@gmail.com wrote:
My mistake. Come to think about it, you are right, I can just make an
infinite loop inside the Hadoop application.
Thanks for the reply.
   
2010/12/7 Harsh J qwertyman...@gmail.com
   
Hi,
   
On Tue, Dec 7, 2010 at 2:25 PM, edward choi mp2...@gmail.com
 wrote:
 Hi,

 I'm planning to crawl a certain web site every 30 minutes.
 How would I get it done in Hadoop?

 In pure Java, I used Thread.sleep() method, but I guess this won't
   work
in
 Hadoop.
   
Why wouldn't it? You need to manage your post-job logic mostly, but
sleep and resubmission should work just fine.
   
 Or if it could work, could anyone show me an example?

 Ed.

   
   
   
--
Harsh J
www.harshj.com
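
A minimal sketch of the "infinite loop plus sleep" approach discussed above
(new mapreduce API assumed; buildJob() is a placeholder for however the crawl
job is actually configured):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class EveryThirtyMinutes {
  public static void main(String[] args) throws Exception {
    while (true) {
      Job job = buildJob(new Configuration());
      job.waitForCompletion(true);      // submit and block until the job finishes
      Thread.sleep(30L * 60L * 1000L);  // wait 30 minutes, then run again
    }
  }

  // Placeholder: set the mapper, reducer, formats and paths for the crawl here.
  private static Job buildJob(Configuration conf) throws Exception {
    return new Job(conf, "crawl");
  }
}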
   
   
  
 



Re: how to run jobs every 30 minutes?

2010-12-15 Thread edward choi
This one doesn't seem so complex for even a newbie like myself. Thanks!!!

2010/12/14 Ted Dunning tdunn...@maprtech.com

 Or even simpler, try Azkaban: http://sna-projects.com/azkaban/

 On Mon, Dec 13, 2010 at 9:26 PM, edward choi mp2...@gmail.com wrote:

  Thanks for the tip. I took a look at it.
  Looks similar to Cascading I guess...?
  Anyway thanks for the info!!
 
  Ed
 
  2010/12/8 Alejandro Abdelnur t...@cloudera.com
 
   Or, if you want to do it in a reliable way you could use an Oozie
   coordinator job.
  
   On Wed, Dec 8, 2010 at 1:53 PM, edward choi mp2...@gmail.com wrote:
My mistake. Come to think about it, you are right, I can just make an
infinite loop inside the Hadoop application.
Thanks for the reply.
   
2010/12/7 Harsh J qwertyman...@gmail.com
   
Hi,
   
On Tue, Dec 7, 2010 at 2:25 PM, edward choi mp2...@gmail.com
 wrote:
 Hi,

 I'm planning to crawl a certain web site every 30 minutes.
 How would I get it done in Hadoop?

 In pure Java, I used Thread.sleep() method, but I guess this won't
   work
in
 Hadoop.
   
Why wouldn't it? You need to manage your post-job logic mostly, but
sleep and resubmission should work just fine.
   
 Or if it could work, could anyone show me an example?

 Ed.

   
   
   
--
Harsh J
www.harshj.com
   
   
  
 



RE: Hadoop upgrade [Do we need to have same value for dfs.name.dir ] while upgrading

2010-12-15 Thread sandeep
Thanks adarsh.

I have done the following: for NEW_HADOOP_INSTALL (the new hadoop version
installation) I have set the same values for dfs.name.dir and fs.checkpoint.dir
which I had configured in OLD_HADOOP_INSTALL (the old hadoop version
installation).

Now it is working

Thanks
sandeep



-Original Message-
From: Adarsh Sharma [mailto:adarsh.sha...@orkash.com] 
Sent: Thursday, December 16, 2010 11:42 AM
To: common-user@hadoop.apache.org
Subject: Re: Hadoop upgrade [Do we need to have same value for dfs.name.dir
] while upgrading

sandeep wrote:
  

  

 HI ,

  

 I am trying to upgrade hadoop ,as part of this i have set Two environment
 variables NEW_HADOOP_INSTALL and OLD_HADOOP_INSTALL .

  

 After this i have executed the following command %
 NEW_HADOOP_INSTALL/bin/start-dfs -upgrade

  

 But namenode didnot started as it was throwing Inconsistent state
exception
 as the dfs.name.dir is not present

  

 Here My question is while upgrading do we need to have the same old
 configurations like dfs.name.dir..etc

  

 Or Do i need to format that namenode first and then start upgrading?

  

 Please let me know

  

 Thanks

 sandeep

  




  


   
Sandeep This Error occurs due to new namespace issue in Hadoop. Did u 
copy dfs.name.dir and fs.checkpoint dir to new Hadoop  directory.

Namenode Format would cause u to loose all previous data.


Best Regards

Adarsh Sharma



Re: how to run jobs every 30 minutes?

2010-12-15 Thread edward choi
The first recommendation (gluing all my command line apps) is what I am
currently using.
The other ones you mentioned are just out of my league right now, since I am
quite new to Java world, not to mention JRuby, Groovy, Jython, etc.
But when I get comfortable with the environment and start to look for more
options I'll refer to your message. Thanks for the advanced info :-)

2010/12/15 Chris K Wensel ch...@wensel.net


 I see it this way.

 You can glue a bunch of discrete command line apps together that may or may
 not have dependencies between one another in a new syntax, which is darn
 nice if you already have a bunch of discrete ready-to-run command line apps
 sitting around that need to be strung together and that can't be used as
 libraries and instantiated through their APIs.

 Or, you can string all your work together through the APIs with a turing
 complete language and run them all from a single command line interface (and
 hand that to cron, or some other tool).

 In this case you can use Java, or easier languages like JRuby, Groovy,
 Jython, Clojure, etc which were designed for this purpose. (They don't run
 on the cluster, they only run Hadoop client side).

 Think ant vs graddle (or any other build tool that uses a scripting
 language and not a configuration file) if you want a concrete example.

 Cascading itself is a query API (and query planner). But it also exposes to
 the user the ability to run discrete 'processes' in dependency order for
 you. Either Cascading (Hadoop) Flows or Riffle annotated process objects.
 They all can be intermingled and managed from the same dependency scheduler.
 Cascading has one, and Riffle has one.

 So you can run Flow -> Mahout -> Pig -> Mahout -> Flow -> shell ->
 whattheheckever from the same application.

 Cascading also has the ability to only run 'stale' processes. Think 'make'
 file. When re-running a job where only one file of many has changed, this is
 a big win.

 I personally like parameterizing my applications via the command line and
 letting my cli options drive the workflows. For example, my testing,
 integration, and production environments are much different, so it's very easy to
 drive specific runs of the jobs by changing a cli arg. (args4j makes this
 darn simple)

 If I am chaining multiple CLI apps into a bigger production app,
 parameterizing that will, I suspect, be error prone, especially if the input/output
 data points (jdbc vs file) are different in different contexts.

 you can find Riffle here, https://github.com/cwensel/riffle  (its Apache
 Licensed, contributions welcomed)

 ckw

 On Dec 14, 2010, at 1:30 AM, Alejandro Abdelnur wrote:

  Ed,
 
  Actually Oozie is quite different from Cascading.
 
  * Cascading allows you to write 'queries' using a Java API and they get
  translated into MR jobs.
  * Oozie allows you compose sequences of MR/Pig/Hive/Java/SSH jobs in a
 DAG
  (workflow jobs) and has timer+data dependency triggers (coordinator
 jobs).
 
  Regards.
 
  Alejandro
 
  On Tue, Dec 14, 2010 at 1:26 PM, edward choi mp2...@gmail.com wrote:
 
  Thanks for the tip. I took a look at it.
  Looks similar to Cascading I guess...?
  Anyway thanks for the info!!
 
  Ed
 
  2010/12/8 Alejandro Abdelnur t...@cloudera.com
 
  Or, if you want to do it in a reliable way you could use an Oozie
  coordinator job.
 
  On Wed, Dec 8, 2010 at 1:53 PM, edward choi mp2...@gmail.com wrote:
  My mistake. Come to think about it, you are right, I can just make an
  infinite loop inside the Hadoop application.
  Thanks for the reply.
 
  2010/12/7 Harsh J qwertyman...@gmail.com
 
  Hi,
 
  On Tue, Dec 7, 2010 at 2:25 PM, edward choi mp2...@gmail.com
 wrote:
  Hi,
 
  I'm planning to crawl a certain web site every 30 minutes.
  How would I get it done in Hadoop?
 
  In pure Java, I used Thread.sleep() method, but I guess this won't
  work
  in
  Hadoop.
 
  Why wouldn't it? You need to manage your post-job logic mostly, but
  sleep and resubmission should work just fine.
 
  Or if it could work, could anyone show me an example?
 
  Ed.
 
 
 
 
  --
  Harsh J
  www.harshj.com
 
 
 
 

 --
 Chris K Wensel
 ch...@concurrentinc.com
 http://www.concurrentinc.com

 -- Concurrent, Inc. offers mentoring, support, and licensing for Cascading




How to Speed Up Decommissioning progress of a datanode.

2010-12-15 Thread sravankumar
Hi,

Does anyone know how to speed up datanode decommissioning, and
what are all the configurations related to decommissioning?

How to speed up data transfer from the datanode getting
decommissioned?

Thanks & Regards,

Sravan kumar.



Re: How to Speed Up Decommissioning progress of a datanode.

2010-12-15 Thread Adarsh Sharma

sravankumar wrote:

Hi,

 


Does any one know how to speed up datanode decommissioning and
what are all the configurations

related to the decommissioning.

How to Speed Up Data Transfer from the Datanode getting
decommissioned.

 


Thanks  Regards,

Sravan kumar.


Check the attachment

--Adarsh

Balancing Data among Datanodes : HDFS will not move blocks to new nodes 
automatically. However, newly created files will likely have their blocks 
placed on the new nodes. 


There are several ways to rebalance the cluster manually. 


-Select a subset of files that take up a good percentage of your disk space; 
copy them to new locations in HDFS; remove the old copies of the files; rename 
the new copies to their original names. 

-A simpler way, with no interruption of service, is to turn up the replication 
of files, wait for transfers to stabilize, and then turn the replication back 
down. 
-Yet another way to re-balance blocks is to turn off the data-node, which is 
full, wait until its blocks are replicated, and then bring it back again. The 
over-replicated blocks will be randomly removed from different nodes, so you 
really get them rebalanced not just removed from the current node. 

-Finally, you can use the bin/start-balancer.sh command to run a balancing 
process to move blocks around the cluster automatically. 


bash-3.2$ bin/start-balancer.sh 
or

$ bin/hadoop balancer -threshold 10

starting balancer, logging to 
/home/hadoop/project/hadoop-0.20.2/bin/../logs/hadoop-hadoop-balancer-ws-test.out
 

Time Stamp   Iteration#  Bytes Already Moved  Bytes Left To Move  
Bytes Being Moved 

The cluster is balanced. Exiting... 
Balancing took 350.0 milliseconds 

A cluster is balanced iff there are no under-capacity or over-capacity data 
nodes in the cluster.
An under-capacity data node is a node whose %used space is less than 
avg_%used_space - threshold.
An over-capacity data node is a node whose %used space is greater than 
avg_%used_space + threshold.
The threshold is user configurable; a default value could be 20% of used space.


Re: How to Speed Up Decommissioning progress of a datanode.

2010-12-15 Thread baggio liu
You can use metasave to check the bottleneck of decommission speed.
If the bottleneck is the speed of namenode dispatch, you can tune
dfs.max-repl-streams to a larger number (default 2).
If there are many timed-out block replication tasks going from the pending
replication queue back to the needed-replication queue, you can tune
dfs.replication.pending.timeout.sec to a smaller number, to make block
replication more aggressive.

Pay attention!!  Please check your hadoop version: if block transfer has no
speed limit, the bandwidth may be saturated.


Thanks & Best regards
Baggio


2010/12/16 sravankumar sravanku...@huawei.com

 Hi,



Does any one know how to speed up datanode decommissioning and
 what are all the configurations

 related to the decommissioning.

How to Speed Up Data Transfer from the Datanode getting
 decommissioned.



 Thanks  Regards,

 Sravan kumar.