Re: Problem running Hadoop 0.23.0

2011-11-28 Thread Tom White
Hi Nitin,

It looks like you may be using the wrong port number - try 8088 for
the resource manager UI.

Cheers,
Tom

On Mon, Nov 28, 2011 at 4:02 AM, Nitin Khandelwal
nitin.khandel...@germinait.com wrote:
 Hi,

 I was trying to setup Hadoop 0.23.0 with help of
 http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/SingleCluster.html.
 After starting resourcemanager and nodemanager, I get following error when
 i try to hit Hadoop UI: org.apache.hadoop.ipc.RPC$VersionMismatch: Server
 IPC version 5 cannot communicate with client version 47.
 There is no significant error in Hadoop logs (it shows everything started
 successfully).

 Do you have any idea about this error?

 Thanks,

 --

 Nitin Khandelwal



Re: cannot use distcp in some s3 buckets

2011-10-13 Thread Tom White
On Thu, Oct 13, 2011 at 2:06 PM, Raimon Bosch raimon.bo...@gmail.com wrote:
 By the way,

 The url I'm trying has a '_' in the bucket name. Could be this the problem?

Yes, underscores are not permitted in hostnames.

Cheers,
Tom


 2011/10/13 Raimon Bosch raimon.bo...@gmail.com

 Hi,

 I've been having some problems with one of our s3 buckets. I have asked on
 amazon support with no luck yet
 https://forums.aws.amazon.com/thread.jspa?threadID=78001.

 I'm getting this exception only with our oldest s3 bucket with this
 command: hadoop distcp s3://MY_BUCKET_NAME/logfile-20110815.gz
 /tmp/logfile-20110815.gz

 java.lang.IllegalArgumentException: Invalid hostname in URI
 s3://MY_BUCKET_NAME/logfile-20110815.gz /tmp/logfile-20110815.gz
 at org.apache.hadoop.fs.s3.S3Credentials.initialize(S3Credentials.java:41)
 at
 org.apache.hadoop.fs.s3.Jets3tFileSystemStore.initialize(Jets3tFileSystemStore.java:82)

 As you can see, hadoop is rejecting my url before starting to do the
 authorization steps. Someone has been in a similar issue? I have already
 tested the same operation in newer s3 buckets and the command is working
 correctly.

 Thanks in advance,
 Raimon Bosch.






Re: updated example

2011-10-11 Thread Tom White
JobConf and the old API are no longer deprecated in the forthcoming
0.20.205 release, so you can continue to use it without issue.

The equivalent in the new API is setInputFormatClass() on
org.apache.hadoop.mapreduce.Job.
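For example, a rough (untested) sketch of the new-API driver - TextInputFormat
is only a placeholder here, substitute whichever XML-aware input format you end
up using:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

  Configuration conf = new Configuration();
  Job job = new Job(conf, "xml example");
  // new-API equivalent of JobConf.setInputFormat()
  job.setInputFormatClass(TextInputFormat.class);
  FileInputFormat.addInputPath(job, new Path("input"));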

Cheers,
Tom

On Tue, Oct 11, 2011 at 9:18 AM, Keith Thompson kthom...@binghamton.edu wrote:
 I see that the JobConf class used in the WordCount tutorial is deprecated
 for the Configuration class.  I am wanting to change the file input format
 (to the StreamInputFormat for XML as in Hadoop: The Definitive Guide pp.
 212-213) but I don't see a setInputFormat method in the Configuration class
 as there was in the JobConf class.  Is there an updated example using the
 non-deprecated classes and methods?  I have searched but not found one.

 Regards,
 Keith



Re: Distributed cluster filesystem on EC2

2011-08-31 Thread Tom White
You might consider Apache Whirr (http://whirr.apache.org/) for
bringing up Hadoop clusters on EC2.

Cheers,
Tom

On Wed, Aug 31, 2011 at 8:22 AM, Robert Evans ev...@yahoo-inc.com wrote:
 Dmitry,

 It sounds like an interesting idea, but I have not really heard of anyone 
 doing it before.  It would make for a good feature to have tiered file 
 systems all mapped into the same namespace, but that would be a lot of work 
 and complexity.

 The quick solution would be to know what data you want to process before hand 
 and then run distcp to copy it from S3 into HDFS before launching the other 
 map/reduce jobs.  I don't think there is anything automatic out there.

 --Bobby Evans

 On 8/29/11 4:56 PM, Dmitry Pushkarev u...@stanford.edu wrote:

 Dear hadoop users,

 Sorry for the off-topic. We're slowly migrating our hadoop cluster to EC2,
 and one thing that I'm trying to explore is whether we can use alternative
 scheduling systems like SGE with shared FS for non data intensive tasks,
 since they are easier to work with for lay users.

 One problem for now is how to create shared cluster filesystem similar to
 HDFS, distributed with high-performance, somewhat POSIX compliant (symlinks
 and permissions), that will use amazon EC2 local nonpersistent storage.

 Idea is to keep original data on S3, then as needed fire up a bunch of
 nodes, start shared filesystem, and quickly copy data from S3 to that FS,
 run the analysis with SGE, save results and shut down that filesystem.
 I tried things like S3FS and similar native S3 implementation but speed is
 too bad. Currently I just have a FS on my master node that is shared via NFS
 to all the rest, but I pretty much saturate 1GB bandwidth as soon as I start
 more than 10 nodes.

 Thank you. I'd appreciate any suggestions and links to relevant resources!.


 Dmitry




Re: 0.21.0 - Java Class Error

2011-04-08 Thread Tom White
Hi Witold,

Is this on Windows? The scripts were re-structured after Hadoop 0.20,
and looking at them now I notice that the cygwin path translation for
the classpath seems to be missing. You could try adding the following
line to the "if $cygwin" clause in bin/hadoop-config.sh:

  CLASSPATH=`cygpath -p -w "$CLASSPATH"`

It's worth filing a bug for this too.

Cheers,
Tom

On Thu, Apr 7, 2011 at 1:24 PM, Witold Januszewski wit...@skni.org wrote:
 To Whom It May Concern,

 When trying to run Hadoop 0.21 with JDK 1.6_23 I get an error:
 java.lang.NoClassDefFoundError: org/apache/hadoop/util/PlatformName.
 The full error log is in the attached .png
 Can you help me? I'd be grateful.


 Yours faithfully,
 Witold Januszewski





Re: hadoop installation problem(single-node)

2011-03-02 Thread Tom White
The instructions at
http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html should be
what you need.

Cheers,
Tom

On Wed, Mar 2, 2011 at 12:59 AM, Manish Yadav manish.ya...@orkash.com wrote:
 Dear Sir/Madam
  I'm very new to hadoop. I'm trying to install hadoop on my computer. I
 followed a weblink and try to install it. I want to install hadoop on my
 single node cluster.
 i 'm using Ubuntu 10.04 64-bit as my operating system . I have installed
 java in /usr/java/jdk1.6.0_24. the step i take to install hadoop are
 following

 1: Make a group hadoop and a user hadoop with home directory
 in hadoop directory i have a directory called projects and download hadoop
 binary there than extract them there;
 i configured the ssh also.

 than i made changes to some file which are following. i'm attaching them
 with this mail, please check them.
 1: hadoop_env_sh
 2:core-site.xml
 3mapreduce-site.xml
 4 hdfs-site. xml
 5 hadoop's usre .bashrc
 6 hadoop'user .profile

  After making changes to these files, I just enter the hadoop account and
 enter the  few command following thing happen :

 hadoop@ws40-man-lin:~$ echo $HADOOP_HOME
 /home/hadoop/project/hadoop-0.20.0
 hadoop@ws40-man-lin:~$ hadoop namenode -format
 hadoop: command not found
 hadoop@ws40-man-lin:~$ namenode -format
 namenode: command not found
 hadoop@ws40-man-lin:~$

 now I'm completely stuck i don't know what to do? please help me as there is
 no more help around the net.
 i' m attaching the files also which i changed can u tell me the exact
 configuration which i should use to install hadoop.





Re: Missing files in the trunk ??

2011-02-28 Thread Tom White
These files are generated files. If you run 'ant avro-generate eclipse'
then Eclipse should find these files.

Cheers,
Tom

On Mon, Feb 28, 2011 at 2:43 AM, bharath vissapragada
bharathvissapragada1...@gmail.com wrote:
 Hi all,

 I checked out the map-reduce trunk a few days back  and following
 files are missing..

 import org.apache.hadoop.mapreduce.jobhistory.Events;
 import org.apache.hadoop.mapreduce.jobhistory.JhCounter;
 import org.apache.hadoop.mapreduce.jobhistory.JhCounterGroup;
 import org.apache.hadoop.mapreduce.jobhistory.JhCounters;

 ant jar works  well but eclipse finds these files missing in the
 corresponding packages ..

 I browsed the trunk online but couldn't trace these files..

 Any help is highly appreciated :)

 --
 Regards,
 Bharath .V
 w:http://research.iiit.ac.in/~bharath.v



Re: 0.21 found interface but class was expected

2010-11-15 Thread Tom White
Hi Steve,

Sorry to hear about the problems you had. The issue you hit was a
result of MAPREDUCE-954, and there was some discussion on that JIRA
about compatibility. I believe the thinking was that the context
classes are framework classes, so users don't extend/implement them in
the normal course of use, and it's also understood that users would
recompile their apps (i.e. source compatibility). However, tools like
MRUnit which extend/implement these classes do need to be updated when
a change like this happens.

We tried hard to make 0.21 as backwards compatible with 0.20 as
possible, a big part of which was going through all the APIs and
annotating their audience and stability (see
http://developer.yahoo.com/blogs/hadoop/posts/2010/05/towards_enterpriseclass_compat/
for background). The new MapReduce API (in
org.apache.hadoop.mapreduce), which is what we are talking about here,
is not yet declared stable (unlike the old API) and these classes are
marked with @InterfaceStability.Evolving to show that they can change
even between minor releases. I think we could improve visibility to
users by publishing a list of incompatible changes in the API for each
release - so I've opened HADOOP-7035 for this.

Cheers,
Tom

On Sun, Nov 14, 2010 at 7:41 AM, Konstantin Boudnik c...@apache.org wrote:
 Oh, thank you Todd! For a second there I thought that Hadoop developers have
 promised a full binary compatibility (in true Solaris sense of the word).

 Now I understand that such thing never been promised. Even though Hadoop
 haven't come over 'major' version change yet.

 Seriously. Steve, you are talking about a living and breathing system here. To
 the best of my understanding the first stable Hadoop version was supposed to be
 1.0 - a major version according to your own terms. Which apparently hasn't come
 around yet.

 Now, what exactly you are frustrated about?
  Cos

 On Sat, Nov 13, 2010 at 06:50PM, Todd Lipcon wrote:
 We do have policies against breaking APIs between consecutive major versions
 except for very rare exceptions (eg UnixUserGroupInformation went away when
 security was added).

 We do *not* have any current policies that existing code can work against
 different major versions without a recompile in between. Switching an
 implementation class to an interface is a case where a simple recompile of
 the dependent app should be sufficient to avoid issues. For whatever reason,
 the JVM bytecode for invoking an interface method (invokeinterface) is
 different than invoking a virtual method in a class (invokevirtual).

 -Todd

 On Sat, Nov 13, 2010 at 5:28 PM, Lance Norskog goks...@gmail.com wrote:

  It is considered good manners :)
 
  Seriously, if you want to attract a community you have an obligation
  to tell them when you're going to jerk the rug out from under their
  feet.
 
  On Sat, Nov 13, 2010 at 3:27 PM, Konstantin Boudnik c...@apache.org
  wrote:
   It doesn't answer my question. I guess I will have to look for the answer
  somewhere else
  
   On Sat, Nov 13, 2010 at 03:22PM, Steve Lewis wrote:
   Java libraries are VERY reluctant to change major classes in a way that
   breaks backward compatability -
   NOTE that while the 0.18 packages are  deprecated, they are separate
  from
   the 0.20 packages allowing
   0.18 code to run on 0.20 systems - this is true of virtually all Java
   libraries
  
   On Sat, Nov 13, 2010 at 3:08 PM, Konstantin Boudnik c...@apache.org
  wrote:
  
As much as I love ranting I can't help but wonder if there were any
promises
to make 0.21+ be backward compatible with 0.20 ?
   
Just curious?
   
On Sat, Nov 13, 2010 at 02:50PM, Steve Lewis wrote:
 I have a long rant at http://lordjoesoftware.blogspot.com/ on this
  but
  the moral is that there seems to have been a deliberate decision that
  0.20 code may not be compatible with -
 I have NEVER seen a major library so directly abandon backward
compatability


 On Fri, Nov 12, 2010 at 8:04 AM, Sebastian Schoenherr 
 sebastian.schoenh...@student.uibk.ac.at wrote:

  Hi Steve,
  we had a similar problem. We've compiled our code with version
  0.21 but
  included the wrong jars into the classpath. (version 0.20.2;
  NInputFormat.java). It seems that Hadoop changed this class to an
interface,
  maybe you've a simliar problem.
  Hope this helps.
  Sebastian
 
 
  Zitat von Steve Lewis lordjoe2...@gmail.com:
 
 
   Cassandra sees this error with 0.21 of hadoop
 
  Exception in thread main
  java.lang.IncompatibleClassChangeError:
Found
  interface org.apache.hadoop.mapreduce.JobContext, but class was
expected
 
  I see something similar
  Error: Found interface
org.apache.hadoop.mapreduce.TaskInputOutputContext,
  but class was expected
 
  I find this especially puzzling
  since org.apache.hadoop.mapreduce.TaskInputOutputContext IS a
  class
not 

Re: How to stop a mapper within a map-reduce job when you detect bad input

2010-10-21 Thread Tom White
On Thu, Oct 21, 2010 at 8:23 AM, ed hadoopn...@gmail.com wrote:
 Hello,

 The MapRunner classes looks promising.  I noticed it is in the deprecated
 mapred package but I didn't see an equivalent class in the mapreduce
 package.  Is this going to ported to mapreduce or is it no longer being
 supported?  Thanks!

The equivalent functionality is in org.apache.hadoop.mapreduce.Mapper#run.
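Something along these lines should let a task skip the rest of a bad split
instead of failing - an untested sketch; where the EOFException actually
surfaces depends on the record reader you use:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class SkipBadInputMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    public void run(Context context) throws IOException, InterruptedException {
      setup(context);
      try {
        // same loop as the default Mapper#run implementation
        while (context.nextKeyValue()) {
          map(context.getCurrentKey(), context.getCurrentValue(), context);
        }
      } catch (IOException e) {
        // e.g. a truncated gzip stream - log it and give up on this split
        System.err.println("Skipping rest of split: " + e);
      } finally {
        cleanup(context);
      }
    }
  }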

Cheers
Tom


 ~Ed

 On Thu, Oct 21, 2010 at 6:36 AM, Harsh J qwertyman...@gmail.com wrote:

 If it occurs eventually as your record reader reads it, then you may
 use a MapRunner class instead of a Mapper IFace/Subclass. This way,
 you may try/catch over the record reader itself, and call your map
 function only on valid next()s. I think this ought to work.

 You can set it via JobConf.setMapRunnerClass(...).

 Ref: MapRunner API @

 http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/MapRunner.html

 On Wed, Oct 20, 2010 at 4:14 AM, ed hadoopn...@gmail.com wrote:
  Hello,
 
  I have a simple map-reduce job that reads in zipped files and converts
 them
  to lzo compression.  Some of the files are not properly zipped which
 results
  in Hadoop throwing an java.io.EOFException: Unexpected end of input
 stream
  error and causes the job to fail.  Is there a way to catch this
 exception
  and tell hadoop to just ignore the file and move on?  I think the
 exception
  is being thrown by the class reading in the Gzip file and not my mapper
  class.  Is this correct?  Is there a way to handle this type of error
  gracefully?
 
  Thank you!
 
  ~Ed
 



 --
 Harsh J
 www.harshj.com




Re: Exception in thread main java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FsShell

2010-09-15 Thread Tom White
Hi Mike,

What do you get if you type ./hadoop classpath? Does it contain the
Hadoop common JAR?

To avoid the deprecation warning you should use hadoop fs, not hadoop dfs.

Tom

On Wed, Sep 15, 2010 at 12:53 PM, Mike Franon kongfra...@gmail.com wrote:
 Hi,

 I just setup 3 node hadoop cluster using the latest version from website ,
 0.21.0

 I am able to start all the daemons, when I run jps I see datanode, namenode,
 secondary, tasktracker, but I was running a test and trying to run the
 following command: ./hadoop dfs -ls, and I get the following error:

 DEPRECATED: Use of this script to execute hdfs command is deprecated.
 Instead use the hdfs command for it.

 Exception in thread main java.lang.NoClassDefFoundError:
 org/apache/hadoop/fs/FsShell
 Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.fs.FsShell
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
 Could not find the main class: org.apache.hadoop.fs.FsShell.  Program will
 exit.

 If i try this command instead:

 ./hadoop hdfs -ls
 Exception in thread main java.lang.NoClassDefFoundError: hdfs
 Caused by: java.lang.ClassNotFoundException: hdfs
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:248)
 Could not find the main class: hdfs.  Program will exit.

 Does anyone know what the command really is I should be using?

 Thanks



Re: Hadoop 0.21.0 release Maven repo

2010-09-10 Thread Tom White
Hi Sonal,

The 0.21.0 jars are not available in Maven yet, since the process for
publishing them post split has changed.
See HDFS-1292 and MAPREDUCE-1929.

Cheers,
Tom

On Fri, Sep 10, 2010 at 1:33 PM, Sonal Goyal sonalgoy...@gmail.com wrote:
 Hi,

 Can someone please point me to the Maven repo for 0.21 release? Thanks.

 Thanks and Regards,
 Sonal
 www.meghsoft.com
 http://in.linkedin.com/in/sonalgoyal



Re: Ivy

2010-09-03 Thread Tom White
The 0.21.0 jars are not in the Apache Maven repos yet, since the
process for publishing them post split has changed. HDFS-1292 and
MAPREDUCE-1929 are the tickets to fix this.

Cheers,
Tom

On Sat, Aug 28, 2010 at 9:10 PM, Mark static.void@gmail.com wrote:
  On 8/27/10 9:25 AM, Owen O'Malley wrote:

 On Aug 27, 2010, at 8:04 AM, Mark wrote:

 Is there a public ivy repo that has the latest hadoop? Thanks

 The hadoop jars and poms should be pushed into the central Maven
 repositories, which Ivy uses.

 -- Owen

 I am looking for the latest version 0.21.0 so our team can build Map/Reduce
 classes against it



[ANNOUNCE] Apache Hadoop 0.21.0 released

2010-08-24 Thread Tom White
Hi everyone,

I am pleased to announce that Apache Hadoop 0.21.0 is available for
download from http://hadoop.apache.org/common/releases.html.

Over 1300 issues have been addressed since 0.20.2; you can find details at

http://hadoop.apache.org/common/docs/r0.21.0/releasenotes.html
http://hadoop.apache.org/hdfs/docs/r0.21.0/releasenotes.html
http://hadoop.apache.org/mapreduce/docs/r0.21.0/releasenotes.html

Please note that this release has not undergone testing at scale and
should not be considered stable or suitable for production. It is
being classified as a minor release, which means that it should be API
compatible with 0.20.2.

Thanks to all who contributed to this release!

Tom


Re: Implementing S3FileSystem#append

2010-08-12 Thread Tom White
Hi Oleg,

I don't know of any plans to implement this. However, since this is a
block-based storage system which uses S3, I wonder whether an
implementation could use some of the logic in HDFS for block storage
and append in general.

Cheers,
Tom

On Thu, Aug 12, 2010 at 8:34 AM, Aleshko, Oleg
o.ales...@itransition.com wrote:
 Hi!

 Is there any plans on implementing append function for S3 file system?

 I'm currently considering using it for implementation of resume upload 
 functionality. The other option would be to use EBS, but it looks like an 
 overkill.

 Thanks,
 Oleg.



Re: Hadoop 0.21 :: job.getCounters() returns null?

2010-07-07 Thread Tom White
Hi Felix,

Aaron Kimball hit the same problem - it's being discussed at
https://issues.apache.org/jira/browse/MAPREDUCE-1920.

Thanks for reporting this.

Cheers,
Tom

On Tue, Jul 6, 2010 at 11:26 AM, Felix Halim felix.ha...@gmail.com wrote:
 I tried hadoop 0.21 release candidate.

 job.waitForCompletion(true);
 Counters ctrs = job.getCounters();
 // here ctrs is null


 In the previous hadoop version 0.20.2 it worked fine for all times.

 Is this a bug in 0.21 ?
 Or i'm missing some settings?

 Thanks,

 Felix Halim



Re: Next Release of Hadoop version number and Kerberos

2010-07-07 Thread Tom White
Hi Ananth,

The next release of Hadoop will be 0.21.0, but it won't have Kerberos
authentication in it (since it's not all in trunk yet). The 0.22.0
release later this year will have a working version of security in it.

Cheers,
Tom

On Wed, Jul 7, 2010 at 8:09 AM, Ananth Sarathy
ananth.t.sara...@gmail.com wrote:

 is the next release of Hadoop going to .21 or .22? I was just wondering,
 cause I am hearing conflicting things about the next release having Kerberos
 security but looking through some past emails, hearing that it was coming in
 .22.


 Ananth T Sarathy


Re: Cloudera EC2 scripts

2010-05-28 Thread Tom White
Hi Mark,

You can find the latest version of the scripts at
http://archive.cloudera.com/cdh/3/hadoop-0.20.2+228.tar.gz.
Documentation is at http://archive.cloudera.com/docs/ec2.html.

The source code is currently in src/contrib/cloud in Hadoop Common,
but is in the process of moving to a new Incubator project called
Whirr (see http://incubator.apache.org/projects/whirr.html).

Cheers,
Tom

On Thu, May 27, 2010 at 10:11 PM, Mark Kerzner markkerz...@gmail.com wrote:
 That would be fine, but where is the link to get them

 On Fri, May 28, 2010 at 12:10 AM, Andrew Nguyen 
 andrew-lists-had...@ucsfcti.org wrote:

 I didn't have any problems using the scripts that are in CDH3 (beta, March
 2010) to bring up and tear down Hadoop cluster instances with EC2.

 I think there were some differences between the documentation and the
 actual scripts but it's been a few weeks and I don't have access to my notes
 right now to see what they were.

 --Andrew

 On May 27, 2010, at 9:31 PM, Mark Kerzner wrote:

  Hi,
 
  I was using the beta version of Cloudera scripts from a while back, and I
  think there is a stable version, but I can't find it. It tells me to go
  download a Hadoop distribution, and there I can't find cloudera scripts.
 I
  do see something there, hadoop-0.18.3/src/contrib/ec2/bin, but it does
 not
  look right. Is it me?
 
  Thank you,
  Mark





Re: problem w/ data load

2010-05-03 Thread Tom White
Hi Susanne,

Hadoop uses the file extension to detect that a file is compressed. I
believe Hive does too. Did you store the compressed file in HDFS with
a .gz extension?
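For reference, this is roughly how the lookup works on the Hadoop side - a
small sketch, assuming a path ending in .gz:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.compress.CompressionCodec;
  import org.apache.hadoop.io.compress.CompressionCodecFactory;

  Configuration conf = new Configuration();
  CompressionCodecFactory factory = new CompressionCodecFactory(conf);
  // returns the gzip codec for a .gz suffix, or null if no codec matches
  CompressionCodec codec = factory.getCodec(new Path("/test/file.gz"));

If no codec is found the compressed bytes are read as-is, which would explain
the NULLs you're seeing.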

Cheers,
Tom

BTW It's best to send Hive questions like these to the hive-user@ list.

On Sun, May 2, 2010 at 11:22 AM, Susanne Lehmann
susanne.lehm...@metamarketsgroup.com wrote:
 Hi,

 I want to load data from HDFS to Hive, the data is in compressed files.
 The data is stored in flat files, the delimiter is ^A (ctrl-A).
 As long as I use de-compressed files everything is working fine. Since
 ctrl-A is the default delimiter I even don't need a specification for
 it.  I do the following:


 hadoop dfs -put /test/file new

 hive> DROP TABLE test_new;
 OK
 Time taken: 0.057 seconds
 hive> CREATE TABLE test_new(
            bla  int,
            bla            string,
            etc
            bla      string);
 OK
 Time taken: 0.035 seconds
 hive> LOAD DATA INPATH '/test/file' INTO TABLE test_new;
 Loading data to table test_new
 OK
 Time taken: 0.063 seconds

 But if I do the same with the same file compressed it's not working
 anymore. I tried tons of different table definitions with the
 delimiter specified, but it doesn't go. The load itself works, but the
 data is always NULL, so there is a delimiter problem I conclude.

  Any help is greatly appreciated!



Re: conf.get(map.input.file) returns null when using MultipleInputs in Hadoop 0.20

2010-04-29 Thread Tom White
Hi Yuanyuan,

I think you've found a bug - could you file a JIRA issue for this please?

Thanks,
Tom

On Wed, Apr 28, 2010 at 11:04 PM, Yuanyuan Tian yt...@us.ibm.com wrote:


 I have a problem in getting the input file name in the mapper  when uisng
 MultipleInputs. I need to use MultipleInputs to support different formats
 for my inputs to the my MapReduce job. And inside each mapper, I also need
 to know the exact input file that the mapper is processing. However,
 conf.get(map.input.file) returns null. Can anybody help me solve this
 problem? Thanks in advance.

 public class Test extends Configured implements Tool{

        static class InnerMapper extends MapReduceBase implements
 Mapper<Writable, Writable, NullWritable, Text>
        {
                
                

                public void configure(JobConf conf)
                {
                        String inputName = conf.get("map.input.file");
                        ...
                }

        }

        public int run(String[] arg0) throws Exception {
                JobConf job;
                job = new JobConf(Test.class);
                ...

                MultipleInputs.addInputPath(conf, new Path("A"),
 TextInputFormat.class);
                MultipleInputs.addInputPath(conf, new Path("B"),
 SequenceFileFormat.class);
                ...
        }
 }

 Yuanyuan


Re: conf.get(map.input.file) returns null when using MultipleInputs in Hadoop 0.20

2010-04-29 Thread Tom White
Hi Yuanyuan,

Thanks for filing an issue. To work around the issue could you use a
regular FileInputFormat in a set of map-only jobs (which can read the
input file names) so you can create a common input for a final MR job?
This is admittedly less efficient since it needs more jobs.
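Something like this for each input, using the old (org.apache.hadoop.mapred)
API - an untested sketch, and the mapper and path names are just placeholders:

  JobConf preJob = new JobConf(Test.class);
  preJob.setInputFormat(TextInputFormat.class);    // plain format, so
                                                   // map.input.file gets set
  FileInputFormat.addInputPath(preJob, new Path("A"));
  FileOutputFormat.setOutputPath(preJob, new Path("A-common"));
  preJob.setMapperClass(ConvertAMapper.class);     // reads conf.get("map.input.file")
  preJob.setNumReduceTasks(0);                     // map-only
  JobClient.runJob(preJob);

Then run the final MR job with a single input format over the converted outputs.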

Cheers,
Tom

On Thu, Apr 29, 2010 at 10:37 AM, Yuanyuan Tian yt...@us.ibm.com wrote:

 Hi Tom,

 I have file a JIRA ticket (MAPREDUCE-1743) for this issue. At the mean time, 
 can you suggest an alternative approach to achieve what I want (supporting 
 different input formats and get the input file name in each mapper)?

 Yuanyuan

 Tom White ---04/29/2010 09:42:44 AM---Hi Yuanyuan, I think you've found a bug 
 - could you file a JIRA issue for this please?


 From:
 Tom White t...@cloudera.com
 To:
 common-user@hadoop.apache.org
 Date:
 04/29/2010 09:42 AM
 Subject:
 Re: conf.get(map.input.file) returns null when using MultipleInputs in 
 Hadoop 0.20
 


 Hi Yuanyuan,

 I think you've found a bug - could you file a JIRA issue for this please?

 Thanks,
 Tom

 On Wed, Apr 28, 2010 at 11:04 PM, Yuanyuan Tian yt...@us.ibm.com wrote:
 
 
  I have a problem in getting the input file name in the mapper  when uisng
  MultipleInputs. I need to use MultipleInputs to support different formats
  for my inputs to the my MapReduce job. And inside each mapper, I also need
  to know the exact input file that the mapper is processing. However,
  conf.get(map.input.file) returns null. Can anybody help me solve this
  problem? Thanks in advance.
 
  public class Test extends Configured implements Tool{
 
         static class InnerMapper extends MapReduceBase implements
  MapperWritable, Writable, NullWritable, Text
         {
                 
                 
 
                 public void configure(JobConf conf)
                 {
                         String inputName=conf.get(map.input.file));
                         ...
                 }
 
         }
 
         public int run(String[] arg0) throws Exception {
                 JonConf job;
                 job = new JobConf(Test.class);
                 ...
 
                 MultipleInputs.addInputPath(conf, new Path(A),
  TextInputFormat.class);
                 MultipleInputs.addInputPath(conf, new Path(B),
  SequenceFileFormat.class);
                 ...
         }
  }
 
  Yuanyuan




Re: File permissions on S3FileSystem

2010-04-22 Thread Tom White
Hi Danny,

S3FileSystem has no concept of permissions, which is why this check
fails. The change that introduced the permissions check was introduced
in https://issues.apache.org/jira/browse/MAPREDUCE-181. Could you file
a bug for this please?

Cheers,
Tom

On Thu, Apr 22, 2010 at 4:16 AM, Danny Leshem dles...@gmail.com wrote:
 Hello,

 I'm running a Hadoop cluster using 3 small Amazon EC2 machines and the
 S3FileSystem.
 Till lately I've been using 0.20.2 and everything was ok.

 Now I'm using the latest trunc 0.22.0-SNAPSHOT and getting the following
 thrown:

 Exception in thread main java.io.IOException: The ownership/permissions on
 the staging directory
 s3://my-s3-bucket/mnt/hadoop.tmp.dir/mapred/staging/root/.staging is not as
 expected. It is owned by  and permissions are rwxrwxrwx. The directory must
 be owned by the submitter root or by root and permissions must be rwx--
    at
 org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmissionFiles.java:107)
    at
 org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:312)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:961)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:977)
    at com.mycompany.MyJob.runJob(MyJob.java:153)
    at com.mycompany.MyJob.run(MyJob.java:177)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at com.mycompany.MyOtherJob.runJob(MyOtherJob.java:62)
    at com.mycompany.MyOtherJob.run(MyOtherJob.java:112)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at com.mycompany.MyOtherJob.main(MyOtherJob.java:117)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:187)

 (The it is owned by ... and permissions  is not a mistake, seems like the
 empty string is printed there)

 My configuration is as follows:

 core-site:
 fs.default.name=s3://my-s3-bucket
 fs.s3.awsAccessKeyId=[key id omitted]
 fs.s3.awsSecretAccessKey=[secret key omitted]
 hadoop.tmp.dir=/mnt/hadoop.tmp.dir

 hdfs-site: empty

 mapred-site:
 mapred.job.tracker=[domU-XX-XX-XX-XX-XX-XX.compute-1.internal:9001]
 mapred.map.tasks=6
 mapred.reduce.tasks=6

 Any help would be appreciated...

 Best,
 Danny



Re: JobConf.setJobEndNotificationURI

2010-03-23 Thread Tom White
I think you can set the URI on the configuration object with the key
JobContext.END_NOTIFICATION_URL.
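For example - an untested sketch, where the URL is just a placeholder; $jobId
and $jobStatus are substituted by the framework, as with the old
mapred.job.end.notification.url property:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.JobContext;

  Job job = new Job(new Configuration(), "my job");
  job.getConfiguration().set(JobContext.END_NOTIFICATION_URL,
      "http://example.com/notify?id=$jobId&status=$jobStatus");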

Cheers,
Tom

On Tue, Feb 23, 2010 at 12:02 PM, Ted Yu yuzhih...@gmail.com wrote:
 Hi,
 I am looking for counterpart to JobConf.setJobEndNotificationURI() in
 org.apache.hadoop.mapreduce

 Please advise.

 Thanks



Re: Cloudera AMIs

2010-03-15 Thread Tom White
Hi Sonal,

You should use the one with the later date. The Cloudera AMIs don't
actually have Hadoop installed on them, just Java and some other base
packages. Hadoop is installed at start up time; you can find more
information at http://archive.cloudera.com/docs/ec2.html.

Cheers,
Tom

P.S. For Cloudera-specific questions please consider using the
Cloudera forum at http://getsatisfaction.com/cloudera

On Sun, Mar 14, 2010 at 7:03 AM, Sonal Goyal sonalgoy...@gmail.com wrote:
 Hi,

 I want to know which Cloudera AMI supports which Hadoop version. For
 example,

 ami-2932d440:cloudera-ec2-hadoop-images/cloudera-hadoop-ubuntu-20090602-i386.manifest.xml


 ami-ed59bf84:
 cloudera-ec2-hadoop-images/cloudera-hadoop-ubuntu-20090623-i386.manifest.xml

 Whats the difference between the two? Which Hadoop version do they support?
 I need to use the 0.20+ release.


 Thanks and Regards,
 Sonal



Re: Is it possible to share a key across maps?

2010-01-14 Thread Tom White
Please submit a patch for the documentation change - perhaps at
https://issues.apache.org/jira/browse/HADOOP-5973.

Cheers,
Tom

On Wed, Jan 13, 2010 at 12:09 AM, Amogh Vasekar am...@yahoo-inc.com wrote:
 +1 for the documentation change in mapred-tutorial. Can we do that and 
 publish using a normal apache account?

 Thanks,
 Amogh


 On 1/13/10 2:29 AM, Raymond Jennings III raymondj...@yahoo.com wrote:

 Amogh,
 You bet it helps!  Thanks!  Sometimes it's very difficult to map between the 
 old and the new APIs.  I was digging for that answer for awhile.  Thanks.

 --- On Tue, 1/12/10, Amogh Vasekar am...@yahoo-inc.com wrote:

 From: Amogh Vasekar am...@yahoo-inc.com
 Subject: Re: Is it possible to share a key across maps?
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org, 
 raymondj...@yahoo.com raymondj...@yahoo.com, 
 core-u...@hadoop.apache.org core-u...@hadoop.apache.org
 Date: Tuesday, January 12, 2010, 3:32 PM



 (Sorry for the spam if any, mails
 are bouncing back for me)



 Hi,

 In setup() use this,

 FileSplit split = (FileSplit)context.getInputSplit();

  split.getPath() will return you the Path.

 Hope this helps.



 Amogh





 On 1/13/10 1:25 AM, Raymond Jennings III raymondj...@yahoo.com wrote:



 Hi Gang,

 I was able to use this on an older version that uses the
 JobClient class to run the job but not on the newer api with
 the Job class.  The Job class appears to use a setup()
 method instead of a configure() method but the
 map.input.file attribute does not appear to be
 available via the conf class the setup() method.  Have
 you tried to do what you described using the newer api?
  Thank you.



 --- On Fri, 1/8/10, Gang Luo lgpub...@yahoo.com.cn wrote:


Re: Implementing VectorWritable

2009-12-29 Thread Tom White
Have a look at org.apache.hadoop.io.ArrayWritable. You may be able to
use this class in your application, or at least use it as a basis for
writing VectorWritable.
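The immediate problem in the code quoted below is that readFields() and write()
are empty, so nothing is ever serialized between the map and reduce phases. A
minimal sketch of the pair for a Vector<String>, following the
length-then-elements pattern that ArrayWritable uses (untested):

  public void write(DataOutput out) throws IOException {
    out.writeInt(value.size());        // element count first
    for (String s : value) {
      Text.writeString(out, s);        // then each element
    }
  }

  public void readFields(DataInput in) throws IOException {
    int size = in.readInt();
    value = new Vector<String>(size);
    for (int i = 0; i < size; i++) {
      value.add(Text.readString(in));
    }
  }

Note that the raw-byte Comparator in the class would also need updating to
match whatever encoding you choose.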

Cheers,
Tom

On Tue, Dec 29, 2009 at 1:37 AM, bharath v
bharathvissapragada1...@gmail.com wrote:
 Can you please tell me , what is the functionality of those 2 methods.
 (How should i implement the same in this VectorWritable) ..

 Thanks

 On Tue, Dec 29, 2009 at 11:25 AM, Jeff Zhang zjf...@gmail.com wrote:

 The readFields and write method is empty ?

 When data is transfered from map phase to reduce phase, data is serialized
 and deserialized , so the write and readFields will be called. You should
 not leave them empty.


 Jeff Zhang


 On Tue, Dec 29, 2009 at 1:29 PM, bharath v 
 bharathvissapragada1...@gmail.com wrote:

  Hi ,
 
  I've implemented a simple VectorWritable class as follows
 
 
  package com;
 
  import org.apache.hadoop.*;
  import org.apache.hadoop.io.*;
  import java.io.*;
  import java.util.Vector;
 
 
  public class VectorWritable implements WritableComparable {
   private Vector<String> value = new Vector<String>();
 
   public VectorWritable() {}
 
   public VectorWritable(Vector<String> value) { set(value); }
 
   public void set(Vector<String> val) { this.value = val;
   }
 
   public Vector<String> get() { return this.value; }
 
   public void readFields(DataInput in) throws IOException {
     //value = in.readInt();
   }
 
   public void write(DataOutput out) throws IOException {
   //  out.writeInt(value);
   }
 
   public boolean equals(Object o) {
     if (!(o instanceof VectorWritable))
       return false;
     VectorWritable other = (VectorWritable)o;
     return this.value.equals(other.value);
   }
 
   public int hashCode() {
     return value.hashCode();
   }
 
   public int compareTo(Object o) {
     Vector thisValue = this.value;
     Vector thatValue = ((VectorWritable)o).value;
     return (thisValue.size() < thatValue.size() ? -1 :
  (thisValue.size()==thatValue.size() ? 0 : 1));
   }
 
   public String toString() {
     return value.toString();
   }
 
   public static class Comparator extends WritableComparator {
     public Comparator() {
       super(VectorWritable.class);
     }
 
     public int compare(byte[] b1, int s1, int l1,
                        byte[] b2, int s2, int l2) {
 
       int thisValue = readInt(b1, s1);
       int thatValue = readInt(b2, s2);
       return (thisValue < thatValue ? -1 : (thisValue==thatValue ? 0 : 1));
     }
   }
 
   static {                                        // register this
  comparator
     WritableComparator.define(VectorWritable.class, new Comparator());
   }
  }
 
  The map phase is outputting correct Text,VectorWritable pairs .. but in
  reduce phase
  when I iterate over the values Iterable.. Iam getting the size of the
  vector
  to be 0; I think there is a minor
  mistake in my VectorWritable Implementation .. Can anyone point it..
 
  Thanks
 




Re: Configuration for Hadoop running on Amazon S3

2009-12-17 Thread Tom White
If you are using S3 as your file store then you don't need to run HDFS
(and indeed HDFS will not start up if you try).

Cheers,
Tom

2009/12/17 Rekha Joshi rekha...@yahoo-inc.com:
 Not sure what the whole error is, but you can always alternatively try this -
 <property>
  <name>fs.default.name</name>
  <value>s3://BUCKET</value>
 </property>

 <property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>ID</value>
 </property>

 <property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>SECRET</value>
 </property>

 And I am not sure what is the base hadoop version on S3, but possibly if S3 
 wiki is correct try updating conf/hadoop-site.xml

 Cheers,
 /R

 On 12/18/09 10:23 AM, 松柳 lamfeeli...@gmail.com wrote:

 Hi all,
    I tried to run my hadoop program on S3 by following this wiki page:
 http://wiki.apache.org/hadoop/AmazonS3
    I configured the core-site.xml by adding

 <property>
  <name>fs.default.name</name>
  <value>s3://ID:sec...@bucket</value>
 </property>

    and I specified the accesskey and secretkey by using the URI
 format:s3://ID:sec...@bucket

 however, it fails and datanodes reports:

 NumberFormatException
 ...

 Is this the right way to config hadoop running on s3? if so, whats the
 problem?

 Regards
 Song




Re: Master and slaves on hadoop/ec2

2009-11-25 Thread Tom White
Correct. The master runs the namenode and jobtracker, but not a
datanode or tasktracker.

Tom

On Tue, Nov 24, 2009 at 4:57 PM, Mark Kerzner markkerz...@gmail.com wrote:
 Hi,

 do I understand it correctly that, when I launch a Hadoop cluster on EC2,
 the master will not be doing any work, and it is just for organizing work,
 while the slaves will be actual workers?

 Thank you,
 Mark



Re: How do I reference S3 from an EC2 Hadoop cluster?

2009-11-25 Thread Tom White
On Tue, Nov 24, 2009 at 9:27 PM, Mark Kerzner markkerz...@gmail.com wrote:
 Yes, Tom, I saw all these problems. I think that I should stop trying to
 imitate EMR - that's where the storing data on S3 appeared, and transfer
 data directly to the Hadoop cluster. Then I will be using all as intended.

 Is there a way to scp directly to the HDFS, or do I need to scp to local
 storage on some machine, and then - to HDFS?

distcp is the appropriate tool for this. There is some guidance on
http://wiki.apache.org/hadoop/AmazonS3.

 Also, is there a way to make
 the master a bigger instance than that of the slaves?

No, this is not supported, but I can see it would be useful,
particularly for larger clusters. Please consider opening a JIRA for
it.

Cheers,
Tom


 Thank you,
 Mark

 On Tue, Nov 24, 2009 at 11:20 PM, Tom White t...@cloudera.com wrote:

 Mark,

 If the data was transferred to S3 outside of Hadoop then you should
 use the s3n filesystem scheme (see the explanation on
 http://wiki.apache.org/hadoop/AmazonS3 for the differences between the
 Hadoop S3 filesystems).

 Also, some people have had problems embedding the secret key in the
 URI, so you can set it in the configuration as follows:

 <property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>ID</value>
 </property>

 <property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>SECRET</value>
 </property>

 Then use a URI of the form s3n://BUCKET/path/to/logs

 Cheers,
 Tom

 On Tue, Nov 24, 2009 at 5:47 PM, Mark Kerzner markkerz...@gmail.com
 wrote:
  Hi,
 
  I need to copy data from S3 to HDFS. This instruction
 
  bin/hadoop distcp s3://ID:SECRET@BUCKET/path/to/logs logs
 
  does not seem to work.
 
  Thank you.
 




Re: How do I reference S3 from an EC2 Hadoop cluster?

2009-11-24 Thread Tom White
Mark,

If the data was transferred to S3 outside of Hadoop then you should
use the s3n filesystem scheme (see the explanation on
http://wiki.apache.org/hadoop/AmazonS3 for the differences between the
Hadoop S3 filesystems).

Also, some people have had problems embedding the secret key in the
URI, so you can set it in the configuration as follows:

<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>ID</value>
</property>

<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>SECRET</value>
</property>

Then use a URI of the form s3n://BUCKET/path/to/logs

Cheers,
Tom

On Tue, Nov 24, 2009 at 5:47 PM, Mark Kerzner markkerz...@gmail.com wrote:
 Hi,

 I need to copy data from S3 to HDFS. This instruction

 bin/hadoop distcp s3://ID:SECRET@BUCKET/path/to/logs logs

 does not seem to work.

 Thank you.



Re: Apache Hadoop and Fedora, or Clouder Hadoop and Ubuntu?

2009-11-15 Thread Tom White
Hi Mark,

HADOOP-6108 will add Cloudera's EC2 scripts to the Apache
distribution, with the difference that they will run Apache Hadoop.
The same scripts will also support Cloudera's Distribution for Hadoop,
simply by using a different boot script on the instances. So I would
suggest you use these scripts since they are more flexible than the
existing bash-based ones in Apache (e.g. they also support EBS), and
are likely to have more features added, and support more cloud
providers over time.

Hope this helps.

Tom

On Sun, Nov 15, 2009 at 7:31 PM, Mark Kerzner markkerz...@gmail.com wrote:
 Hi, guys,

 sorry for kind of making you do my work, but I have a conundrum. I have been
 developing on Ubuntu, and preferred to run the same Ubuntu Linux on EC2, and
 indeed, that is what Amazon Elastic MR was giving me.

 But now I am running my own cluster on EC2, and Apache Hadoop images are all
 on Fedora. I have already figured out the scripts and it all works - except
 that I have not tested on Fedora, and I do use Linux packages.

 Alternatively, I could run on Cloudera's Hadoop, and they have Ubuntu. But,
 I would probably to switch to their distribution in my code, and learn their
 startup scripts.

 Which way is better?

 Thank you,
 Mark



Re: Apache Hadoop and Fedora, or Clouder Hadoop and Ubuntu?

2009-11-15 Thread Tom White
On Sun, Nov 15, 2009 at 8:39 PM, Mark Kerzner markkerz...@gmail.com wrote:
 Tom,

 do I understand correctly that with these scripts I can use the Apache
 Hadoop configuration as I am used to, and run and EC2 image that contains
 Cloudera Hadoop distribution?

Yes, you can run Apache Hadoop with your existing configuration.


 PS. I could not download them from here,
 http://issues.apache.org/jira/secure/attachment/12422889/HADOOP-6108.patch,
 was getting, too many open files error.

I think this may be a transient problem (if it recurs you can report
it to in...@apache.org).


 Thank you,
 Mark

 On Sun, Nov 15, 2009 at 10:29 PM, Tom White t...@cloudera.com wrote:

 Hi Mark,

 HADOOP-6108 will add Cloudera's EC2 scripts to the Apache
 distribution, with the difference that they will run Apache Hadoop.
 The same scripts will also support Cloudera's Distribution for Hadoop,
 simply by using a different boot script on the instances. So I would
 suggest you use these scripts since they are more flexible than the
 existing bash-based ones in Apache (e.g. they also support EBS), and
 are likely to have more features added, and support more cloud
 providers over time.

 Hope this helps.

 Tom

 On Sun, Nov 15, 2009 at 7:31 PM, Mark Kerzner markkerz...@gmail.com
 wrote:
  Hi, guys,
 
  sorry for kind of making you do my work, but I have a conundrum. I have
 been
  developing on Ubuntu, and preferred to run the same Ubuntu Linux on EC2,
 and
  indeed, that is what Amazon Elastic MR was giving me.
 
  But now I am running my own cluster on EC2, and Apache Hadoop images are
 all
  on Fedora. I have already figured out the scripts and it all works -
 except
  that I have not tested on Fedora, and I do use Linux packages.
 
  Alternatively, I could run on Cloudera's Hadoop, and they have Ubuntu.
 But,
  I would probably to switch to their distribution in my code, and learn
 their
  startup scripts.
 
  Which way is better?
 
  Thank you,
  Mark
 




Re: Confused by new API MultipleOutputFormats using Hadoop 0.20.1

2009-11-08 Thread Tom White
Multiple outputs has been ported to the new API in 0.21. See
https://issues.apache.org/jira/browse/MAPREDUCE-370.
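With the new API the named output is registered against the Job rather than a
JobConf, and you write to it through a MultipleOutputs instance in the task -
roughly like this (untested; class names reused from Paul's example below):

  // driver side (org.apache.hadoop.mapreduce.lib.output.MultipleOutputs)
  MultipleOutputs.addNamedOutput(job, "metrics",
      TextOutputFormat.class, Text.class, ControllerMetric.class);

  // reducer side: create in setup(), close in cleanup()
  // mos = new MultipleOutputs<Text, ControllerMetric>(context);
  // mos.write("metrics", key, value, generateFileName(key));
  // (generateFileName is a placeholder for whatever per-key naming you want)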

Cheers,
Tom

On Sat, Nov 7, 2009 at 6:45 AM, Xiance SI(司宪策) adam...@gmail.com wrote:
 I just fall back to old mapred.* APIs, seems MultipleOutputs only works for
 the old API.

 wishes,
 Xiance

 On Mon, Nov 2, 2009 at 9:12 AM, Paul Smith psm...@aconex.com wrote:

 Totally stuck here, I can't seem to find a way to resolve this, but I can't
 use the new API _and_ use the MultipleOutputFormats class.

 I found this thread which is related, but doesn't seem to help me (or I
 missed something completely, certainly possible):


 http://markmail.org/message/u4wz5nbcn5rawydq#query:hadoop%20MultipleTextOutputFormat%20OutputFormat%20Job%20JobConf+page:1+mid:5wy63oqa2vs6bj7b+state:results

 My controller Job class is simple, but I get a compile error trying to add
 the new MultipleOutputs:

 public class ControllerMetricGrinder {

    public static class MetricNameMultipleTextOutputFormat extends
            MultipleTextOutputFormat<String, ControllerMetric> {

        @Override
        protected String generateFileNameForKeyValue(String key,
 ControllerMetric value, String name) {
            return key;
        }

    }
    public static void main(String[] args) throws Exception {

        Job job = new Job();
        job.setJarByClass(ControllerMetricGrinder.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(ControllerMetric.class);

        job.setMapperClass(ControllerMetricMapper.class);

        job.setCombinerClass(ControllerMetricReducer.class);
        job.setReducerClass(ControllerMetricReducer.class);

        // COMPILE ERROR HERE
        MultipleOutputs.addMultiNamedOutput(job, "metrics",
                MetricNameMultipleTextOutputFormat.class,
                Text.class, ControllerMetric.class);

        job.setNumReduceTasks(5);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
 }

 (mappers and reducers are using the new API, and are in separate classes).

 MultipleOutputs doesn't take a Job, it only takes a JobConf.  Any ideas?
  I'd prefer to use the new API (because I've written it that way), but I'm
 guessing now I'll have to go and rework everything to the OLD API to get
 this to work.

 I'm trying to create a File-per-metric name (there's only 5).

 thoughts?

 Paul




Re: Multiple Input Paths

2009-11-08 Thread Tom White
MultipleInputs is available from Hadoop 0.19 onwards (in
org.apache.hadoop.mapred.lib, or org.apache.hadoop.mapreduce.lib.input
for the new API in later versions).
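For the two-XML-inputs case this lets you attach a different input format and
mapper to each path - a sketch (untested; the format and mapper class names are
placeholders):

  // old API (org.apache.hadoop.mapred.lib.MultipleInputs)
  MultipleInputs.addInputPath(conf, new Path("fileA.xml"),
      XmlInputFormatA.class, MapperA.class);
  MultipleInputs.addInputPath(conf, new Path("fileB.xml"),
      XmlInputFormatB.class, MapperB.class);

The new-API version takes a Job instead of a JobConf.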

Tom

On Wed, Nov 4, 2009 at 8:07 AM, Mark Vigeant
mark.vige...@riskmetrics.com wrote:
 Amogh,

 That sounds so awesome! Yeah I wish I had that class now. Do you have any 
 tips on how to create such a delegating class? The best I can come up with is 
 to just submit both files to the mapper using multiple input paths and then 
 having anif statement at the beginning of the map that checks which file it's 
 dealing with but I'm skeptical that I can even make that work... Is there a 
 way you know of that I could submit 2 mapper classes to the job?

 -Original Message-
 From: Amogh Vasekar [mailto:am...@yahoo-inc.com]
 Sent: Wednesday, November 04, 2009 1:50 AM
 To: common-user@hadoop.apache.org
 Subject: Re: Multiple Input Paths

 Hi Mark,
 A future release of Hadoop will have a MultipleInputs class, akin to 
 MultipleOutputs. This would allow you to have a different inputformat, mapper 
 depending on the path you are getting the split from. It uses special 
 Delegating[mapper/input] classes to resolve this. I understand backporting 
 this is more or less out of question, but the ideas there might provide 
 pointers to help you solve your current problem.
 Just a thought :)

 Amogh


 On 11/3/09 8:44 PM, Mark Vigeant mark.vige...@riskmetrics.com wrote:

 Hey Vipul

 No I haven't concatenated my files yet, and I was just thinking over how to 
 approach the issue of multiple input paths.

 I actually did what Amandeep hinted at which was we wrote our own 
 XMLInputFormat and XMLRecordReader. When configuring the job in my driver I 
 set job.setInputFormatClass(XMLFileInputFormat.class) and what it does is 
 send chunks of XML to the mapper as opposed to lines of text or whole files. 
 So I specified the Line Delimiter in the XMLRecordReader (ie startTag) and 
 everything in between the tags startTag and /startTag are sent to the 
 mapper. Inside the map function is where to parse the data and write it to 
 the table.

 What I have to do now is just figure out how to set the Line Delimiter to be 
 something common in both XML files I'm reading. Currently I have 2 mapper 
 classes and thus 2 submitted jobs which is really inefficient and time 
 consuming.

 Make sense at all? Sorry if it doesn't, feel free to ask more questions

 Mark

 -Original Message-
 From: Vipul Sharma [mailto:sharmavi...@gmail.com]
 Sent: Monday, November 02, 2009 7:48 PM
 To: common-user@hadoop.apache.org
 Subject: RE: Multiple Input Paths

 Mark,

 were you able to concatenate both the xml files together. What did you do to
 keep the resulting xml well forned?

 Regards,
 Vipul Sharma,
 Cell: 281-217-0761




Re: Terminate Instances Terminating ALL EC2 Instances

2009-10-19 Thread Tom White
Hi Mark,

Sorry to hear that all your EC2 instances were terminated. Needless to
say, this should certainly not happen.

The scripts are a Python rewrite (see HADOOP-6108) of the bash ones so
HADOOP-1504 is not applicable, but the behaviour should be the same:
the terminate-cluster command lists the instances that it will
terminate, and prompts for confirmation that they should be
terminated. Is it listing instances that are not in the cluster? I
have used this script a lot and it has never terminated any instances
that are not in the cluster.

What are the names of the security groups that the instances are in
(both those in the cluster, and those outside the cluster that are
inadvertently terminated)?

Thanks,
Tom

On Mon, Oct 19, 2009 at 4:41 PM, Mark Stetzer stet...@gmail.com wrote:
 Hey all,

 While running the (latest as of Friday) Cloudera-created EC2 scripts,
 I noticed that running the terminate-cluster script kills ALL of your
 EC2 nodes, not just those associated with the cluster.  This has been
 documented before in HADOOP-1504
 (http://issues.apache.org/jira/browse/HADOOP-1504), and a fix was
 integrated way back on June 21, 2007.  My questions are:

 1)  Is anyone else seeing this?  I can reproduce this behavior consistently.
 AND
 2)  Is this a regression in the common code, a problem with the
 Cloudera scripts, or just user error on my part?

 Just trying to get to the bottom of this so no one else has to see all
 of their EC2 instances die accidentally :(

 Thanks!

 -Mark



Re: Terminate Instances Terminating ALL EC2 Instances

2009-10-19 Thread Tom White
On Mon, Oct 19, 2009 at 5:34 PM, Mark Stetzer stet...@gmail.com wrote:
 Hi Tom,

 The terminate-cluster script only lists the instances that are part of
 the cluster (master and all slaves) as far as I can tell.  As an
 example, I set up a cluster of 1 master and 5 slaves, then started an
 additional non-Hadoop server via the AWS mgmt. console running a
 completely different AMI (OpenSolaris 2009.06 just to be very
 different).  terminate-cluster only listed the 6 instances that were
 part of the cluster if I remember correctly.

 I have 4 security groups:  default, default-master, default-slave, and
 mark-default.  mark-default wasn't even added until after I started
 the Hadoop cluster; I added it to log in to the OpenSolaris instance.

I think there is a bug here. I've filed
https://issues.apache.org/jira/browse/HADOOP-6320. As an immediate
workaround you can avoid calling the Hadoop cluster default, and
make sure that you don't create non-Hadoop EC2 instances in the
cluster group.

Thanks,
Tom


 Does this help at all?  Thanks.

 -Mark

 On Mon, Oct 19, 2009 at 11:52 AM, Tom White t...@cloudera.com wrote:
 Hi Mark,

 Sorry to hear that all your EC2 instances were terminated. Needless to
 say, this should certainly not happen.

 The scripts are a Python rewrite (see HADOOP-6108) of the bash ones so
 HADOOP-1504 is not applicable, but the behaviour should be the same:
 the terminate-cluster command lists the instances that it will
 terminate, and prompts for confirmation that they should be
 terminated. Is it listing instances that are not in the cluster? I
 have used this script a lot and it has never terminated any instances
 that are not in the cluster.

 What are the names of the security groups that the instances are in
 (both those in the cluster, and those outside the cluster that are
 inadvertently terminated)?

 Thanks,
 Tom

 On Mon, Oct 19, 2009 at 4:41 PM, Mark Stetzer stet...@gmail.com wrote:
 Hey all,

 While running the (latest as of Friday) Cloudera-created EC2 scripts,
 I noticed that running the terminate-cluster script kills ALL of your
 EC2 nodes, not just those associated with the cluster.  This has been
 documented before in HADOOP-1504
 (http://issues.apache.org/jira/browse/HADOOP-1504), and a fix was
 integrated way back on June 21, 2007.  My questions are:

 1)  Is anyone else seeing this?  I can reproduce this behavior consistently.
 AND
 2)  Is this a regression in the common code, a problem with the
 Cloudera scripts, or just user error on my part?

 Just trying to get to the bottom of this so no one else has to see all
 of their EC2 instances die accidentally :(

 Thanks!

 -Mark





Re: JobTracker startup failure when starting hadoop-0.20.0 cluster on Amazon EC2 with contrib/ec2 scripts

2009-09-07 Thread Tom White
Hi Jeyendran,

Were there any errors reported in the datanode logs? There could be a
problem with datanodes contacting the namenode, caused by firewall
configuration problems (EC2 security groups).

Cheers,
Tom

On Fri, Sep 4, 2009 at 12:17 AM, Jeyendran
Balakrishnanjbalakrish...@docomolabs-usa.com wrote:
 I downloaded Hadoop 0.20.0 and used the src/contrib/ec2/bin scripts to
 launch a Hadoop cluster on Amazon EC2, after building a new Hadoop
 0.20.0 AMI.

 I launched an instance with my new Hadoop 0.20.0 AMI, then logged in and
 ran the following to launch a new cluster:
 root(/vol/hadoop-0.20.0) bin/launch-hadoop-cluster hadoop-test 2

 After the usual EC2 wait, one master and two slave instances were
 launched on EC2, as expected. When I ssh'ed into the instances, here is
 what I found:

 Slaves: DataNode and NameNode are running
 Master: Only NameNode is running

 I could use HDFS commands (using $HADOOP_HOME/bin/hadoop scripts)
 without any problems, from both master and slaves. However, since
 JobTracker is not running, I cannot run map-reduce jobs.

 I checked the logs from /vol/hadoop-0.20.0/logs for the JobTracker,
 reproduced below:
 
 
 2009-09-03 18:55:38,486 WARN org.apache.hadoop.conf.Configuration:
 DEPRECATED: hadoop-site.xml found in the classpath. Usage of
 hadoop-site.xml is deprecated. Instead use core-site.xml,
 mapred-site.xml and h
 dfs-site.xml to override properties of core-default.xml,
 mapred-default.xml and hdfs-default.xml respectively
 2009-09-03 18:55:38,520 INFO org.apache.hadoop.mapred.JobTracker:
 STARTUP_MSG:
 /
 STARTUP_MSG: Starting JobTracker
 STARTUP_MSG:   host =
 domU-12-31-39-06-44-E3.compute-1.internal/10.208.75.17
 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 0.20.0
 STARTUP_MSG:   build =
 https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20 -r
 763504; compiled by 'ndaley' on Thu Apr  9 05:18:40 UTC 2009
 /
 2009-09-03 18:55:38,652 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
 Initializing RPC Metrics with hostName=JobTracker, port=50002
 2009-09-03 18:55:38,703 INFO org.mortbay.log: Logging to
 org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
 org.mortbay.log.Slf4jLog
 2009-09-03 18:55:38,827 INFO org.apache.hadoop.http.HttpServer: Jetty
 bound to port 50030
 2009-09-03 18:55:38,827 INFO org.mortbay.log: jetty-6.1.14
 2009-09-03 18:55:48,425 INFO org.mortbay.log: Started
 selectchannelconnec...@0.0.0.0:50030
 2009-09-03 18:55:48,427 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
 Initializing JVM Metrics with processName=JobTracker, sessionId=
 2009-09-03 18:55:48,432 INFO org.apache.hadoop.mapred.JobTracker:
 JobTracker up at: 50002
 2009-09-03 18:55:48,432 INFO org.apache.hadoop.mapred.JobTracker:
 JobTracker webserver: 50030
 2009-09-03 18:55:48,541 INFO org.apache.hadoop.mapred.JobTracker:
 Cleaning up the system directory
 2009-09-03 18:55:48,628 INFO org.apache.hadoop.hdfs.DFSClient:
 org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
 /mnt/hadoop/mapred/system/jobtracker.info could only be replicated to 0
 nodes, instead of 1
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1256)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

        at org.apache.hadoop.ipc.Client.call(Client.java:739)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at $Proxy4.addBlock(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at $Proxy4.addBlock(Unknown Source)
        at
 

Re: Can't find TestDFSIO

2009-08-24 Thread Tom White
Hi Cam,

Looks like it's in hadoop-hdfs-hdfswithmr-test-0.21.0-dev.jar, which
should be built with ant jar-test.

Cheers,
Tom

On Mon, Aug 24, 2009 at 8:22 PM, Cam Macdonellc...@cs.ualberta.ca wrote:

 Thanks Danny,

 It currently does not show up in hadoop-common-test, hadoop-hdfs-test or
 hadoop-mapred-test with 0.21-dev.  So either it has been a victim of the
 project split or I didn't specify the right target for Ant.

 Cam

 Gross, Danny wrote:

 Hi Cam,

 For what it's worth, in 19.1, I see TestDFSIO in the
 hadoop-0.19.1-test.jar.
 Best regards,

 Danny

 -Original Message-
 From: Cam Macdonell [mailto:c...@cs.ualberta.ca] Sent: Monday, August 24,
 2009 12:00 PM
 To: common-user@hadoop.apache.org
 Subject: Can't find TestDFSIO


 Hi,

 I'm trying to run the TestDFSIO benchmark that is mentioned in the Hadoop
 O'Reilly book.  However, I can't find it in any of the jars (common, mapred
 or hdfs).

 For example, I presume it would be under hdfs, but the only mentioned test
 is 'dfsthroughput'.

 $ ./bin/hadoop jar
 /home/cam/research/SVN/hadoop/lib/hadoop-hdfs-test-0.21.0-dev.jar
 An example program must be given as the first argument.
 Valid program names are:
   dfsthroughput: measure hdfs throughput

 Has the name of TestDFSIO changed or am I looking in the wrong place?

 Any tips or pointers are appreciated,
 Cam



Re: File Chunk to Map Thread Association

2009-08-20 Thread Tom White
Hi Roman,

Have a look at CombineFileInputFormat - it might be related to what
you are trying to do.

Cheers,
Tom

On Thu, Aug 20, 2009 at 10:59 AM, roman kolcunroman.w...@gmail.com wrote:
 On Thu, Aug 20, 2009 at 10:30 AM, Harish Mallipeddi 
 harish.mallipe...@gmail.com wrote:

 On Thu, Aug 20, 2009 at 2:39 PM, roman kolcun roman.w...@gmail.com
 wrote:

 
  Hello Harish,
 
  I know that TaskTracker creates separate threads (up to
  mapred.tasktracker.map.tasks.maximum) which execute the map() function.
  However, I haven't found the piece of code which associates a FileSplit with
  the given map thread. Is it downloaded locally in the TaskTracker function
  or in MapTask?
 
 
 
 Yes this is done by the MapTask.


 Thanks, I will have a better look into it.



 
  I know I can increase the input file size by changing
  'mapred.min.split.size', however, the file is split sequentially and very
  rarely are two consecutive HDFS blocks stored on a single node. This means
  that the data locality will not be exploited because every map() will have
  to download part of the file over the network.
 
  Roman Kolcun
 

 I see what you mean - you want to modify the hadoop code to allocate
 multiple (non-sequential) data-local blocks to one MapTask.


 That's exactly what I want to do.


 I don't know if you'll achieve much by doing all that work.


 Basically I would like to emulate a larger DFS block size. I've performed two
 word count benchmarks on a cluster of 10 machines with a 100GB file. With a
 64MB block size it took 2035 seconds; when I increased it to 256MB it took
 1694 seconds - a 16.76% reduction in run time.


 Hadoop lets you reuse the
 launched JVMs for multiple MapTasks. That should minimize the overhead of
 launching MapTasks.
 Increasing the DFS blocksize for the input files is another means to achieve
 the same effect.

 Do you think that this overhead could be eliminated by reusing JVMs?
 I am doing it as a project for my university degree so I really hope it will
 lower the processing time significantly. I would like to make it general for
 different block sizes.

 Thank you for your help.

 Roman Kolcun
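
A minimal sketch of the split-size tuning discussed in this thread, using the
old JobConf API. The class name and paths are placeholders, and the
mapper/reducer are left as the identity defaults, so this only shows where the
setting goes; the locality trade-off Roman describes still applies, since a
large split may span blocks held on several nodes.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class BigSplitJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(BigSplitJob.class);
    conf.setJobName("big-split-job");

    // Ask FileInputFormat for splits of at least ~256 MB instead of one
    // split per 64 MB block. Each split may then cover several blocks,
    // some of which will be read over the network.
    conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);

    // Identity mapper/reducer defaults are used here; a real job would
    // set its own mapper, reducer and output types.
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}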



Re: MapFile performance

2009-08-03 Thread Tom White
On Mon, Aug 3, 2009 at 3:09 AM, Billy
Pearsonbilly_pear...@sbcglobal.net wrote:


 not sure if it's still there, but there was a param in the hadoop-site conf
 file that would allow you to skip x number of index entries when reading it
 into memory.

This is io.map.index.skip (default 0), which will skip this number of
keys for every key in the index. For example, if set to 2, one third
of the keys will end up in memory.

 From what I understand we can find the key offset just before the data and
 seek once and read until we find the key.

 Billy


 - Original Message - From: Andy Liu
 andyliu1227-re5jqeeqqe8avxtiumw...@public.gmane.org
 Newsgroups: gmane.comp.jakarta.lucene.hadoop.user
 To: core-user-7ArZoLwFLBtd/SJB6HiN2Ni2O/jbr...@public.gmane.org
 Sent: Tuesday, July 28, 2009 7:53 AM
 Subject: MapFile performance


 I have a bunch of Map/Reduce jobs that process documents and writes the
 results out to a few MapFiles.  These MapFiles are subsequently searched
 in
 an interactive application.

 One problem I'm running into is that if the values in the MapFile data file
 are fairly large, lookup can be slow.  This is because the MapFile index
 only stores every 128th key by default (io.map.index.interval), and after
 the binary search it may have to scan/skip through up to 127 values (off of
 disk) before it finds the matching record.  I've tried io.map.index.interval
 = 1, which brings average get() times from 1200ms to 200ms, but at the cost
 of memory during runtime, which is undesirable.

 One possible solution is to have the MapFile index store every single key,
 offset pair.  Then MapFile.Reader, upon startup, would read every 128th key
 in memory.  MapFile.Reader.get() would behave the same way except instead of
 seeking through the values SequenceFile it would seek through the index
 SequenceFile until it finds the matching record, and then it can seek to the
 corresponding offset in the values.  I'm going off the assumption that it's
 much faster to scan through the index (small keys) than it is to scan
 through the values (large values).

 Or maybe the index can be some kind of disk-based btree or bdb-like
 implementation?

 Anybody encounter this problem before?

 Andy
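
A minimal sketch of the io.map.index.skip setting Tom mentions above. The path
and key are placeholders, and the MapFile is assumed to use Text keys and
values; with a skip of 2 only every third index key is kept in memory, trading
some extra on-disk scanning in get() for a smaller heap.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // For every index key loaded, skip the next two, so roughly one third
    // of the index ends up in memory.
    conf.setInt("io.map.index.skip", 2);

    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader reader = new MapFile.Reader(fs, "/data/my-mapfile", conf);
    try {
      Text value = new Text();
      if (reader.get(new Text("some-key"), value) != null) {
        System.out.println(value);
      }
    } finally {
      reader.close();
    }
  }
}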






Re: Status of 0.19.2

2009-08-03 Thread Tom White
I've now updated the news section, and the documentation on the
website to reflect the 0.19.2 release.

There were several reports of it being more stable than 0.19.1 in the
voting thread: 
http://www.mail-archive.com/common-...@hadoop.apache.org/msg00051.html

Cheers,
Tom

On Tue, Jul 28, 2009 at 12:37 PM, Tamir Kamara tamirkam...@gmail.com wrote:

 Hi,

 I've seen that the 0.19.2 version was added recently to the downloads but
 there's no entry under the news section.
 Is it stable enough for deployment?

 Thanks,
 Tamir


Re: Reading GZIP input files.

2009-07-31 Thread Tom White
That's for the case where you want to do the decompression yourself,
explicitly, perhaps when you are reading the data out of HDFS (and not
using MapReduce).  When compressed files are used as input to a MapReduce
job, Hadoop will automatically decompress them for you.

Tom

On Fri, Jul 31, 2009 at 5:34 PM, David Beendaveb...@gmail.com wrote:
 I'm new, reading Tom White's book, but there is an example using:

 CompressionCodecFactory factory = new CompressionCodecFactory(conf);
 CompressionCodec codec = factory.getCodec(inputPath); // infers from file ext.
 InputStream in = codec.createInputStream(fs.open(inputPath));

 On Fri, Jul 31, 2009 at 8:01 AM, prashant
 ullegaddiprashullega...@gmail.com wrote:
 Hi guys,

 I have a set of 1000 gzipped plain text files. How to read them in Hadoop?
 Is there any built-in class available for it?

 Btw, I'm using hadoop-0.18.3.

 Regards,
 Prashant.
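
A self-contained version of the snippet David quotes, for reading a single
gzipped file directly from HDFS outside MapReduce; the path is a placeholder.
As Tom notes, a MapReduce job over .gz inputs needs none of this - the
framework picks the codec from the file extension itself.

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class ReadGzippedFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputPath = new Path("/logs/input-00000.gz"); // placeholder path

    // The factory infers the codec from the file extension (.gz -> gzip).
    CompressionCodecFactory factory = new CompressionCodecFactory(conf);
    CompressionCodec codec = factory.getCodec(inputPath);

    InputStream in = (codec != null)
        ? codec.createInputStream(fs.open(inputPath))
        : fs.open(inputPath); // no codec matched, read the file as-is
    BufferedReader reader = new BufferedReader(new InputStreamReader(in));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    } finally {
      IOUtils.closeStream(reader);
    }
  }
}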




Re: Using JobControl in hadoop

2009-07-17 Thread Tom White
Hi Raakhi,

JobControl is designed to be run from a new thread:

Thread t = new Thread(jobControl);
t.start();

Then you can run a loop to poll for job completion and print out status:

String oldStatus = null;
while (!jobControl.allFinished()) {
  String status = getStatusString(jobControl);
  if (!status.equals(oldStatus)) {
System.out.println(status);
oldStatus = status;
  }
  try {
Thread.sleep(1000);
  } catch (InterruptedException e) {
// ignore
  }
}

Hope this helps.

Tom

On Fri, Jul 17, 2009 at 9:10 AM, Rakhi Khatwanirakhi.khatw...@gmail.com wrote:
 Hi,
       I was trying out a map-reduce example using JobControl.
 I create a JobConf object, conf1, and add the necessary information;
 then I create a job object:
 Job job1 = new Job(conf1);

 and then I declare the JobControl object as follows:
 JobControl jobControl = new JobControl("JobControl1");
                  jobControl.addJob(job1);
                  jobControl.run();



 When I execute it in the console, I get the following output:
 09/07/17 13:10:16 WARN mapred.JobClient: Use GenericOptionsParser for
 parsing the arguments. Applications should implement Tool for the same.
 09/07/17 13:10:16 INFO mapred.FileInputFormat: Total input paths to process : 4




 and there is no other output, but from the UI I can see that the job has
 been executed. Is there any way I can direct the output to the console?
 Or is there any way in which, while the job is running, I can continue
 processing from main? (I want to try suspending/stopping jobs, etc.)

 Regards,
 Raakhi
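
A slightly fuller sketch along the lines of Tom's reply, using the old
org.apache.hadoop.mapred.jobcontrol API. The two JobConfs are placeholders (a
real driver would configure input/output paths, mapper and reducer); job2 is
declared to depend on job1, and the monitoring loop in main() replaces the
blocking jobControl.run() call from Raakhi's original code.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class JobControlDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf1 = new JobConf(); // placeholder: configure the first job
    JobConf conf2 = new JobConf(); // placeholder: configure the second job

    Job job1 = new Job(conf1);
    Job job2 = new Job(conf2);
    job2.addDependingJob(job1); // job2 starts only after job1 succeeds

    JobControl jobControl = new JobControl("JobControl1");
    jobControl.addJob(job1);
    jobControl.addJob(job2);

    // Run the controller on its own thread so main() is free to poll
    // status, print progress, or stop the group.
    Thread t = new Thread(jobControl);
    t.setDaemon(true);
    t.start();

    while (!jobControl.allFinished()) {
      System.out.println("running: " + jobControl.getRunningJobs().size()
          + ", successful: " + jobControl.getSuccessfulJobs().size()
          + ", failed: " + jobControl.getFailedJobs().size());
      Thread.sleep(1000);
    }
    jobControl.stop();
  }
}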



Re: access Configuration object in Partioner??

2009-07-14 Thread Tom White
Hi Jianmin,

Partitioner extends JobConfigurable, so you can implement the
configure() method to access the JobConf.

Hope that helps.

Cheers,
Tom

On Tue, Jul 14, 2009 at 10:27 AM, Jianmin Woojianmin_...@yahoo.com wrote:
 Hi,

 I am considering implementing a Partitioner that needs to access the
 parameters in the job's Configuration. However, there is no straightforward
 way to do this. Are there any suggestions?

 Thanks,
 Jianmin
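
A minimal sketch of what Tom describes for the old API: a Partitioner whose
configure() method reads a value from the JobConf before any records are
partitioned. The property name and key/value types are made up for
illustration.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

public class ConfiguredPartitioner implements Partitioner<Text, IntWritable> {

  private int buckets;

  // Partitioner extends JobConfigurable, so the framework calls this with
  // the job's configuration before getPartition() is ever invoked.
  public void configure(JobConf job) {
    // "my.partitioner.buckets" is a made-up property for this example.
    buckets = job.getInt("my.partitioner.buckets", 16);
  }

  public int getPartition(Text key, IntWritable value, int numPartitions) {
    int bucket = (key.hashCode() & Integer.MAX_VALUE) % buckets;
    return bucket % numPartitions;
  }
}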






Re: more than one reducer in standalone mode

2009-07-14 Thread Tom White
There's a Jira to fix this here:
https://issues.apache.org/jira/browse/MAPREDUCE-434

Tom

On Mon, Jul 13, 2009 at 12:34 AM, jason hadoopjason.had...@gmail.com wrote:
 If the jobtracker is set to local, there is no way to have more than 1
 reducer.

 On Sun, Jul 12, 2009 at 12:21 PM, Rares Vernica rvern...@gmail.com wrote:

 Hello,

 Is it possible to have more than one reducer in standalone mode? I am
 currently using 0.17.2.1 and I do:

 job.setNumReduceTasks(4);

 before starting the job and it seems that Hadoop overrides the
 variable, as it says:

 09/07/12 12:07:40 INFO mapred.MapTask: numReduceTasks: 1

 Thanks!
 Rares




 --
 Pro Hadoop, a book to guide you from beginner to hadoop mastery,
 http://www.amazon.com/dp/1430219424?tag=jewlerymall
 www.prohadoopbook.com a community for Hadoop Professionals



Re: access Configuration object in Partioner??

2009-07-14 Thread Tom White
Hi Jianmin,

Sorry - I (incorrectly) assumed you were using the old API.
Partitioners don't yet work with the new API (see
https://issues.apache.org/jira/browse/MAPREDUCE-565). However, when
they do you can make your Partitioner implement Configurable (by
extending Configured, for example), and this will give you access to
the job configuration, since the framework will set it for you on the
partitioner.

Cheers
Tom

On Tue, Jul 14, 2009 at 12:46 PM, Jianmin Woojianmin_...@yahoo.com wrote:
 Thanks a lot for your information, Tom.

 I am using the org.apache.hadoop.mapreduce.Partitioner in 0.20. It seems that 
 the org.apache.hadoop.mapred.Partitioner is deprecated and will be removed in 
 the future.
 Do you have some suggestions on this?

 Thanks,
 Jianmin




 
 From: Tom White t...@cloudera.com
 To: common-user@hadoop.apache.org
 Sent: Tuesday, July 14, 2009 6:03:34 PM
 Subject: Re: access Configuration object in Partioner??

 Hi Jianmin,

 Partitioner extends JobConfigurable, so you can implement the
 configure() method to access the JobConf.

 Hope that helps.

 Cheers,
 Tom

 On Tue, Jul 14, 2009 at 10:27 AM, Jianmin Woojianmin_...@yahoo.com wrote:
 Hi,

 I am considering implementing a Partitioner that needs to access the
 parameters in the job's Configuration. However, there is no straightforward
 way to do this. Are there any suggestions?

 Thanks,
 Jianmin
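
For the new API, a sketch of the Configurable approach Tom describes, assuming
a release in which MAPREDUCE-565 has landed so that custom partitioners are
actually honoured and the framework sets the configuration on them; the
property name is again made up.

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class ConfiguredNewApiPartitioner
    extends Partitioner<Text, IntWritable> implements Configurable {

  private Configuration conf;
  private int buckets = 16;

  // The framework sets the job configuration here when it instantiates
  // the partitioner, because the class implements Configurable.
  public void setConf(Configuration conf) {
    this.conf = conf;
    buckets = conf.getInt("my.partitioner.buckets", 16); // made-up property
  }

  public Configuration getConf() {
    return conf;
  }

  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    return ((key.hashCode() & Integer.MAX_VALUE) % buckets) % numPartitions;
  }
}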










Re: Restarting a killed job from where it left

2009-07-13 Thread Tom White
Hi Akhil,

Have a look at the mapred.jobtracker.restart.recover property.

Cheers,
Tom

On Sun, Jul 12, 2009 at 12:06 AM, akhil1988akhilan...@gmail.com wrote:

 Hi All,

 I am looking for ways to restart my Hadoop job from where it left off when the
 entire cluster goes down or the job gets stopped for some reason, i.e. I am
 looking for ways in which I can store the status of my job at regular
 intervals, so that when I restart the job it starts from where it left off
 rather than starting from the beginning again.

 Can anyone please give me some reference to read about the ways to handle
 this.

 Thanks,
 Akhil
 --
 View this message in context: 
 http://www.nabble.com/Restarting-a-killed-job-from-where-it-left-tp2618p2618.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.