Re: Creating Sequence File in C++

2009-11-27 Thread Owen O'Malley
On Fri, Nov 27, 2009 at 7:07 PM, Saptarshi Guha wrote:

Let my Key-Value be something like BinaryWritables (my own class, but
> something like this).  Is there a way to create the Sequence File
> composed of several such key - values, without using Java?
>

There is not a C++ implementation of SequenceFiles. (If you write one,
please consider contributing it back.)

A different approach would be to write a map-only Pipes (C++) MapReduce program
that reads the data and uses SequenceFileOutputFormat for its output. The
map can emit key/value pairs as std::strings containing the bytes you want
to write.

-- Owen


Creating Sequence File in C++

2009-11-27 Thread Saptarshi Guha
Hello,

Let my Key-Value be something like BinaryWritables (my own class, but
something like this).  Is there a way to create the Sequence File
composed of several such key - values, without using Java?

Background:

I create objects using protocol buffers; my keys and values are
serialized versions of these protocol buffer messages. The Hadoop key-value
pairs exchanged in the mapreduce job (and stored in both input and output)
are these serialized forms.

I would like to directly create sequence files using C++
and was curious if there is a way to do this outside Java (and without
having to use JNI). Currently, my best option is to use a mapreduce job
to convert my text files to sequence files.
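For illustration, a minimal sketch of such a conversion job with the old mapred
API might look like the following (the class names and the BytesWritable choice
are assumptions for the example, not something from the original post):

import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Map-only job: reads text lines with the default TextInputFormat and writes
// them into a SequenceFile as BytesWritable key/value pairs.
public class TextToSequenceFile {

  public static class ConvertMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, BytesWritable, BytesWritable> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<BytesWritable, BytesWritable> out,
                    Reporter reporter) throws IOException {
      // In the real job these bytes would be the protocol-buffer serializations.
      byte[] bytes = Arrays.copyOf(line.getBytes(), line.getLength());
      out.collect(new BytesWritable(Long.toString(offset.get()).getBytes()),
                  new BytesWritable(bytes));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(TextToSequenceFile.class);
    conf.setJobName("text-to-seqfile");
    conf.setMapperClass(ConvertMapper.class);
    conf.setNumReduceTasks(0);                            // map-only
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    conf.setOutputKeyClass(BytesWritable.class);
    conf.setOutputValueClass(BytesWritable.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}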



Thank you
Saptarshi


Re: Processing 10MB files in Hadoop

2009-11-27 Thread Aaron Kimball
By default you get at least one task per file; if any file is bigger than a
block, then that file is broken up into N tasks where each is one block
long. Not sure what you mean by "properly calculate" -- as long as you have
more tasks than you have cores, then you'll definitely have work for every
core to do; having more tasks with high granularity will also let nodes that
get "small" tasks to complete many of them while other cores are stuck with
the "heavier" tasks.

If you call setNumMapTasks() with a higher number of tasks than the
InputFormat creates (via the algorithm above), then it should create
additional tasks by dividing files up into smaller chunks (which may be
sub-block-sized).
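For concreteness, a hedged sketch of that hint with the old API (the class name
and the target of 200 tasks are arbitrary examples):

JobConf conf = new JobConf(MyJob.class);   // MyJob is a placeholder driver class
// Hint that we'd like about 200 map tasks; FileInputFormat uses this to shrink
// the target split size, so files may be divided into sub-block-sized chunks.
conf.setNumMapTasks(200);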

As for where you should run your computation.. I don't know that the "map"
and "reduce" phases are really "optimized" for computation in any particular
way. It's just a data motion thing. (At the end of the day, it's your code
doing the processing on either side of the fence, which should dominate the
execution time.) If you use an identity mapper with a pseudo-random key to
spray the data into a bunch of reduce partitions, then you'll get a bunch of
reducers each working on a hopefully-evenly-sized slice of the data. So the
map tasks will quickly read from the original source data and forward the
workload along to the reducers which do the actual heavy lifting. The cost
of this approach is that you have to pay for the time taken to transfer the
data from the mapper nodes to the reducer nodes and sort by key when it gets
there. If you're only working with 600 MB of data, this is probably
negligible. The advantages of doing your computation in the reducers are:

1) You can directly control the number of reducer tasks and set this equal
to the number of cores in your cluster.
2) You can tune your partitioning algorithm such that all reducers get
roughly equal workload assignments, if there appears to be some sort of skew
in the dataset.
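As a sketch of the spray-by-random-key idea above (old mapred API; the class
name is made up and Text records are assumed):

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Near-identity mapper: tags each record with a pseudo-random key so the
// default HashPartitioner spreads records roughly evenly over the reducers,
// which then do the expensive per-record work.
public class RandomSprayMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, IntWritable, Text> {
  private final Random random = new Random();
  private int numReducers = 1;

  public void configure(JobConf job) {
    numReducers = Math.max(1, job.getNumReduceTasks());
  }

  public void map(LongWritable offset, Text record,
                  OutputCollector<IntWritable, Text> out, Reporter reporter)
      throws IOException {
    out.collect(new IntWritable(random.nextInt(numReducers)), record);
  }
}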

The tradeoff is that you have to ship all the data to the reducers before
computation starts, which sacrifices data locality and involves an
"intermediate" data set of the same size as the input data set. If this is
in the range of hundreds of GB or north, then this can be very
time-consuming -- so it doesn't scale terribly well. Of course, by the time
you've got several hundred GB of data to work with, your current workload
imbalance issues should be moot anyway.

- Aaron


On Fri, Nov 27, 2009 at 4:33 PM, CubicDesign  wrote:

>
>
> Aaron Kimball wrote:
>
>> (Note: this is a tasktracker setting, not a job setting. you'll need to
>> set this on every
>> node, then restart the mapreduce cluster to take effect.)
>>
>>
> Ok. And here is my mistake. I set this to 16 only on the main node not also
> on data nodes. Thanks a lot!!
>
>  Of course, you need to have enough RAM to make sure that all these tasks
>> can
>> run concurrently without swapping.
>>
> No problem!
>
>
>  If your individual records require around a minute each to process as you
>> claimed earlier, you're
>> nowhere near in danger of hitting that particular performance bottleneck.
>>
>>
>>
> I was thinking that if I am under the recommended value of 64MB, Hadoop
> cannot properly calculate the number of tasks.
>


Re: Processing 10MB files in Hadoop

2009-11-27 Thread CubicDesign



Aaron Kimball wrote:

(Note: this is a tasktracker setting, not a job setting. you'll need to set 
this on every
node, then restart the mapreduce cluster to take effect.)
  
Ok. And here is my mistake. I set this to 16 only on the main node not 
also on data nodes. Thanks a lot!!

Of course, you need to have enough RAM to make sure that all these tasks can
run concurrently without swapping.

No problem!


If your individual records require around a minute each to process as you 
claimed earlier, you're
nowhere near in danger of hitting that particular performance bottleneck.

  
I was thinking that if I am under the recommended value of 64MB, Hadoop 
cannot properly calculate the number of tasks.


Re: Processing 10MB files in Hadoop

2009-11-27 Thread CubicDesign

30,000 records in 10MB files.
Files can vary and the number of records also can vary.





If the data is 10MB and you have 30k records, and it takes ~2 mins to
process each record, I'd suggest using map to distribute the data across
several reducers then do the actual processing on reduce.
Hmmm... Good idea. Thanks. But is 'Reduce' optimized to do the heavy 
part of the computation?


Re: Processing 10MB files in Hadoop

2009-11-27 Thread Patrick Angeles
What does the data look like?

You mention 30k records, is that for 10MB or for 600MB, or do you have a
constant 30k records with vastly varying file sizes?

If the data is 10MB and you have 30k records, and it takes ~2 mins to
process each record, I'd suggest using map to distribute the data across
several reducers then do the actual processing on reduce.



On Fri, Nov 27, 2009 at 7:07 PM, CubicDesign  wrote:

> Ok. I have set the number of maps to about 1760 (11 nodes * 16 cores/node *
> 10 as recommended by Hadoop documentation) and my job still takes several
> hours to run instead of one.
>
> Can the overhead added by Hadoop be that big? I mean I have over 3
> small tasks (about one minute), each one starting its own JVM.
>
>
>


Re: Processing 10MB files in Hadoop

2009-11-27 Thread CubicDesign
Ok. I have set the number of maps to about 1760 (11 nodes * 16 
cores/node * 10 as recommended by Hadoop documentation) and my job still 
takes several hours to run instead of one.


Can the overhead added by Hadoop be that big? I mean I have over 3 
small tasks (about one minute), each one starting its own JVM.





Re: part-00000.deflate as output

2009-11-27 Thread Mark Kerzner
Thank you, guys, for your very useful answers

Mark

On Fri, Nov 27, 2009 at 12:44 PM, Aaron Kimball  wrote:

> You are always free to run with compression disabled. But in many
> production
> situations, space or performance concerns dictate that all data sets are
> stored compressed, so I think Tim was assuming that you might be operating
> in such an environment -- in which case, you'd only need things to appear
> in
> plaintext if a human operator is inspecting the output for debugging.
>
> - Aaron
>
> On Thu, Nov 26, 2009 at 4:59 PM, Mark Kerzner 
> wrote:
>
> > It worked!
> >
> > But why is it "for testing?" I only have one job, so I need by related as
> > text, can I use this fix all the time?
> >
> > Thank you,
> > Mark
> >
> > On Thu, Nov 26, 2009 at 1:10 AM, Tim Kiefer  wrote:
> >
> > > For testing purposes you can also try to disable the compression:
> > >
> > > conf.setBoolean("mapred.output.compress", false);
> > >
> > > Then you can look at the output.
> > >
> > > - tim
> > >
> > >
> > > Amogh Vasekar wrote:
> > >
> > >> Hi,
> > >> ".deflate" is the default compression codec used when parameter to
> > >> generate compressed output is true ( mapred.output.compress ).
> > >> You may set the codec to be used via mapred.output.compression.codec,
> > some
> > >> commonly used are available in hadoop.io.compress package...
> > >>
> > >> Amogh
> > >>
> > >>
> > >> On 11/26/09 11:03 AM, "Mark Kerzner"  wrote:
> > >>
> > >> Hi,
> > >>
> > >> I get this part-00000.deflate instead of part-00000.
> > >>
> > >> How do I get rid of the deflate option?
> > >>
> > >> Thank you,
> > >> Mark
> > >>
> > >>
> > >>
> > >>
> > >
> >
>


Re: part-00000.deflate as output

2009-11-27 Thread Patrick Angeles
You can always do

hadoop fs -text <file>

This will 'cat' the file for you, and decompress it if necessary.

On Thu, Nov 26, 2009 at 7:59 PM, Mark Kerzner  wrote:

> It worked!
>
> But why is it "for testing?" I only have one job, so I need by related as
> text, can I use this fix all the time?
>
> Thank you,
> Mark
>
> On Thu, Nov 26, 2009 at 1:10 AM, Tim Kiefer  wrote:
>
> > For testing purposes you can also try to disable the compression:
> >
> > conf.setBoolean("mapred.output.compress", false);
> >
> > Then you can look at the output.
> >
> > - tim
> >
> >
> > Amogh Vasekar wrote:
> >
> >> Hi,
> >> ".deflate" is the default compression codec used when parameter to
> >> generate compressed output is true ( mapred.output.compress ).
> >> You may set the codec to be used via mapred.output.compression.codec,
> some
> >> commonly used are available in hadoop.io.compress package...
> >>
> >> Amogh
> >>
> >>
> >> On 11/26/09 11:03 AM, "Mark Kerzner"  wrote:
> >>
> >> Hi,
> >>
> >> I get this part-00000.deflate instead of part-00000.
> >>
> >> How do I get rid of the deflate option?
> >>
> >> Thank you,
> >> Mark
> >>
> >>
> >>
> >>
> >
>


Re: Processing 10MB files in Hadoop

2009-11-27 Thread Aaron Kimball
More importantly: have you told Hadoop to use all your cores?

What is mapred.tasktracker.map.tasks.maximum set to? This defaults to 2. If
you've got 16 cores/node, you should set this to at least 15--16 so that all
your cores are being used. You may need to set this higher, like 20, to
ensure that cores aren't being starved. Measure with ganglia or top to make
sure your CPU utilization is up to where you're satisfied. (Note: this is a
tasktracker setting, not a job setting. you'll need to set this on every
node, then restart the mapreduce cluster to take effect.)

Of course, you need to have enough RAM to make sure that all these tasks can
run concurrently without swapping. Swapping will destroy your performance.
Then again, if you bought 16-way machines, presumably you didn't cheap out
in that department :)

100 tasks is not an absurd number. For large data sets (e.g., TB scale), I
have seen several tens of thousands of tasks.

In general, yes, running many tasks over small files is not a good fit for
Hadoop, but 100 is not "many small files" -- you might see some sort of
speed up by coalescing multiple files into a single task, but when you hear
about problems with processing many small files, folks are frequently referring to
something like 10,000 files where each file is only a few MB, and the actual
processing per record is extremely cheap. In cases like this, task startup
times severely dominate actual computation time. If your individual records
require around a minute each to process as you claimed earlier, you're
nowhere near in danger of hitting that particular performance bottleneck.

- Aaron


On Thu, Nov 26, 2009 at 12:23 PM, CubicDesign  wrote:

>
>
>  Are the record processing steps bound by a local machine resource - cpu,
>> disk io or other?
>>
>>
> Some disk I/O. Not so much compared with the CPU. Basically it is CPU
> bound. This is why each machine has 16 cores.
>
>  What I often do when I have lots of small files to handle is use the
>> NlineInputFormat,
>>
> Each file contains a complete/independent set of records. I cannot mix the
> data resulting from processing two different files.
>
>
> -
> Ok. I think I need to re-explain my problem :)
> While running jobs on these small files, the computation time was almost 5
> times longer than expected. It looks like the job was affected by the number
> of map tasks that I have (100). I don't know what the best parameters are in
> my case (10MB files).
>
> I have zero reduce tasks.
>
>
>


Re: Good idea to run NameNode and JobTracker on same machine?

2009-11-27 Thread Aaron Kimball
The real kicker is going to be memory consumption of one or both of these
services. The NN in particular uses a large amount of RAM to store the
filesystem image. I think that those who are suggesting a breakeven point of
<= 10 nodes are lowballing. In practice, unless your machines are really
thin on the RAM (e.g., 2--4 GB), I haven't seen any cases where these
services need to be separated below the 20-node mark; I've also seen several
clusters of 40 nodes running fine with these services colocated. It depends
on how many files are in HDFS and how frequently you're submitting a lot of
concurrent jobs to MapReduce.

If you're setting up a production environment that you plan to expand,
however, as a best practice you should configure the master node to have two
hostnames (e.g., "nn" and "jt") so that you can have separate hostnames in
fs.default.name and mapred.job.tracker; when the day comes that these
services are placed on different nodes, you'll then be able to just move one
of the hostnames over and not need to reconfigure all 20--40 other nodes.

- Aaron

On Thu, Nov 26, 2009 at 8:27 PM, Srigurunath Chakravarthi <
srig...@yahoo-inc.com> wrote:

> Raymond,
> Load wise, it should be very safe to run both JT and NN on a single node
> for small clusters (< 40 Task Trackers and/or Data Nodes). They don't use
> much CPU as such.
>
>  This may even work for larger clusters depending on the type of hardware
> you have and the Hadoop job mix. We usually observe < 5% CPU load with ~80
> DNs/TTs on an 8-core Intel processor-based box with 16GB RAM.
>
>  It is best that you observe CPU & mem load on the JT+NN node to take a
> call on whether to separate them. iostat, top or sar should tell you.
>
> Regards,
> Sriguru
>
> >-Original Message-
> >From: John Martyniak [mailto:j...@beforedawnsolutions.com]
> >Sent: Friday, November 27, 2009 3:06 AM
> >To: common-user@hadoop.apache.org
> >Cc: 
> >Subject: Re: Good idea to run NameNode and JobTracker on same machine?
> >
> >I have a cluster of 4 machines plus one machine to run nn & jt.  I
> >have heard that 5 or 6 is the magic #.  I will see when I add the next
> >batch of machines.
> >
> >And it seems to running fine.
> >
>-John
> >
> >On Nov 26, 2009, at 11:38 AM, Yongqiang He 
> >wrote:
> >
> >> I think it is definitely not a good idea to combine these two in
> >> production
> >> environment.
> >>
> >> Thanks
> >> Yongqiang
> >> On 11/26/09 6:26 AM, "Raymond Jennings III" 
> >> wrote:
> >>
> >>> Do people normally combine these two processes onto one machine?
> >>> Currently I
> >>> have them on separate machines but I am wondering they use that
> >>> much CPU
> >>> processing time and maybe I should combine them and create another
> >>> DataNode.
> >>>
> >>>
> >>>
> >>>
> >>>
> >>
> >>
>


Re: part-00000.deflate as output

2009-11-27 Thread Aaron Kimball
You are always free to run with compression disabled. But in many production
situations, space or performance concerns dictate that all data sets are
stored compressed, so I think Tim was assuming that you might be operating
in such an environment -- in which case, you'd only need things to appear in
plaintext if a human operator is inspecting the output for debugging.
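For reference, a hedged sketch of the per-job knobs being discussed (old mapred
API; this assumes a JobConf named conf, and the gzip choice is only an example):

// Using org.apache.hadoop.mapred.FileOutputFormat.
// Debugging: turn compression off so the output is plain text.
FileOutputFormat.setCompressOutput(conf, false);

// Production: keep compression on but pick an explicit codec instead of the
// default DefaultCodec, which is what produces the .deflate files.
FileOutputFormat.setCompressOutput(conf, true);
FileOutputFormat.setOutputCompressorClass(conf,
    org.apache.hadoop.io.compress.GzipCodec.class);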

- Aaron

On Thu, Nov 26, 2009 at 4:59 PM, Mark Kerzner  wrote:

> It worked!
>
> But why is it "for testing?" I only have one job, so I need by related as
> text, can I use this fix all the time?
>
> Thank you,
> Mark
>
> On Thu, Nov 26, 2009 at 1:10 AM, Tim Kiefer  wrote:
>
> > For testing purposes you can also try to disable the compression:
> >
> > conf.setBoolean("mapred.output.compress", false);
> >
> > Then you can look at the output.
> >
> > - tim
> >
> >
> > Amogh Vasekar wrote:
> >
> >> Hi,
> >> ".deflate" is the default compression codec used when parameter to
> >> generate compressed output is true ( mapred.output.compress ).
> >> You may set the codec to be used via mapred.output.compression.codec,
> some
> >> commonly used are available in hadoop.io.compress package...
> >>
> >> Amogh
> >>
> >>
> >> On 11/26/09 11:03 AM, "Mark Kerzner"  wrote:
> >>
> >> Hi,
> >>
> >> I get this part-00000.deflate instead of part-00000.
> >>
> >> How do I get rid of the deflate option?
> >>
> >> Thank you,
> >> Mark
> >>
> >>
> >>
> >>
> >
>


Re: Re: Doubt in Hadoop

2009-11-27 Thread Aaron Kimball
When you set up the Job object, do you call job.setJarByClass(Map.class)?
That will tell Hadoop which jar file to ship with the job and to use for
classloading in your code.
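With the old JobConf API that the original code uses, a hedged equivalent sketch
(the names A, Map and Reduce come from the post; everything else is an example)
would be:

JobConf conf = new JobConf(A.class);   // or, equivalently: conf.setJarByClass(A.class);
conf.setMapperClass(Map.class);
conf.setReducerClass(Reduce.class);
// ... input/output formats, paths, key/value classes ...
JobClient.runJob(conf);

Passing A.class makes Hadoop ship the jar containing the test package, so the
tasktrackers can load test.Map and test.Reduce instead of throwing
ClassNotFoundException.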

- Aaron


On Thu, Nov 26, 2009 at 11:56 PM,  wrote:

> Hi,
>   I am running the job from command line. The job runs fine in the local
> mode
> but something happens when I try to run the job in the distributed mode.
>
>
> Abhishek Agrawal
>
> SUNY- Buffalo
> (716-435-7122)
>
> On Fri 11/27/09  2:31 AM , Jeff Zhang zjf...@gmail.com sent:
> > Do you run the map reduce job from the command line or an IDE? In map reduce
> > mode, you should put the jar containing the map and reduce classes in
> > your classpath.
> > Jeff Zhang
> > On Fri, Nov 27, 2009 at 2:19 PM,   wrote:
> > Hello Everybody,
> > I have a doubt in Hadoop and was wondering if
> > anybody has faced a
> > similar problem. I have a package called test. Inside that I have a
> > class called
> > A.java, Map.java, Reduce.java. In A.java I have the main method
> > where I am trying
> > to initialize the jobConf object. I have written
> > jobConf.setMapperClass(Map.class) and similarly for the reduce class
> > as well. The
> > code works correctly when I run the code locally via
> > jobConf.set("mapred.job.tracker","local") but I get an exception
> > when I try to
> > run this code on my cluster. The stack trace of the exception is as
> > under. I
> > cannot understand the problem. Any help would be appreciated.
> > java.lang.RuntimeException: java.lang.RuntimeException:
> > java.lang.ClassNotFoundException: test.Map
> >at
> > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:752)
> >at
> > org.apache.hadoop.mapred.JobConf.getMapperClass(JobConf.java:690)
> >at
> > org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
> >at
> > org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
> >at
> >
> >
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
> >at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
> >at org.apache.hadoop.mapred.Child.main(Child.java:158)
> > Caused by: java.lang.RuntimeException:
> > java.lang.ClassNotFoundException:
> > Markowitz.covarMatrixMap
> >at
> > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:720)
> >at
> > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:744)
> >... 6 more
> > Caused by: java.lang.ClassNotFoundException: test.Map
> >at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
> >at java.security.AccessController.doPrivileged(Native
> > Method)
> >at
> > java.net.URLClassLoader.findClass(URLClassLoader.java:188)
> >at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> >at
> > sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
> >at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
> >at
> > java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
> >at java.lang.Class.forName0(Native Method)
> >at java.lang.Class.forName(Class.java:247)
> >at
> >
> >
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:673)
> >at
> > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:718)
> >... 7 more
> > Thank You
> > Abhishek Agrawal
> > SUNY- Buffalo
> > (716-435-7122)
> >
> >
>
>


Re: RE: please help in setting hadoop

2009-11-27 Thread Aaron Kimball
You've set hadoop.tmp.dir to /home/hadoop/hadoop-${user.name}.

This means that on every node, you're going to need a directory named (e.g.)
/home/hadoop/hadoop-root/, since it seems as though you're running things as
root (in general, not a good policy; but ok if you're on EC2 or something
like that).

mapred.local.dir defaults to ${hadoop.tmp.dir}/mapred/local. You've
confirmed that this exists on the machine named 'master' -- what about on
slave?

Then, what are the permissions of /home/hadoop/ on the slave node? Whichever
user started the Hadoop daemons (probably either 'root' or 'hadoop') will
need the ability to mkdir /home/hadoop/hadoop-root underneath of
/home/hadoop. If that directory doesn't exist, or is chown'd to someone
else, this will probably be the result.

- Aaron


On Thu, Nov 26, 2009 at 10:22 PM,  wrote:

> Hi,
>   There should be a folder called logs in $HADOOP_HOME. Also try going
> through
>
> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
> .
>
>
> This is a pretty good tutorial
>
> Abhishek Agrawal
>
> SUNY- Buffalo
> (716-435-7122)
>
> On Fri 11/27/09  1:18 AM , "Krishna Kumar" krishna.ku...@nechclst.in sent:
> > I have tried, but didn't get any success. Btw, can you please tell me the
> > exact path of the log file which I have to refer to.
> >
> > Thanks and Best Regards,
> >
> > Krishna Kumar
> >
> > Senior Storage Engineer
> >
> > Why do we have to die? If we had to die, and everything is gone after
> > that, then nothing else matters on this earth - everything is temporary,
> > at least relative to me.
> >
> > -Original Message-
> > From: aa...@buffalo.edu [aa...@buffalo.edu]
> > Sent: Friday, November 27, 2009 10:56 AM
> > To: common-user@hadoop.apache.org
> > Subject: Re: please help in setting hadoop
> >
> > Hi,
> >
> > Just a thought, but you do not need to setup the temp directory in
> > conf/hadoop-site.xml, especially if you are running basic examples. Give
> > that a shot, maybe it will work out. Otherwise see if you can find
> > additional info in the LOGS.
> >
> > Thank You
> >
> > Abhishek Agrawal
> >
> > SUNY- Buffalo
> > (716-435-7122)
> >
> > On Fri 11/27/09 12:20 AM , "Krishna Kumar" krishna.ku...@nechclst.in sent:
> > > Dear All,
> > >
> > > Can anybody please help me in getting out of these error messages:
> > >
> > > [ hadoop]# hadoop jar
> > > /usr/lib/hadoop/hadoop-0.18.3-14.cloudera.CH0_3-examples.jar wordcount
> > > test test-op
> > >
> > > 09/11/26 17:15:45 INFO mapred.FileInputFormat: Total input paths to
> > > process : 4
> > > 09/11/26 17:15:45 INFO mapred.FileInputFormat: Total input paths to
> > > process : 4
> > >
> > > org.apache.hadoop.ipc.RemoteException: java.io.IOException: No valid
> > > local directories in property: mapred.local.dir
> > >     at org.apache.hadoop.conf.Configuration.getLocalPath(Configuration.java:730)
> > >     at org.apache.hadoop.mapred.JobConf.getLocalPath(JobConf.java:222)
> > >     at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:194)
> > >     at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:1557)
> > >     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > >     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > >     at java.lang.reflect.Method.invoke(Method.java:585)
> > >     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
> > >     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:890)
> > >
> > > I am running the hadoop cluster as root user on two server nodes: master
> > > and slave. My hadoop-site.xml sets the following properties:
> > >
> > > fs.default.name = hdfs://master:54310
> > > dfs.permissions = false
> > > dfs.name.dir = /home/hadoop/dfs/name
> > >
> > > Further the o/p of the ls command is as follows:
> > >
> > > [ hadoop]# ls -l /home/hadoop/hadoop-root/
> > > total 8
> > > drwxr-xr-x 4 root root 4096 Nov 26 16:48 dfs
> > > drwxr-xr-x 3 root root 4096 Nov 26 16:49 mapred
> > > [ hadoop]#
> > > [ hadoop]# ls -l /home/hadoop/hadoop-root/mapred/
> > > total 4
> > > drwxr-xr-x 2 root root 4096 Nov 26 16:49 local
> > > [ hadoop]#
> > > [ hadoop]# ls -l /home/hadoop/hadoop-root/mapred/local/
> > > total 0
> > >
> > > Thanks and Best Regards,
> > > Krishna Kumar

Re: Hadoop 0.20 map/reduce Failing for old API

2009-11-27 Thread Edward Capriolo
On Fri, Nov 27, 2009 at 10:46 AM, Arv Mistry  wrote:
> Thanks Rekha, I was missing the new library
> (hadoop-0.20.1-hdfs-core.jar) in my client.
>
> It seems to run a little further but I'm now getting a
> ClassCastException returned by the mapper. Note, this worked with the
> 0.19 load, so I'm assuming there's something additional in the
> configuration that I'm missing. Can anyone help?
>
> java.lang.ClassCastException: org.apache.hadoop.mapred.MultiFileSplit
> cannot be cast to org.apache.hadoop.mapred.FileSplit
>        at
> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat
> .java:54)
>        at
> org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
>        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
>        at org.apache.hadoop.mapred.Child.main(Child.java:170)
>
> Cheers Arv
>
> -Original Message-
> From: Rekha Joshi [mailto:rekha...@yahoo-inc.com]
> Sent: November 26, 2009 11:45 PM
> To: common-user@hadoop.apache.org
> Subject: Re: Hadoop 0.20 map/reduce Failing for old API
>
> The exit status of 1 usually indicates configuration issues, incorrect
> command invocation in hadoop 0.20 (incorrect params), if not JVM crash.
> In your logs there is no indication of crash, but some paths/command can
> be the cause. Can you check if your lib paths/data paths are correct?
>
> If it is a memory intensive task, you may also try values on
> mapred.child.java.opts / mapred.job.map.memory.mb. Thanks!
>
> On 11/27/09 1:28 AM, "Arv Mistry"  wrote:
>
> Hi,
>
> We've recently upgraded to hadoop 0.20. Writing to HDFS seems to be
> working fine, but the map/reduce jobs are failing with the following
> exception. Note, we have not moved to the new map/reduce API yet. In the
> client that launches the job, the only change I have made is to now load
> the three files: core-site.xml, hdfs-site.xml and mapred-site.xml rather
> than the hadoop-site.xml. Any ideas?
>
> INFO   | jvm 1    | 2009/11/26 13:47:26 | 2009-11-26 13:47:26,328 INFO
> [FileInputFormat] Total input paths to process : 711
> INFO   | jvm 1    | 2009/11/26 13:47:28 | 2009-11-26 13:47:28,033 INFO
> [JobClient] Running job: job_200911241319_0003
> INFO   | jvm 1    | 2009/11/26 13:47:29 | 2009-11-26 13:47:29,036 INFO
> [JobClient]  map 0% reduce 0%
> INFO   | jvm 1    | 2009/11/26 13:47:36 | 2009-11-26 13:47:36,068 INFO
> [JobClient] Task Id : attempt_200911241319_0003_m_03_0, Status :
> FAILED
> INFO   | jvm 1    | 2009/11/26 13:47:36 | java.io.IOException: Task
> process exit with nonzero status of 1.
> INFO   | jvm 1    | 2009/11/26 13:47:36 |       at
> org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
> INFO   | jvm 1    | 2009/11/26 13:47:36 |
> INFO   | jvm 1    | 2009/11/26 13:47:36 | 2009-11-26 13:47:36,094 WARN
> [JobClient] Error reading task
> outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
> d=attempt_200911241319_0003_m_03_0&filter=stdout
> INFO   | jvm 1    | 2009/11/26 13:47:36 | 2009-11-26 13:47:36,096 WARN
> [JobClient] Error reading task
> outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
> d=attempt_200911241319_0003_m_03_0&filter=stderr
> INFO   | jvm 1    | 2009/11/26 13:47:51 | 2009-11-26 13:47:51,162 INFO
> [JobClient] Task Id : attempt_200911241319_0003_m_00_0, Status :
> FAILED
> INFO   | jvm 1    | 2009/11/26 13:47:51 | java.io.IOException: Task
> process exit with nonzero status of 1.
> INFO   | jvm 1    | 2009/11/26 13:47:51 |       at
> org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
> INFO   | jvm 1    | 2009/11/26 13:47:51 |
> INFO   | jvm 1    | 2009/11/26 13:47:51 | 2009-11-26 13:47:51,166 WARN
> [JobClient] Error reading task
> outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
> d=attempt_200911241319_0003_m_00_0&filter=stdout
> INFO   | jvm 1    | 2009/11/26 13:47:51 | 2009-11-26 13:47:51,167 WARN
> [JobClient] Error reading task
> outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
> d=attempt_200911241319_0003_m_00_0&filter=stderr
> INFO   | jvm 1    | 2009/11/26 13:47:52 | 2009-11-26 13:47:52,173 INFO
> [JobClient]  map 50% reduce 0%
> INFO   | jvm 1    | 2009/11/26 13:48:03 | 2009-11-26 13:48:03,219 INFO
> [JobClient] Task Id : attempt_200911241319_0003_m_01_0, Status :
> FAILED
> INFO   | jvm 1    | 2009/11/26 13:48:03 | Map output lost, rescheduling:
> getMapOutput(attempt_200911241319_0003_m_01_0,0) failed :
> INFO   | jvm 1    | 2009/11/26 13:48:03 |
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200911241319_0003/attempt_200911241319_0003_m_0
> 1_0/output/file.out.index in any of the configured local directories
> INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
> org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathT
> oRead(LocalDirAllocator.java:389)
> INFO   | jvm 1    | 2009/11/26 13:48:03 |       at
> org.apache.hadoop.fs.LocalDirA

RE: Hadoop 0.20 map/reduce Failing for old API

2009-11-27 Thread Arv Mistry
Thanks Rekha, I was missing the new library
(hadoop-0.20.1-hdfs-core.jar) in my client.

It seems to run a little further but I'm now getting a
ClassCastException returned by the mapper. Note, this worked with the
0.19 load, so I'm assuming there's something additional in the
configuration that I'm missing. Can anyone help?

java.lang.ClassCastException: org.apache.hadoop.mapred.MultiFileSplit
cannot be cast to org.apache.hadoop.mapred.FileSplit
at
org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat
.java:54)
at
org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:338)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

Cheers Arv

-Original Message-
From: Rekha Joshi [mailto:rekha...@yahoo-inc.com] 
Sent: November 26, 2009 11:45 PM
To: common-user@hadoop.apache.org
Subject: Re: Hadoop 0.20 map/reduce Failing for old API

The exit status of 1 usually indicates configuration issues, incorrect
command invocation in hadoop 0.20 (incorrect params), if not JVM crash.
In your logs there is no indication of crash, but some paths/command can
be the cause. Can you check if your lib paths/data paths are correct?

If it is a memory intensive task, you may also try values on
mapred.child.java.opts / mapred.job.map.memory.mb. Thanks!

On 11/27/09 1:28 AM, "Arv Mistry"  wrote:

Hi,

We've recently upgraded to hadoop 0.20. Writing to HDFS seems to be
working fine, but the map/reduce jobs are failing with the following
exception. Note, we have not moved to the new map/reduce API yet. In the
client that launches the job, the only change I have made is to now load
the three files: core-site.xml, hdfs-site.xml and mapred-site.xml rather
than the hadoop-site.xml. Any ideas?
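For reference, a minimal sketch of loading the split configuration files
explicitly on a 0.20 client (the paths are examples only; having the files on
the classpath works as well):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

Configuration conf = new Configuration();
// hadoop-site.xml was split into these three files in 0.20.
conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));
conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml"));
JobConf jobConf = new JobConf(conf);
// ... set up and submit the job with jobConf as usual ...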

INFO   | jvm 1| 2009/11/26 13:47:26 | 2009-11-26 13:47:26,328 INFO
[FileInputFormat] Total input paths to process : 711
INFO   | jvm 1| 2009/11/26 13:47:28 | 2009-11-26 13:47:28,033 INFO
[JobClient] Running job: job_200911241319_0003
INFO   | jvm 1| 2009/11/26 13:47:29 | 2009-11-26 13:47:29,036 INFO
[JobClient]  map 0% reduce 0%
INFO   | jvm 1| 2009/11/26 13:47:36 | 2009-11-26 13:47:36,068 INFO
[JobClient] Task Id : attempt_200911241319_0003_m_03_0, Status :
FAILED
INFO   | jvm 1| 2009/11/26 13:47:36 | java.io.IOException: Task
process exit with nonzero status of 1.
INFO   | jvm 1| 2009/11/26 13:47:36 |   at
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
INFO   | jvm 1| 2009/11/26 13:47:36 |
INFO   | jvm 1| 2009/11/26 13:47:36 | 2009-11-26 13:47:36,094 WARN
[JobClient] Error reading task
outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
d=attempt_200911241319_0003_m_03_0&filter=stdout
INFO   | jvm 1| 2009/11/26 13:47:36 | 2009-11-26 13:47:36,096 WARN
[JobClient] Error reading task
outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
d=attempt_200911241319_0003_m_03_0&filter=stderr
INFO   | jvm 1| 2009/11/26 13:47:51 | 2009-11-26 13:47:51,162 INFO
[JobClient] Task Id : attempt_200911241319_0003_m_00_0, Status :
FAILED
INFO   | jvm 1| 2009/11/26 13:47:51 | java.io.IOException: Task
process exit with nonzero status of 1.
INFO   | jvm 1| 2009/11/26 13:47:51 |   at
org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
INFO   | jvm 1| 2009/11/26 13:47:51 |
INFO   | jvm 1| 2009/11/26 13:47:51 | 2009-11-26 13:47:51,166 WARN
[JobClient] Error reading task
outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
d=attempt_200911241319_0003_m_00_0&filter=stdout
INFO   | jvm 1| 2009/11/26 13:47:51 | 2009-11-26 13:47:51,167 WARN
[JobClient] Error reading task
outputhttp://dev-cs1.ca.kindsight.net:50060/tasklog?plaintext=true&taski
d=attempt_200911241319_0003_m_00_0&filter=stderr
INFO   | jvm 1| 2009/11/26 13:47:52 | 2009-11-26 13:47:52,173 INFO
[JobClient]  map 50% reduce 0%
INFO   | jvm 1| 2009/11/26 13:48:03 | 2009-11-26 13:48:03,219 INFO
[JobClient] Task Id : attempt_200911241319_0003_m_01_0, Status :
FAILED
INFO   | jvm 1| 2009/11/26 13:48:03 | Map output lost, rescheduling:
getMapOutput(attempt_200911241319_0003_m_01_0,0) failed :
INFO   | jvm 1| 2009/11/26 13:48:03 |
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
taskTracker/jobcache/job_200911241319_0003/attempt_200911241319_0003_m_0
1_0/output/file.out.index in any of the configured local directories
INFO   | jvm 1| 2009/11/26 13:48:03 |   at
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathT
oRead(LocalDirAllocator.java:389)
INFO   | jvm 1| 2009/11/26 13:48:03 |   at
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAlloca
tor.java:138)
INFO   | jvm 1| 2009/11/26 13:48:03 |   at
org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.
java:2886)
INFO   | jvm 1| 2009/11/26 13:48:03 |   

Re: AW: KeyValueTextInputFormat and Hadoop 0.20.1

2009-11-27 Thread Rekha Joshi
https://issues.apache.org/jira/browse/MAPREDUCE-655 fixed in version 0.21.0
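Until then, one workaround sketch with the new API is to keep the default
TextInputFormat and split each line on the first tab inside the mapper,
mimicking KeyValueTextInputFormat (the class name below is made up):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Everything before the first tab becomes the key, the rest the value;
// lines without a tab become a key with an empty value.
public class KeyValueLineMapper extends Mapper<LongWritable, Text, Text, Text> {
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String s = line.toString();
    int tab = s.indexOf('\t');
    if (tab == -1) {
      context.write(new Text(s), new Text(""));
    } else {
      context.write(new Text(s.substring(0, tab)), new Text(s.substring(tab + 1)));
    }
  }
}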

On 11/26/09 9:43 PM, "Matthias Scherer"  wrote:

Sorry, but I can't find it in the version control system for release 0.20.1: 
http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.1/src/mapred/org/apache/hadoop/mapreduce/lib/input/

Do you have another distribution?

Regards,
Matthias


> -Original Message-
> From: Jeff Zhang [mailto:zjf...@gmail.com]
> Sent: Thursday, November 26, 2009 16:35
> To: common-user@hadoop.apache.org
> Subject: Re: KeyValueTextInputFormat and Hadoop 0.20.1
>
> There's a KeyValueInputFormat under package
> org.apache.hadoop.mapreduce.lib.input
> which is for hadoop new API
>
>
> Jeff Zhang
>
>
> On Thu, Nov 26, 2009 at 7:10 AM, Matthias Scherer
>  > wrote:
>
> > Hi,
> >
> > I started my first experimental Hadoop project with Hadoop 0.20.1 and
> > ran into the following problem:
> >
> > Job job = new Job(new Configuration(),"Myjob");
> > job.setInputFormatClass(KeyValueTextInputFormat.class);
> >
> > The last line throws the following error: "The method
> > setInputFormatClass(Class<? extends InputFormat>) in the type Job is
> > not applicable for the arguments (Class<KeyValueTextInputFormat>)"
> >
> > Job.setInputFormatClass expects a subclass of the new class
> > org.apache.hadoop.mapreduce.InputFormat. But KeyValueTextInputFormat
> > is only available as a subclass of the deprecated
> > org.apache.hadoop.mapred.FileInputFormat.
> >
> > Is there a way to use KeyValueTextInputFormat with the new classes Job
> > and Configuration?
> >
> > Thanks,
> > Matthias
> >
>