Re: combine two map tasks

2009-06-28 Thread jason hadoop
The ChainMapper class introduced in Hadoop 0.19 gives you the ability to run
an arbitrary number of mappers one after the other, in the context of a
single job.
The one issue to be aware of is that each mapper in the chain only sees the
output of the previous mapper in the chain.
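A rough sketch of the driver wiring, assuming the 0.19 org.apache.hadoop.mapred
API; ChainDriver, MapA and MapB are hypothetical class names, not something
from your job:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.ChainMapper;

public class ChainDriver {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(ChainDriver.class);
    job.setJobName("chained maps");
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // The first mapper reads the job input; its output key/value types must
    // match the second mapper's input types.
    ChainMapper.addMapper(job, MapA.class,
        LongWritable.class, Text.class,      // MapA input key/value
        Text.class, Text.class,              // MapA output key/value
        true, new JobConf(false));

    // The second mapper never sees the original input split, only MapA's output.
    ChainMapper.addMapper(job, MapB.class,
        Text.class, Text.class,
        Text.class, Text.class,
        true, new JobConf(false));

    job.setNumReduceTasks(0);                // map-only; ChainReducer can add a reduce stage
    JobClient.runJob(job);
  }
}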

There is a nice discussion of this in chapter 8 of Pro Hadoop, from Apress.

On Sun, Jun 28, 2009 at 5:04 AM, bharath vissapragada 
bharathvissapragada1...@gmail.com wrote:

 See this .. hope this answers your question .

 http://developer.yahoo.com/hadoop/tutorial/module4.html#tips

 On Sun, Jun 28, 2009 at 5:28 PM, bonito bonito.pe...@gmail.com wrote:

 
  Hello!
  I am a new hadoop user and my question may sound naive..
  However, I would like to ask if there is a way to combine the results of
  two
  map tasks that may run simultaneously.
  I use the MultipleInput class and thus I have two different mappers.
  I want the result/output of the one map (associated with one input file)
 to
  be used in the process of the second map (associated with the second
 input
  file).
  I have thought of storing the map1 output in the hdfs and retrieving it
  using the map2.
  However, I have no clue whether this is possible. I mean...what about
  time-executing issues? map2 has to wait until map1 is completed...
 
  The thought of executing them in a serial manner is not the one I really
  want...
 
  Any suggestion would be appreciated.
  Thank you in advance :)
 
  --
  View this message in context:
  http://www.nabble.com/combine-two-map-tasks-tp24240928p24240928.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: Scaling out/up or a mix

2009-06-27 Thread jason hadoop
How about multi-threaded mappers?
Multi-threaded mappers are ideal for map tasks that are I/O bound on
non-local resources with many distinct endpoints (a crawler is the classic
case).
You can also control the thread count on a per-job basis.
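A minimal sketch of the job setup, assuming the old org.apache.hadoop.mapred
API and a hypothetical, thread-safe FetchMapper whose map() mostly waits on
remote I/O (the thread-count property name is what I recall for 0.19/0.20):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

    JobConf conf = new JobConf(CrawlDriver.class);     // CrawlDriver is a placeholder
    conf.setMapperClass(FetchMapper.class);            // must be safe to call from many threads
    // Run map() in a pool of threads inside each task JVM instead of one thread.
    conf.setMapRunnerClass(MultithreadedMapRunner.class);
    // Per-job thread count.
    conf.setInt("mapred.map.multithreadedrunner.threads", 10);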

On Sat, Jun 27, 2009 at 8:26 AM, Marcus Herou marcus.he...@tailsweep.comwrote:

 The argument currently against increasing num-mappers is that the machines
 will get into oom and since a lot of the jobs are crawlers I need more
 ip-numbers so I don't get banned :)

 Thing is that we currently have solr on the very same machines and
 data-nodes as well so I can only give the MR nodes about 1G memory since I
 need SOLR to have 4G...

 Now I see that I should get some obvious and just critique about the
 layout
 of this arch but I'm a little limited in budget and so is then the arch :)

 However is it wise to have the MR tasks on the same nodes as the data-nodes
 or should I split the arch ? I mean the data-nodes perhaps need more
 disk-IO
 and the MR more memory and CPU ?

 Trying to find a sweetspot hardware spec of those two roles.

 //Marcus



 On Sat, Jun 27, 2009 at 4:24 AM, Brian Bockelman bbock...@cse.unl.edu
 wrote:

  Hey Marcus,
 
  Are you recording the data rates coming out of HDFS?  Since you have such
 a
  low CPU utilizations, I'd look at boxes utterly packed with big hard
 drives
  (also, why are you using RAID1 for Hadoop??).
 
  You can get 1U boxes with 4 drive bays or 2U boxes with 12 drive bays.
   Based on the data rates you see, make the call.
 
  On the other hand, what's the argument against running 3x more mappers
 per
  box?  It seems that your boxes still have more overhead to use -- there's
 no
  I/O wait.
 
  Brian
 
 
  On Jun 26, 2009, at 4:43 PM, Marcus Herou wrote:
 
   Hi.
 
  We have a deployment of 10 hadoop servers and I now need more mapping
  capability (no not just add more mappers per instance) since I have so
  many
  jobs running. Now I am wondering what I should aim on...
  Memory, cpu or disk... How long is a rope perhaps you would say ?
 
  A typical server is currently using about 15-20% cpu today on a
 quad-core
  2.4Ghz 8GB RAM machine with 2 RAID1 SATA 500GB disks.
 
  Some specs below.
 
  mpstat 2 5
 
  Linux 2.6.24-19-server (mapreduce2) 06/26/2009
 
   11:36:13 PM  CPU   %user   %nice    %sys  %iowait    %irq   %soft  %steal   %idle    intr/s
   11:36:15 PM  all   22.82    0.00    3.24     1.37    0.62    2.49    0.00   69.45   8572.50
   11:36:17 PM  all   13.56    0.00    1.74     1.99    0.62    2.61    0.00   79.48   8075.50
   11:36:19 PM  all   14.32    0.00    2.24     1.12    1.12    2.24    0.00   78.95   9219.00
   11:36:21 PM  all   14.71    0.00    0.87     1.62    0.25    1.75    0.00   80.80   8489.50
   11:36:23 PM  all   12.69    0.00    0.87     1.24    0.50    0.75    0.00   83.96   5495.00
   Average:     all   15.62    0.00    1.79     1.47    0.62    1.97    0.00   78.53   7970.30
 
  What I am thinking is... Is it wiser to go for many of these cheap boxes
  with 8GB of RAM or should I for instance focus on machines which can
 give
  more I|O throughput ?
 
  I know that these things are hard, but perhaps someone has drawn some
  conclusions before, the pragmatic way.
 
  Kindly
 
  //Marcus
 
 
  --
  Marcus Herou CTO and co-founder Tailsweep AB
  +46702561312
  marcus.he...@tailsweep.com
  http://www.tailsweep.com/
 
 
 


 --
 Marcus Herou CTO and co-founder Tailsweep AB
 +46702561312
 marcus.he...@tailsweep.com
 http://www.tailsweep.com/




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: Are .bz2 extensions supported in Hadoop 18.3

2009-06-24 Thread jason hadoop
I believe the Cloudera 18.3 distribution supports bzip2.

On Wed, Jun 24, 2009 at 3:45 AM, Usman Waheed usm...@opera.com wrote:

 Hi All,

 Can I map/reduce logs that have the .bz2 extension in Hadoop 18.3?
 I tried but interestingly the output was not what i expected versus what i
 got when my data was in uncompressed format.

 Thanks,
 Usman




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: CompositeInputFormat scalbility

2009-06-24 Thread jason hadoop
The join package does a streaming merge sort across the part-X files in your
input directories:
part-0000 from each directory will be handled by a single task,
part-0001 from each directory will be handled by a single task,
and so on.
These jobs are essentially I/O bound, and hard to beat for performance.
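For reference, a rough driver sketch with the 0.19 org.apache.hadoop.mapred.join
package; the paths and class names below are placeholders, not taken from your
job:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

    JobConf conf = new JobConf(JoinDriver.class);      // JoinDriver is a placeholder
    conf.setInputFormat(CompositeInputFormat.class);
    // "inner" emits only keys present in every source; "outer" keeps all keys.
    conf.set("mapred.join.expr",
        CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class,
            new Path("/data/fileA"), new Path("/data/fileB")));
    // The map receives (key, TupleWritable) pairs; duplicate keys on both sides
    // show up as one tuple per cross-product combination.
    conf.setMapperClass(JoinMapper.class);             // hypothetical mapper over TupleWritable values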

On Wed, Jun 24, 2009 at 2:09 PM, pmg parmod.me...@gmail.com wrote:


 I have two files FileA (with 600K records) and FileB (With 2million
 records)

 FileA has a key which is same of all the records

 123    724101722493
 123    781676672721

 FileB has the same key as FileA

 123    5026328101569
 123    5026328001562

 Using hadoop join package I can create output file with tuples and cross
 product of FileA and FileB.

 123[724101722493,5026328101569]
 123[724101722493,5026328001562]
 123[781676672721,5026328101569]
 123[781676672721,5026328001562]

 How does CompositeInputFormat scale when we want to join 600K with 2
 millions records. Does it run on the node with single map/reduce?

 Also how can I not write the result into a file instead input split the
 result into different nodes where I can compare the tuples e.g. comparing
 724101722493 with 5026328101569 using some heuristics.

 thanks
 --
 View this message in context:
 http://www.nabble.com/CompositeInputFormat-scalbility-tp24192957p24192957.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: CompositeInputFormat scalbility

2009-06-24 Thread jason hadoop
The input split size is Long.MAX_VALUE.
In actual fact, the contents of each directory are sorted separately,
the number of directory entries in each directory has to be identical,
and all files at index position I, where I varies from 0 to the number of
files in a directory, become the input to one task.

My book goes into this in some detail, with examples.

Without patches, map-side join can only handle 32 directories.

On Wed, Jun 24, 2009 at 9:09 PM, pmg parmod.me...@gmail.com wrote:


 And what decides part-0000, part-0001? Input split, block size?

 So for example for 1G of data on HDFS with 64m block size get 16 blocks
 mapped to different map tasks?



 jason hadoop wrote:
 
  The join package does a streaming merge sort between each part-X in your
  input directories,
  part- will be handled a single task,
  part-0001 will be handled in a single task
  and so on
  These jobs are essentially io bound, and hard to beat for performance.
 
  On Wed, Jun 24, 2009 at 2:09 PM, pmg parmod.me...@gmail.com wrote:
 
 
  I have two files FileA (with 600K records) and FileB (With 2million
  records)
 
  FileA has a key which is same of all the records
 
  123724101722493
  123781676672721
 
  FileB has the same key as FileA
 
  1235026328101569
  1235026328001562
 
  Using hadoop join package I can create output file with tuples and cross
  product of FileA and FileB.
 
  123[724101722493,5026328101569]
  123[724101722493,5026328001562]
  123[781676672721,5026328101569]
  123[781676672721,5026328001562]
 
  How does CompositeInputFormat scale when we want to join 600K with 2
  millions records. Does it run on the node with single map/reduce?
 
  Also how can I not write the result into a file instead input split the
  result into different nodes where I can compare the tuples e.g.
 comparing
  724101722493 with 5026328101569 using some heuristics.
 
  thanks
  --
  View this message in context:
 
 http://www.nabble.com/CompositeInputFormat-scalbility-tp24192957p24192957.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 
 
 
  --
  Pro Hadoop, a book to guide you from beginner to hadoop mastery,
  http://www.amazon.com/dp/1430219424?tag=jewlerymall
  www.prohadoopbook.com a community for Hadoop Professionals
 
 

 --
 View this message in context:
 http://www.nabble.com/CompositeInputFormat-scalbility-tp24192957p24196664.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: Does balancer ensure a file's replication is satisfied?

2009-06-23 Thread jason hadoop
The namenode is constantly receiving reports about which datanode has which
blocks, and it performs re-replication when a block becomes under-replicated.

On Tue, Jun 23, 2009 at 6:18 PM, Stuart White stuart.whi...@gmail.comwrote:

 In my Hadoop cluster, I've had several drives fail lately (and they've
 been replaced).  Each time a new empty drive is placed in the cluster,
 I run the balancer.

 I understand that the balancer will redistribute the load of file
 blocks across the nodes.

 My question is: will balancer also look at the desired replication of
 a file, and if the actual replication of a file is less than the
 desired (because the file had blocks stored on the lost drive), will
 balancer re-replicate those lost blocks?

 If not, is there another tool that will ensure the desired replication
 factor of files is satisfied?

 If this functionality doesn't exist, I'm concerned that I'm slowly,
 silently losing my files as I replace drives, and I may not even
 realize it.

 Thoughts?




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: Determining input record directory using Streaming...

2009-06-23 Thread jason hadoop
I happened to have a copy of 18.1 lying about, and the JobConf is added to
the per process runtime environment in 18.1.

The entire configuration from the JobConf object is added to the
environment, with the jobconf key names being transformed slightly: any
character in the key name that is not one of [a-zA-Z0-9] is replaced by an
'_' character.
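A quick illustration of that name mangling (just a sketch of the rule above;
it will print null when run outside a task process):

    // "mapred.input.file" shows up in the task environment as "mapred_input_file".
    String key = "mapred.input.file";
    String envName = key.replaceAll("[^a-zA-Z0-9]", "_");
    String inputFile = System.getenv(envName);
    System.out.println(envName + " = " + inputFile);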


On Tue, Jun 23, 2009 at 10:29 AM, Bo Shi b...@deadpuck.net wrote:

 Jason, do you know offhand when this feature was introduced?  .18.x?

 Thanks,
 Bo

 On Mon, Jun 22, 2009 at 10:58 PM, jason hadoopjason.had...@gmail.com
 wrote:
  Check the process environment for your streaming tasks, generally the
  configuration variables are exported into the process environment.
 
  The Mapper input file is normally stored as some variant of
  mapred.input.file. The reducer's input is the mapper output for that
 reduce,
  so the input file is not relevant.
 
  On Mon, Jun 22, 2009 at 7:21 PM, C G parallel...@yahoo.com wrote:
 
  Hi All:
  Is there any way using Hadoop Streaming to determining the directory
 from
  which an input record is being read?  This is straightforward in Hadoop
  using InputFormats, but I am curious if the same concept can be applied
 to
  streaming.
  The goal here is to read in data from 2 directories, say A/ and B/, and
  make decisions about what to do based on where the data is rooted.
  Thanks for any help...CG
 
 
 
 
 
 
 
 
  --
  Pro Hadoop, a book to guide you from beginner to hadoop mastery,
  http://www.amazon.com/dp/1430219424?tag=jewlerymall
  www.prohadoopbook.com a community for Hadoop Professionals
 




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: Strange Exeception

2009-06-22 Thread jason hadoop
The directory specified by the configuration parameter mapred.system.dir,
defaulting to /tmp/hadoop/mapred/system, doesn't exist.

Most likely your tmp cleaner task has removed it, and I am guessing it is
only created at cluster start time.

On Mon, Jun 22, 2009 at 6:19 PM, akhil1988 akhilan...@gmail.com wrote:


 Hi All!

 I have been running Hadoop jobs through my user account on a cluster, for a
 while now. But now I am getting this strange exception when I try to
 execute
 a job. If anyone knows, please let me know why this is happening.

 [akhil1...@altocumulus WordCount]$ hadoop jar wordcount_classes_dir.jar
 org.uiuc.upcrc.extClasses.WordCount /home/akhil1988/input
 /home/akhil1988/output
 JO
 09/06/22 19:19:01 WARN mapred.JobClient: Use GenericOptionsParser for
 parsing the arguments. Applications should implement Tool for the same.
 org.apache.hadoop.ipc.RemoteException: java.io.FileNotFoundException:
 /hadoop/tmp/hadoop/mapred/local/jobTracker/job_200906111015_0167.xml
 (Read-only file system)
at java.io.FileOutputStream.open(Native Method)
at java.io.FileOutputStream.init(FileOutputStream.java:179)
at

 org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.init(RawLocalFileSystem.java:187)
at

 org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.init(RawLocalFileSystem.java:183)
at
 org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:241)
at

 org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.init(ChecksumFileSystem.java:327)
at
 org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:360)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:208)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:142)
at
 org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1214)
at
 org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1195)
at
 org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:212)
at
 org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:2230)
at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source)
at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)

at org.apache.hadoop.ipc.Client.call(Client.java:696)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at org.apache.hadoop.mapred.$Proxy1.submitJob(Unknown Source)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:828)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1127)
at org.uiuc.upcrc.extClasses.WordCount.main(WordCount.java:70)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)


 Thanks,
 Akhil

 --
 View this message in context:
 http://www.nabble.com/Strange-Exeception-tp24158395p24158395.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: When is configure and close run

2009-06-22 Thread jason hadoop
configure and close are run once for each task, for both mappers and reducers.
configure and close are NOT run on the combiner class.
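In the old org.apache.hadoop.mapred API the hooks look like this (a sketch;
the class name and type parameters are placeholders):

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class MyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  public void configure(JobConf job) {
    // Called once per task attempt, before the first map() call of that task,
    // even when mapred.job.reuse.jvm.num.tasks lets one JVM run several tasks.
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Called for every record in this task's input split.
  }

  public void close() throws IOException {
    // Called once per task attempt, after the last map() call of that task.
  }
}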

On Mon, Jun 22, 2009 at 9:23 AM, Saptarshi Guha saptarshi.g...@gmail.comwrote:

 Hello,
 In a mapreduce job, a given map JVM will run N map tasks. Are the
 configure and close methods executed for every one of these N tasks?
 Or is configure executed once when the JVM starts and the close method
 executed once when all N have been completed?

 I have the same question for the reduce task. Will it be run before
 for every reduce task? And close is run when all the values for a
 given key have been processed?

 We can assume there isn't a combiner.

 Regards
 Saptarshi




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: Determining input record directory using Streaming...

2009-06-22 Thread jason hadoop
Check the process environment for your streaming tasks; generally the
configuration variables are exported into the process environment.

The Mapper input file is normally stored as some variant of
mapred.input.file. The reducer's input is the mapper output for that reduce,
so the input file is not relevant.

On Mon, Jun 22, 2009 at 7:21 PM, C G parallel...@yahoo.com wrote:

 Hi All:
 Is there any way using Hadoop Streaming to determining the directory from
 which an input record is being read?  This is straightforward in Hadoop
 using InputFormats, but I am curious if the same concept can be applied to
 streaming.
 The goal here is to read in data from 2 directories, say A/ and B/, and
 make decisions about what to do based on where the data is rooted.
 Thanks for any help...CG








-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: Too many open files error, which gets resolved after some time

2009-06-21 Thread jason hadoop
HDFS/DFS client uses quite a few file descriptors for each open file.

Many application developers (but not the hadoop core) rely on the JVM
finalizer methods to close open files.

This combination, especially when many HDFS files are open, can result in
very large demands for file descriptors for Hadoop clients.
As a general rule we never run a cluster with nofile less than 64k, and for
larger clusters with demanding applications we have had it set 10x higher. I
also believe there was a set of JVM versions that leaked file descriptors
used for NIO in the HDFS core; I do not recall the exact details.
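The safe pattern is to close every HDFS stream explicitly instead of waiting
for a finalizer, e.g. (the path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream in = fs.open(new Path("/some/hdfs/file"));
    try {
      // ... read from the stream ...
    } finally {
      in.close();   // frees the descriptors and sockets now, not at GC time
    }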

On Sun, Jun 21, 2009 at 5:27 AM, Stas Oskin stas.os...@gmail.com wrote:

 Hi.

 After tracing some more with the lsof utility, and I managed to stop the
 growth on the DataNode process, but still have issues with my DFS client.

 It seems that my DFS client opens hundreds of pipes and eventpolls. Here is
 a small part of the lsof output:

 java10508 root  387w  FIFO0,6   6142565 pipe
 java10508 root  388r  FIFO0,6   6142565 pipe
 java10508 root  389u     0,100  6142566
 eventpoll
 java10508 root  390u  FIFO0,6   6135311 pipe
 java10508 root  391r  FIFO0,6   6135311 pipe
 java10508 root  392u     0,100  6135312
 eventpoll
 java10508 root  393r  FIFO0,6   6148234 pipe
 java10508 root  394w  FIFO0,6   6142570 pipe
 java10508 root  395r  FIFO0,6   6135857 pipe
 java10508 root  396r  FIFO0,6   6142570 pipe
 java10508 root  397r     0,100  6142571
 eventpoll
 java10508 root  398u  FIFO0,6   6135319 pipe
 java10508 root  399w  FIFO0,6   6135319 pipe

 I'm using FSDataInputStream and FSDataOutputStream, so this might be
 related
 to pipes?

 So, my questions are:

 1) What happens these pipes/epolls to appear?

 2) More important, how I can prevent their accumation and growth?

 Thanks in advance!

 2009/6/21 Stas Oskin stas.os...@gmail.com

  Hi.
 
  I have HDFS client and HDFS datanode running on same machine.
 
  When I'm trying to access a dozen of files at once from the client,
 several
  times in a row, I'm starting to receive the following errors on client,
 and
  HDFS browse function.
 
  HDFS Client: Could not get block locations. Aborting...
  HDFS browse: Too many open files
 
  I can increase the maximum number of files that can opened, as I have it
  set to the default 1024, but would like to first solve the problem, as
  larger value just means it would run out of files again later on.
 
  So my questions are:
 
  1) Does the HDFS datanode keeps any files opened, even after the HDFS
  client have already closed them?
 
  2) Is it possible to find out, who keeps the opened files - datanode or
  client (so I could pin-point the source of the problem).
 
  Thanks in advance!
 




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: Too many open files error, which gets resolved after some time

2009-06-21 Thread jason hadoop
Just to be clear, I second Brian's opinion. Relying on finalizers is a very
good way to run out of file descriptors.

On Sun, Jun 21, 2009 at 9:32 AM, brian.lev...@nokia.com wrote:

 IMHO, you should never rely on finalizers to release scarce resources since
 you don't know when the finalizer will get called, if ever.

 -brian



 -Original Message-
 From: ext jason hadoop [mailto:jason.had...@gmail.com]
 Sent: Sunday, June 21, 2009 11:19 AM
 To: core-user@hadoop.apache.org
 Subject: Re: Too many open files error, which gets resolved after some
 time

 HDFS/DFS client uses quite a few file descriptors for each open file.

 Many application developers (but not the hadoop core) rely on the JVM
 finalizer methods to close open files.

 This combination, expecially when many HDFS files are open can result in
 very large demands for file descriptors for Hadoop clients.
 We as a general rule never run a cluster with nofile less that 64k, and for
 larger clusters with demanding applications have had it set 10x higher. I
 also believe there was a set of JVM versions that leaked file descriptors
 used for NIO in the HDFS core. I do not recall the exact details.

 On Sun, Jun 21, 2009 at 5:27 AM, Stas Oskin stas.os...@gmail.com wrote:

  Hi.
 
  After tracing some more with the lsof utility, and I managed to stop the
  growth on the DataNode process, but still have issues with my DFS client.
 
  It seems that my DFS client opens hundreds of pipes and eventpolls. Here
 is
  a small part of the lsof output:
 
  java10508 root  387w  FIFO0,6   6142565 pipe
  java10508 root  388r  FIFO0,6   6142565 pipe
  java10508 root  389u     0,100  6142566
  eventpoll
  java10508 root  390u  FIFO0,6   6135311 pipe
  java10508 root  391r  FIFO0,6   6135311 pipe
  java10508 root  392u     0,100  6135312
  eventpoll
  java10508 root  393r  FIFO0,6   6148234 pipe
  java10508 root  394w  FIFO0,6   6142570 pipe
  java10508 root  395r  FIFO0,6   6135857 pipe
  java10508 root  396r  FIFO0,6   6142570 pipe
  java10508 root  397r     0,100  6142571
  eventpoll
  java10508 root  398u  FIFO0,6   6135319 pipe
  java10508 root  399w  FIFO0,6   6135319 pipe
 
  I'm using FSDataInputStream and FSDataOutputStream, so this might be
  related
  to pipes?
 
  So, my questions are:
 
  1) What happens these pipes/epolls to appear?
 
  2) More important, how I can prevent their accumation and growth?
 
  Thanks in advance!
 
  2009/6/21 Stas Oskin stas.os...@gmail.com
 
   Hi.
  
   I have HDFS client and HDFS datanode running on same machine.
  
   When I'm trying to access a dozen of files at once from the client,
  several
   times in a row, I'm starting to receive the following errors on client,
  and
   HDFS browse function.
  
   HDFS Client: Could not get block locations. Aborting...
   HDFS browse: Too many open files
  
   I can increase the maximum number of files that can opened, as I have
 it
   set to the default 1024, but would like to first solve the problem, as
   larger value just means it would run out of files again later on.
  
   So my questions are:
  
   1) Does the HDFS datanode keeps any files opened, even after the HDFS
   client have already closed them?
  
   2) Is it possible to find out, who keeps the opened files - datanode or
   client (so I could pin-point the source of the problem).
  
   Thanks in advance!
  
 



 --
 Pro Hadoop, a book to guide you from beginner to hadoop mastery,
 http://www.amazon.com/dp/1430219424?tag=jewlerymall
 www.prohadoopbook.com a community for Hadoop Professionals




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: Too many open files error, which gets resolved after some time

2009-06-21 Thread jason hadoop
Yes.
Otherwise the file descriptors will flow away like water.
I also strongly suggest having at least 64k file descriptors as the open
file limit.

On Sun, Jun 21, 2009 at 12:43 PM, Stas Oskin stas.os...@gmail.com wrote:

 Hi.

 Thanks for the advice. So you advice explicitly closing each and every file
 handle that I receive from HDFS?

 Regards.

 2009/6/21 jason hadoop jason.had...@gmail.com

  Just to be clear, I second Brian's opinion. Relying on finalizes is a
 very
  good way to run out of file descriptors.
 
  On Sun, Jun 21, 2009 at 9:32 AM, brian.lev...@nokia.com wrote:
 
   IMHO, you should never rely on finalizers to release scarce resources
  since
   you don't know when the finalizer will get called, if ever.
  
   -brian
  
  
  
   -Original Message-
   From: ext jason hadoop [mailto:jason.had...@gmail.com]
   Sent: Sunday, June 21, 2009 11:19 AM
   To: core-user@hadoop.apache.org
   Subject: Re: Too many open files error, which gets resolved after
 some
   time
  
   HDFS/DFS client uses quite a few file descriptors for each open file.
  
   Many application developers (but not the hadoop core) rely on the JVM
   finalizer methods to close open files.
  
   This combination, expecially when many HDFS files are open can result
 in
   very large demands for file descriptors for Hadoop clients.
   We as a general rule never run a cluster with nofile less that 64k, and
  for
   larger clusters with demanding applications have had it set 10x higher.
 I
   also believe there was a set of JVM versions that leaked file
 descriptors
   used for NIO in the HDFS core. I do not recall the exact details.
  
   On Sun, Jun 21, 2009 at 5:27 AM, Stas Oskin stas.os...@gmail.com
  wrote:
  
Hi.
   
After tracing some more with the lsof utility, and I managed to stop
  the
growth on the DataNode process, but still have issues with my DFS
  client.
   
It seems that my DFS client opens hundreds of pipes and eventpolls.
  Here
   is
a small part of the lsof output:
   
java10508 root  387w  FIFO0,6   6142565
  pipe
java10508 root  388r  FIFO0,6   6142565
  pipe
java10508 root  389u     0,100  6142566
eventpoll
java10508 root  390u  FIFO0,6   6135311
  pipe
java10508 root  391r  FIFO0,6   6135311
  pipe
java10508 root  392u     0,100  6135312
eventpoll
java10508 root  393r  FIFO0,6   6148234
  pipe
java10508 root  394w  FIFO0,6   6142570
  pipe
java10508 root  395r  FIFO0,6   6135857
  pipe
java10508 root  396r  FIFO0,6   6142570
  pipe
java10508 root  397r     0,100  6142571
eventpoll
java10508 root  398u  FIFO0,6   6135319
  pipe
java10508 root  399w  FIFO0,6   6135319
  pipe
   
I'm using FSDataInputStream and FSDataOutputStream, so this might be
related
to pipes?
   
So, my questions are:
   
1) What happens these pipes/epolls to appear?
   
2) More important, how I can prevent their accumation and growth?
   
Thanks in advance!
   
2009/6/21 Stas Oskin stas.os...@gmail.com
   
 Hi.

 I have HDFS client and HDFS datanode running on same machine.

 When I'm trying to access a dozen of files at once from the client,
several
 times in a row, I'm starting to receive the following errors on
  client,
and
 HDFS browse function.

 HDFS Client: Could not get block locations. Aborting...
 HDFS browse: Too many open files

 I can increase the maximum number of files that can opened, as I
 have
   it
 set to the default 1024, but would like to first solve the problem,
  as
 larger value just means it would run out of files again later on.

 So my questions are:

 1) Does the HDFS datanode keeps any files opened, even after the
 HDFS
 client have already closed them?

 2) Is it possible to find out, who keeps the opened files -
 datanode
  or
 client (so I could pin-point the source of the problem).

 Thanks in advance!

   
  
  
  
   --
   Pro Hadoop, a book to guide you from beginner to hadoop mastery,
   http://www.amazon.com/dp/1430219424?tag=jewlerymall
   www.prohadoopbook.com a community for Hadoop Professionals
  
 
 
 
  --
  Pro Hadoop, a book to guide you from beginner to hadoop mastery,
  http://www.amazon.com/dp/1430219424?tag=jewlerymall
  www.prohadoopbook.com a community for Hadoop Professionals
 




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: Restrict output of mappers to reducers running on same node?

2009-06-19 Thread jason hadoop
Yes, you are correct. I had not thought about sharing a file handle through
multiple tasks via jvm reuse.


On Thu, Jun 18, 2009 at 9:43 AM, Tarandeep Singh tarand...@gmail.comwrote:

 Jason, correct me if I am wrong-

 opening Sequence file in the configure (or setup method in 0.20) and
 writing
 to it is same as doing output.collect( ), unless you mean I should make the
 sequence file writer static variable and set reuse jvm flag to -1. In that
 case the subsequent mappers might be run in the same JVM and they can use
 the same writer and hence produce one file. But in that case I need to add
 a
 hook to close the writer - may be use the shutdown hook.

 Jothi, the idea of combine input format is good. But I guess I have to
 write
 somethign of my own to make it work in my case.

 Thanks guys for the suggestions... but I feel we should have some support
 from the framework to merge the output of mapper only job so that we don't
 get a lot number of smaller files. Sometimes you just don't want to run
 reducers and unnecessarily transfer a whole lot of data across the network.

 Thanks,
 Tarandeep

 On Wed, Jun 17, 2009 at 7:57 PM, jason hadoop jason.had...@gmail.com
 wrote:

  You can open your sequence file in the mapper configure method, write to
 it
  in your map, and close it in the mapper close method.
  Then you end up with 1 sequence file per map. I am making an assumption
  that
  each key,value to your map some how represents a single xml file/item.
 
  On Wed, Jun 17, 2009 at 7:29 PM, Jothi Padmanabhan 
 joth...@yahoo-inc.com
  wrote:
 
   You could look at CombineFileInputFormat to generate a single split out
  of
   several files.
  
   Your partitioner would be able to assign keys to specific reducers, but
  you
   would not have control on which node a given reduce task will run.
  
   Jothi
  
  
   On 6/18/09 5:10 AM, Tarandeep Singh tarand...@gmail.com wrote:
  
Hi,
   
Can I restrict the output of mappers running on a node to go to
   reducer(s)
running on the same node?
   
Let me explain why I want to do this-
   
I am converting huge number of XML files into SequenceFiles. So
theoretically I don't even need reducers, mappers would read xml
 files
   and
output Sequencefiles. But the problem with this approach is I will
 end
  up
getting huge number of small output files.
   
To avoid generating large number of smaller files, I can Identity
   reducers.
But by running reducers, I am unnecessarily transfering data over
   network. I
ran some test case using a small subset of my data (~90GB). With map
  only
jobs, my cluster finished conversion in only 6 minutes. But with map
  and
Identity reducers job, it takes around 38 minutes.
   
I have to process close to a terabyte of data. So I was thinking of a
   faster
alternatives-
   
* Writing a custom OutputFormat
* Somehow restrict output of mappers running on a node to go to
  reducers
running on the same node. May be I can write my own partitioner
  (simple)
   but
not sure how Hadoop's framework assigns partitions to reduce tasks.
   
Any pointers ?
   
Or this is not possible at all ?
   
Thanks,
Tarandeep
  
  
 
 
  --
  Pro Hadoop, a book to guide you from beginner to hadoop mastery,
  http://www.amazon.com/dp/1430219424?tag=jewlerymall
  www.prohadoopbook.com a community for Hadoop Professionals
 




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: Nor OOM Java Heap Space neither GC OverHead Limit Exeeceded

2009-06-19 Thread jason hadoop
You can pass -D mapred.child.java.opts=-Xmx[some value] as part of your
job, or set it in your job conf before your job is submitted. Then the
per-task JVMs will use that string as part of their JVM initialization
parameters.

The distributed cache is used for making files and archives that are stored
in hdfs available in the local file system working area of your tasks.

The GenericOptionsParser class that most Hadoop user interfaces use
provides a couple of command line arguments that allow you to specify local
file system files which are copied into hdfs and then made available as
stated above:
-files and -libjars are the two arguments.

My book has a solid discussion and example set for the distributed cache in
chapter 5.
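A small job-conf sketch of both points, under the assumption that the cached
file already lives in HDFS (the class name and paths are placeholders):

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

    JobConf conf = new JobConf(MyJob.class);           // MyJob is a placeholder
    // Same effect as -D mapred.child.java.opts=-Xmx512m on the command line:
    conf.set("mapred.child.java.opts", "-Xmx512m");
    // Make an HDFS file available in each task's local working area,
    // reachable through the symlink name "lookup.dat":
    DistributedCache.addCacheFile(new URI("/user/me/lookup.dat#lookup.dat"), conf);
    DistributedCache.createSymlink(conf);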


On Thu, Jun 18, 2009 at 1:45 PM, akhil1988 akhilan...@gmail.com wrote:


 Hi Jason!

 I finally found out that there was some problem in reserving the HEAPSIZE
 which I have resolved now. Actually we cannot change the HADOOP_HEAPSIZE
 using export from our user account, after we have started the Hadoop. It
 has
 to changed by the root.

 I have a user account on the cluster and I was trying to change the
 Hadoop_heapsize from my user account using 'export' which had no effect.
 So I had to request my cluster administrator to increase the
 HADOOP_HEAPSIZE
 in hadoop-env.sh and then restart hadoop. Now the program is running
 absolutely fine. Thanks for your help.

 One thing that I would like to ask you is that can we use DistributerCache
 for transferring directories to the local cache of the tasks?

 Thanks,
 Akhil



 akhil1988 wrote:
 
  Hi Jason!
 
  Thanks for going with me to solve my problem.
 
  To restate things and make it more easier to understand: I am working in
  local mode in the directory which contains the job jar and also the
 Config
  and Data directories.
 
  I just removed the following three statements from my code:
  DistributedCache.addCacheFile(new
  URI(/home/akhil1988/Ner/OriginalNer/Data/), conf);
  DistributedCache.addCacheFile(new
  URI(/home/akhil1988/Ner/OriginalNer/Config/), conf);
  DistributedCache.createSymlink(conf);
 
  The program executes till the same point as before now also and
  terminates. That means the above three statements are of no use while
  working in local mode. In local mode, the working directory for the
  mapreduce tasks becomes the current woking direcotry in which you
 started
  the hadoop command to execute the job.
 
  Since I have removed the DistributedCache.add. statements there
 should
  be no issue whether I am giving a file name or a directory name as
  argument to it. Now it seems to me that there is some problem in reading
  the binary file using binaryRead.
 
  Please let me know if I am going wrong anywhere.
 
  Thanks,
  Akhil
 
 
 
 
 
  jason hadoop wrote:
 
  I have only ever used the distributed cache to add files, including
  binary
  files such as shared libraries.
  It looks like you are adding a directory.
 
  The DistributedCache is not generally used for passing data, but for
  passing
  file names.
  The files must be stored in a shared file system (hdfs for simplicity)
  already.
 
  The distributed cache makes the names available to the tasks, and the
 the
  files are extracted from hdfs and stored in the task local work area on
  each
  task tracker node.
  It looks like you may be storing the contents of your files in the
  distributed cache.
 
  On Wed, Jun 17, 2009 at 6:56 AM, akhil1988 akhilan...@gmail.com
 wrote:
 
 
  Thanks Jason.
 
  I went inside the code of the statement and found out that it
 eventually
  makes some binaryRead function call to read a binary file and there it
  strucks.
 
  Do you know whether there is any problem in giving a binary file for
  addition to the distributed cache.
  In the statement DistributedCache.addCacheFile(new
  URI(/home/akhil1988/Ner/OriginalNer/Data/), conf); Data is a
 directory
  which contains some text as well as some binary files. In the statement
  Parameters.readConfigAndLoadExternalData(Config/allLayer1.config); I
  can
  see(in the output messages) that it is able to read the text files but
  it
  gets struck at the binary files.
 
  So, I think here the problem is: it is not able to read the binary
 files
  which either have not been transferred to the cache or a binary file
  cannot
  be read.
 
  Do you know the solution to this?
 
  Thanks,
  Akhil
 
 
  jason hadoop wrote:
  
   Something is happening inside of your (Parameters.
   readConfigAndLoadExternalData(Config/allLayer1.config);)
   code, and the framework is killing the job for not heartbeating for
  600
   seconds
  
   On Tue, Jun 16, 2009 at 8:32 PM, akhil1988 akhilan...@gmail.com
  wrote:
  
  
   One more thing, finally it terminates there (after some time) by
  giving
   the
   final Exception:
  
   java.io.IOException: Job failed!
  at
  org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
   at LbjTagger.NerTagger.main(NerTagger.java

Re: JobControl for Pipes?

2009-06-18 Thread jason hadoop
http://www.cascading.org/
https://issues.apache.org/jira/browse/HADOOP-5303 (oozie)

On Thu, Jun 18, 2009 at 6:19 AM, Roshan James 
roshan.james.subscript...@gmail.com wrote:

 Can you give me a url or so to both? I cant seem to find either one after a
 couple of basic web searches.

 Also, when you say JobControl is coming to Hadoop - I can already see the
 Java JobControl classes that lets one express dependancies between jobs. So
 I assume this already works in Java - does it not? I was asking if this
 functionality is exposed via Pipes in some way.

 Roshan


 On Wed, Jun 17, 2009 at 10:59 PM, jason hadoop jason.had...@gmail.com
 wrote:

  Job control is coming with the Hadoop WorkFlow manager, in the mean time
  there is cascade by chris wensel. I do not have any personal experience
  with
  either. I do not know how pipes interacts with either.
 
  On Wed, Jun 17, 2009 at 12:43 PM, Roshan James 
  roshan.james.subscript...@gmail.com wrote:
 
   Hello, Is there any way to express dependencies between map-reduce jobs
   (such as in org.apache.hadoop.mapred.jobcontrol) for pipes jobs?  The
   provided header Pipes.hh does not seem to reflect any such
 capabilities.
  
   best,
   Roshan
  
 
 
 
  --
  Pro Hadoop, a book to guide you from beginner to hadoop mastery,
  http://www.amazon.com/dp/1430219424?tag=jewlerymall
  www.prohadoopbook.com a community for Hadoop Professionals
 




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: Getting Task ID inside a Mapper

2009-06-18 Thread jason hadoop
The task id is readily available if you override the configure method.
The MapReduceBase class in the Pro Hadoop book examples does this and makes
the taskId available as a class field.
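For example, with the old org.apache.hadoop.mapred API (a sketch; the class
name and key/value types are placeholders):

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class LdaMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private Random random;

  public void configure(JobConf job) {
    // mapred.task.partition is stable across re-runs of the same map,
    // so a failed-and-rerun task gets the same seed.
    int partition = job.getInt("mapred.task.partition", -1);
    String attemptId = job.get("mapred.task.id");   // full attempt id; changes on re-run
    random = new Random(partition);
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // ... use random here; the sequence repeats if the task is re-executed ...
  }
}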

On Thu, Jun 18, 2009 at 7:33 AM, Mark Desnoyer mdesno...@gmail.com wrote:

 Thanks! I'll try that.

 -Mark

 On Thu, Jun 18, 2009 at 10:27 AM, Jingkei Ly jingkei...@detica.com
 wrote:

  I think you can use job.getInt(mapred.task.partition,-1) to get the
  mapper ID, which should be the same for the mapper across task reruns.
 
  -Original Message-
  From: Piotr Praczyk [mailto:piotr.prac...@gmail.com]
  Sent: 18 June 2009 15:19
  To: core-user@hadoop.apache.org
  Subject: Re: Getting Task ID inside a Mapper
 
  Hi
  Why don't you provide a seed of random generator generated outside the
  task
  ? Then when the task fails, you can provide the same value stored
  somewhere
  outside.
  You could use the task configuration to do so.
 
  I don't know anything about obtaining the task ID from within.
 
 
  regards
  Piotr
 
  2009/6/18 Mark Desnoyer mdesno...@gmail.com
 
   Hi,
  
   I was wondering if it's possible to get a hold of the task id inside a
   mapper? I cant' seem to find a way by trolling through the API
  reference.
   I'm trying to implement a Map Reduce version of Latent Dirichlet
  Allocation
   and I need to be able to initialize a random number generator in a
  task
   specific way so that if the task fails and is rerun elsewhere, the
  results
   are the same. Thanks in advance.
  
   Cheers,
   Mark Desnoyer
  
 
 
 
  This message should be regarded as confidential. If you have received
 this
  email in error please notify the sender and destroy it immediately.
  Statements of intent shall only become binding when confirmed in hard
 copy
  by an authorised signatory.  The contents of this email may relate to
  dealings with other companies within the Detica Group plc group of
  companies.
 
  Detica Limited is registered in England under No: 1337451.
 
  Registered offices: Surrey Research Park, Guildford, Surrey, GU2 7YP,
  England.
 
 
 




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: Practical limit on emitted map/reduce values

2009-06-18 Thread jason hadoop
In general, if the values become very large it becomes simpler to store them
out of line in hdfs, and just pass the hdfs path for the item as the value in
the map/reduce task.
This greatly reduces the amount of IO done, and doesn't blow up the sort
space on the reducer.
You lose the magic of data locality, but given the item size, you gain
the IO back by not having to pass the full values to the reducer or handle
them when sorting the map outputs.
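A sketch of the map side of that pattern with the old org.apache.hadoop.mapred
API (the class name, key and path layout are placeholders):

import java.io.IOException;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class LargeValueMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private FileSystem fs;
  private String taskId;

  public void configure(JobConf job) {
    taskId = job.get("mapred.task.id");
    try {
      fs = FileSystem.get(job);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // Store the large payload out of line in HDFS...
    Path item = new Path("/data/items/" + taskId + "_" + key.get());
    FSDataOutputStream out = fs.create(item);
    try {
      out.write(value.getBytes(), 0, value.getLength());
    } finally {
      out.close();
    }
    // ...and emit only its path; the reducer opens the file when it needs the bytes.
    output.collect(new Text("item"), new Text(item.toString()));
  }
}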

On Thu, Jun 18, 2009 at 8:45 AM, Leon Mergen l.p.mer...@solatis.com wrote:

 Hello Owen,

  Keys and values can be large. They are certainly capped above by
  Java's 2GB limit on byte arrays. More practically, you will have
  problems running out of memory with keys or values of 100 MB. There is
  no restriction that a key/value pair fits in a single hdfs block, but
  performance would suffer. (In particular, the FileInputFormats split
  at block sized chunks, which means you will have maps that scan an
  entire block without processing anything.)

 Thanks for the quick reply.

 Could you perhaps elaborate on that 100 MB limit ? Is that due to a limit
 that is caused by the Java VM heap size ? If so, could that, for example, be
 increased to 512MB by setting mapred.child.java.opts to '-Xmx512m' ?

 Regards,

 Leon Mergen




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: Nor OOM Java Heap Space neither GC OverHead Limit Exeeceded

2009-06-17 Thread jason hadoop
I have only ever used the distributed cache to add files, including binary
files such as shared libraries.
It looks like you are adding a directory.

The DistributedCache is not generally used for passing data, but for passing
file names.
The files must be stored in a shared file system (hdfs for simplicity)
already.

The distributed cache makes the names available to the tasks, and the
files are extracted from hdfs and stored in the task local work area on each
task tracker node.
It looks like you may be storing the contents of your files in the
distributed cache.

On Wed, Jun 17, 2009 at 6:56 AM, akhil1988 akhilan...@gmail.com wrote:


 Thanks Jason.

 I went inside the code of the statement and found out that it eventually
 makes some binaryRead function call to read a binary file and there it
 strucks.

 Do you know whether there is any problem in giving a binary file for
 addition to the distributed cache.
 In the statement DistributedCache.addCacheFile(new
 URI(/home/akhil1988/Ner/OriginalNer/Data/), conf); Data is a directory
 which contains some text as well as some binary files. In the statement
 Parameters.readConfigAndLoadExternalData(Config/allLayer1.config); I can
 see(in the output messages) that it is able to read the text files but it
 gets struck at the binary files.

 So, I think here the problem is: it is not able to read the binary files
 which either have not been transferred to the cache or a binary file cannot
 be read.

 Do you know the solution to this?

 Thanks,
 Akhil


 jason hadoop wrote:
 
  Something is happening inside of your (Parameters.
  readConfigAndLoadExternalData(Config/allLayer1.config);)
  code, and the framework is killing the job for not heartbeating for 600
  seconds
 
  On Tue, Jun 16, 2009 at 8:32 PM, akhil1988 akhilan...@gmail.com wrote:
 
 
  One more thing, finally it terminates there (after some time) by giving
  the
  final Exception:
 
  java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
  at LbjTagger.NerTagger.main(NerTagger.java:109)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at
 
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
 at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
 at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
 
 
  akhil1988 wrote:
  
   Thank you Jason for your reply.
  
   My Map class is an inner class and it is a static class. Here is the
   structure of my code.
  
   public class NerTagger {
  
   public static class Map extends MapReduceBase implements
   MapperLongWritable, Text, Text, Text{
   private Text word = new Text();
   private static NETaggerLevel1 tagger1 = new
   NETaggerLevel1();
   private static NETaggerLevel2 tagger2 = new
   NETaggerLevel2();
  
   Map(){
   System.out.println(HI2\n);
  
   Parameters.readConfigAndLoadExternalData(Config/allLayer1.config);
   System.out.println(HI3\n);
  
   Parameters.forceNewSentenceOnLineBreaks=Boolean.parseBoolean(true);
  
   System.out.println(loading the tagger);
  
  
 
 tagger1=(NETaggerLevel1)Classifier.binaryRead(Parameters.pathToModelFile+.level1);
   System.out.println(HI5\n);
  
  
 
 tagger2=(NETaggerLevel2)Classifier.binaryRead(Parameters.pathToModelFile+.level2);
   System.out.println(Done- loading the
 tagger);
   }
  
   public void map(LongWritable key, Text value,
   OutputCollectorText, Text output, Reporter reporter ) throws
  IOException
   {
   String inputline = value.toString();
  
   /* Processing of the input pair is done here
 */
   }
  
  
   public static void main(String [] args) throws Exception {
   JobConf conf = new JobConf(NerTagger.class);
   conf.setJobName(NerTagger);
  
   conf.setOutputKeyClass(Text.class);
   conf.setOutputValueClass(IntWritable.class);
  
   conf.setMapperClass(Map.class);
   conf.setNumReduceTasks(0);
  
   conf.setInputFormat(TextInputFormat.class);
   conf.setOutputFormat(TextOutputFormat.class);
  
   conf.set(mapred.job.tracker, local);
   conf.set(fs.default.name, file:///);
  
   DistributedCache.addCacheFile(new

Re: [ADV] Blatant marketing of the book Pro Hadoop. In honor of the 09 summit here is a 50% off coupon corrected code is LUCKYOU

2009-06-17 Thread jason hadoop
You can purchase the ebook from www.apress.com. The final copy is now
available.
There is a 50% off coupon good for a few more days, LUCKYOU.

You can try prohadoop.ning.com as an alternative to www.prohadoopbook.com,
or www.prohadoop.com.

What error do you receive when you try to visit www.prohadoopbook.com ?

2009/6/17 zjffdu zjf...@gmail.com

 HI Jason,

 Where can I download your books' Alpha Chapters, I am very interested in
 your book about hadoop.

 And I cannot visit the link www.prohadoopbook.com



 -Original Message-
 From: jason hadoop [mailto:jason.had...@gmail.com]
 Sent: June 9, 2009 20:47
 To: core-user@hadoop.apache.org
 Subject: [ADV] Blatant marketing of the book Pro Hadoop. In honor of the 09
 summit here is a 50% off coupon corrected code is LUCKYOU

 http://eBookshop.apress.com CODE LUCKYOU

 --
 Alpha Chapters of my book on Hadoop are available
 http://www.apress.com/book/view/9781430219422
 www.prohadoopbook.com a community for Hadoop Professionals




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: Restrict output of mappers to reducers running on same node?

2009-06-17 Thread jason hadoop
You can open your sequence file in the mapper configure method, write to it
in your map method, and close it in the mapper close method.
Then you end up with one sequence file per map task. I am making the assumption
that each key/value pair passed to your map somehow represents a single xml
file/item.
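A sketch of that pattern, assuming the 0.19/0.20 org.apache.hadoop.mapred API
and that each input record carries one xml document (the class name and the
record key are placeholders):

import java.io.IOException;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class XmlToSeqMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private SequenceFile.Writer writer;

  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.get(job);
      // One file per map task, written into the task's work area so the
      // framework promotes it to the job output directory on commit.
      Path out = new Path(FileOutputFormat.getWorkOutputPath(job),
          job.get("mapred.task.id") + ".seq");
      writer = SequenceFile.createWriter(fs, job, out, Text.class, Text.class);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    writer.append(new Text("doc"), value);   // no output.collect(); the writer is the output
  }

  public void close() throws IOException {
    writer.close();                          // flush this map task's sequence file
  }
}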

On Wed, Jun 17, 2009 at 7:29 PM, Jothi Padmanabhan joth...@yahoo-inc.comwrote:

 You could look at CombineFileInputFormat to generate a single split out of
 several files.

 Your partitioner would be able to assign keys to specific reducers, but you
 would not have control on which node a given reduce task will run.

 Jothi


 On 6/18/09 5:10 AM, Tarandeep Singh tarand...@gmail.com wrote:

  Hi,
 
  Can I restrict the output of mappers running on a node to go to
 reducer(s)
  running on the same node?
 
  Let me explain why I want to do this-
 
  I am converting huge number of XML files into SequenceFiles. So
  theoretically I don't even need reducers, mappers would read xml files
 and
  output Sequencefiles. But the problem with this approach is I will end up
  getting huge number of small output files.
 
  To avoid generating large number of smaller files, I can Identity
 reducers.
  But by running reducers, I am unnecessarily transfering data over
 network. I
  ran some test case using a small subset of my data (~90GB). With map only
  jobs, my cluster finished conversion in only 6 minutes. But with map and
  Identity reducers job, it takes around 38 minutes.
 
  I have to process close to a terabyte of data. So I was thinking of a
 faster
  alternatives-
 
  * Writing a custom OutputFormat
  * Somehow restrict output of mappers running on a node to go to reducers
  running on the same node. May be I can write my own partitioner (simple)
 but
  not sure how Hadoop's framework assigns partitions to reduce tasks.
 
  Any pointers ?
 
  Or this is not possible at all ?
 
  Thanks,
  Tarandeep




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: JobControl for Pipes?

2009-06-17 Thread jason hadoop
Job control is coming with the Hadoop workflow manager; in the meantime there
is Cascading by Chris Wensel. I do not have any personal experience with
either. I do not know how Pipes interacts with either.

On Wed, Jun 17, 2009 at 12:43 PM, Roshan James 
roshan.james.subscript...@gmail.com wrote:

 Hello, Is there any way to express dependencies between map-reduce jobs
 (such as in org.apache.hadoop.mapred.jobcontrol) for pipes jobs?  The
 provided header Pipes.hh does not seem to reflect any such capabilities.

 best,
 Roshan




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: Can I share datas for several map tasks?

2009-06-16 Thread jason hadoop
The examples for my book include a jvm reuse example in which static data is
shared between the tasks that run in a reused jvm.
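Applied to your TokenizerMapper, the usual pattern is a guarded static
initialization in setup(), so only the first task in each reused JVM pays the
cost (a sketch against the 0.20 org.apache.hadoop.mapreduce API; note the data
is shared per task JVM, not per node):

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

  // One copy per task JVM; reused by every task that runs in that JVM
  // when mapred.job.reuse.jvm.num.tasks is -1 (or greater than 1).
  private static int[] sharedData;

  protected void setup(Context context) throws IOException, InterruptedException {
    synchronized (TokenizerMapper.class) {
      if (sharedData == null) {            // only the first task in this JVM initializes it
        sharedData = new int[1024 * 1024 * 16];
        sharedData[0] = 12345;
      }
    }
  }

  // map() can read sharedData directly; later tasks in the same JVM skip the init.
}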

On Tue, Jun 16, 2009 at 1:08 AM, Hello World snowlo...@gmail.com wrote:

 Thanks for your reply. Can you do me a favor to make a check?
 I modified mapred-default.xml as follows:
540 property
541   namemapred.job.reuse.jvm.num.tasks/name
542   value-1/value
543   descriptionHow many tasks to run per jvm. If set to -1, there is
544   no limit.
545   /description
546 /property
 And execute bin/stop-all.sh; bin/start-all.sh to restart hadoop;

 This is my program:

 17 public class WordCount {
 18
 19   public static class TokenizerMapper
 20extends MapperObject, Text, Text, IntWritable{
 21
 22 private final static IntWritable one = new IntWritable(1);
 23 private Text word = new Text();
 24 public static int[] ToBeSharedData = new int[1024 * 1024 * 16];
 25
 26 protected void setup(Context context
 27 ) throws IOException, InterruptedException {
 28 //Init shared data
 29 ToBeSharedData[0] = 12345;
 30 System.out.println(setup shared data[0] =  +
 ToBeSharedData[0]);
 31 }
 32
 33 public void map(Object key, Text value, Context context
 34 ) throws IOException, InterruptedException {
 35   StringTokenizer itr = new StringTokenizer(value.toString());
 36   while (itr.hasMoreTokens()) {
 37 word.set(itr.nextToken());
 38 context.write(word, one);
 39   }
 40   System.out.println(read shared data[0] =  +
 ToBeSharedData[0]);
 41 }
 42   }

 First, can you tell me how to make sure jvm reuse is taking effect, for I
 didn't see anything different from before. I use top command under linux
 and see the same number of java processes and same memory usage.

 Second, can you tell me how to make the ToBeSharedData be inited only
 once
 and can be read from other MapTasks on the same node? Or this is not a
 suitable programming style for map-reduce?

 By the way, I'm using hadoop-0.20.0, in pseudo-distributed mode on a
 single-node.
 thanks in advance

 On Tue, Jun 16, 2009 at 1:48 PM, Sharad Agarwal shara...@yahoo-inc.com
 wrote:

 
  snowloong wrote:
   Hi,
   I want to share some data structures for the map tasks on a same
 node(not
  through files), I mean, if one map task has already initialized some data
  structures (e.g. an array or a list), can other map tasks share these
  memorys and directly access them, for I don't want to reinitialize these
  datas and I want to save some memory. Can hadoop help me do this?
 
  You can enable jvm reuse across tasks. See mapred.job.reuse.jvm.num.tasks
  in mapred-default.xml for usage. Then you can cache the data in a static
  variable in your mapper.
 
  - Sharad
 




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Datanodes fail to start

2009-06-16 Thread jason hadoop
I often find myself editing src/saveVersion.sh to fake out the version
numbers when I build a hadoop jar for the first time and have to deploy it
on an already running cluster.


On Mon, Jun 15, 2009 at 11:57 PM, Ian jonhson jonhson@gmail.com wrote:

 If you rebuilt Hadoop, following the HowToRelease wiki page may
 reduce the trouble that occurred.


 On Sat, May 16, 2009 at 7:20 AM, Pankil Doshiforpan...@gmail.com wrote:
  I got the solution..
 
  Namespace IDs were somehow incompatible. So I had to clean the data dir and
  temp dir, format the cluster and make a fresh start
 
  Pankil
 
  On Fri, May 15, 2009 at 2:25 AM, jason hadoop jason.had...@gmail.com
 wrote:
 
  There should be a few more lines at the end.
  We only want the part from the last STARTUP_MSG to the end
 
  On one of mine a successfull start looks like this:
  STARTUP_MSG: Starting DataNode
  STARTUP_MSG:   host = at/192.168.1.119
  STARTUP_MSG:   args = []
  STARTUP_MSG:   version = 0.19.1-dev
  STARTUP_MSG:   build =  -r ; compiled by 'jason' on Tue Mar 17 04:03:57
 PDT
  2009
  /
  2009-03-17 03:08:11,884 INFO
  org.apache.hadoop.hdfs.server.datanode.DataNode: Registered
  FSDatasetStatusMBean
  2009-03-17 03:08:11,886 INFO
  org.apache.hadoop.hdfs.server.datanode.DataNode: Opened info server at
  50010
  2009-03-17 03:08:11,889 INFO
  org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwith is
  1048576 bytes/s
  2009-03-17 03:08:12,142 INFO org.mortbay.http.HttpServer: Version
  Jetty/5.1.4
  2009-03-17 03:08:12,155 INFO org.mortbay.util.Credential: Checking
 Resource
  aliases
  2009-03-17 03:08:12,518 INFO org.mortbay.util.Container: Started
  org.mortbay.jetty.servlet.webapplicationhand...@1e184cb
  2009-03-17 03:08:12,578 INFO org.mortbay.util.Container: Started
  WebApplicationContext[/static,/static]
  2009-03-17 03:08:12,721 INFO org.mortbay.util.Container: Started
  org.mortbay.jetty.servlet.webapplicationhand...@1d9e282
  2009-03-17 03:08:12,722 INFO org.mortbay.util.Container: Started
  WebApplicationContext[/logs,/logs]
  2009-03-17 03:08:12,878 INFO org.mortbay.util.Container: Started
  org.mortbay.jetty.servlet.webapplicationhand...@14a75bb
  2009-03-17 03:08:12,884 INFO org.mortbay.util.Container: Started
  WebApplicationContext[/,/]
  2009-03-17 03:08:12,951 INFO org.mortbay.http.SocketListener: Started
  SocketListener on 0.0.0.0:50075
  2009-03-17 03:08:12,951 INFO org.mortbay.util.Container: Started
  org.mortbay.jetty.ser...@1358f03
  2009-03-17 03:08:12,957 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
  Initializing JVM Metrics with processName=DataNode, sessionId=null
  2009-03-17 03:08:13,242 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
  Initializing RPC Metrics with hostName=DataNode, port=50020
  2009-03-17 03:08:13,264 INFO org.apache.hadoop.ipc.Server: IPC Server
  Responder: starting
  2009-03-17 03:08:13,304 INFO org.apache.hadoop.ipc.Server: IPC Server
  listener on 50020: starting
  2009-03-17 03:08:13,343 INFO org.apache.hadoop.ipc.Server: IPC Server
  handler 0 on 50020: starting
  2009-03-17 03:08:13,343 INFO
  org.apache.hadoop.hdfs.server.datanode.DataNode: dnRegistration =
  DatanodeRegistration(192.168.1.119:50010,
  storageID=DS-540597485-192.168.1.119-50010-1237022386925,
 infoPort=50075,
  ipcPort=50020)
  2009-03-17 03:08:13,344 INFO org.apache.hadoop.ipc.Server: IPC Server
  handler 1 on 50020: starting
  2009-03-17 03:08:13,344 INFO org.apache.hadoop.ipc.Server: IPC Server
  handler 2 on 50020: starting
  2009-03-17 03:08:13,351 INFO
  org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(
  192.168.1.119:50010,
  storageID=DS-540597485-192.168.1.119-50010-1237022386925,
 infoPort=50075,
  ipcPort=50020)In DataNode.run, data =
  FSDataset{dirpath='/tmp/hadoop-0.19.0-jason/dfs/data/current'}
  2009-03-17 03:08:13,352 INFO
  org.apache.hadoop.hdfs.server.datanode.DataNode: using
 BLOCKREPORT_INTERVAL
  of 360msec Initial delay: 0msec
  2009-03-17 03:08:13,391 INFO
  org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 14
 blocks
  got processed in 27 msecs
  2009-03-17 03:08:13,392 INFO
  org.apache.hadoop.hdfs.server.datanode.DataNode: Starting Periodic block
  scanner.
 
 
 
  On Thu, May 14, 2009 at 9:51 PM, Pankil Doshi forpan...@gmail.com
 wrote:
 
   This is log from datanode.
  
  
   2009-05-14 00:36:14,559 INFO org.apache.hadoop.dfs.DataNode:
 BlockReport
  of
   82 blocks got processed in 12 msecs
   2009-05-14 01:36:15,768 INFO org.apache.hadoop.dfs.DataNode:
 BlockReport
  of
   82 blocks got processed in 8 msecs
   2009-05-14 02:36:13,975 INFO org.apache.hadoop.dfs.DataNode:
 BlockReport
  of
   82 blocks got processed in 9 msecs
   2009-05-14 03:36:15,189 INFO org.apache.hadoop.dfs.DataNode:
 BlockReport
  of
   82 blocks got processed in 12 msecs
   2009-05-14 04:36:13,384 INFO org.apache.hadoop.dfs.DataNode:
 BlockReport
  of
   82 blocks got processed in 9

Re: Debugging Map-Reduce programs

2009-06-16 Thread jason hadoop
When you are running in local mode you have 2 basic choices if you want to
interact with a debugger.
You can launch from within eclipse or other IDE, or you can setup a java
debugger transport as part of the mapred.child.java.opts variable, and
attach to the running jvm.
By far the simplest is loading via eclipse.

Your other alternative is to tell the framework to retain the job files
via keep.failed.task.files (be careful here: you will fill your disk with old
dead data) and then debug with the IsolationRunner.

Examples in my book :)
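
For the second option, the child JVM options look roughly like the sketch below (MyJob
and the port are placeholders; in local mode there is no child JVM, so the same
-agentlib:jdwp string would instead go into HADOOP_OPTS, or just launch from eclipse):

// Sketch: make each task JVM wait for a remote debugger on port 8000.
// With suspend=y, run a single task at a time or the port will collide.
JobConf conf = new JobConf(MyJob.class);
conf.set("mapred.child.java.opts",
    "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000");
conf.setKeepFailedTaskFiles(true);   // keep.failed.task.files, for IsolationRunner reruns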


On Mon, Jun 15, 2009 at 6:49 PM, bharath vissapragada 
bharathvissapragada1...@gmail.com wrote:

 I am running in a local mode . Can you tell me how to set those breakpoints
 or how to access those files so that i can debug the program.

 The program is generating  = java.lang.NumberFormatException: For input
 string: 

 But that particular string is the one which is the input to the map class.
 So I think that it is not reading my input correctly .. But when I try to
 print the same .. it isn't printing to the STDOUT ..
 I am using the FileInputFormat class

  FileInputFormat.addInputPath(conf, new
 Path(/home/rip/Desktop/hadoop-0.18.3/input));
 FileOutputFormat.setOutputPath(conf, new
 Path(/home/rip/Desktop/hadoop-0.18.3/output));

 input and output are the folders for input and output.

 It is generating these warnings also

 09/06/16 12:38:32 WARN fs.FileSystem: local is a deprecated filesystem
 name. Use file:/// instead.

 Thanks in advance


 On Tue, Jun 16, 2009 at 3:50 AM, Aaron Kimball aa...@cloudera.com wrote:

  On Mon, Jun 15, 2009 at 10:01 AM, bharath vissapragada 
  bhara...@students.iiit.ac.in wrote:
 
   Hi all ,
  
   When running hadoop in local mode .. can we use print statements to
  print
   something to the terminal ...
 
 
  Yes. In distributed mode, each task will write its stdout/stderr to files
  which you can access through the web-based interface.
 
 
  
   Also iam not sure whether the program is reading my input files ... If
 i
   keep print statements it isn't displaying any .. can anyone tell me how
  to
   solve this problem.
 
 
  Is it generating exceptions? Are the files present? If you're running in
  local mode, you can use a debugger; set a breakpoint in your map() method
  and see if it gets there. How are you configuring the input files for
 your
  job?
 
 
  
  
   Thanks in adance,
  
 




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals


Re: Nor OOM Java Heap Space neither GC OverHead Limit Exeeceded

2009-06-16 Thread jason hadoop
Is it possible that your map class is an inner class and not static?

On Tue, Jun 16, 2009 at 10:51 AM, akhil1988 akhilan...@gmail.com wrote:


 Hi All,

 I am running my mapred program in local mode by setting
 mapred.jobtracker.local to local mode so that I can debug my code.
 The mapred program is a direct porting of my original sequential code.
 There
 is no reduce phase.
 Basically, I have just put my program in the map class.

 My program takes around 1-2 min. in instantiating the data objects which
 are
 present in the constructor of Map class(it loads some data model files,
 therefore it takes some time). After the instantiation part in the
  constructor of the Map class, the map function is supposed to process the input
 split.

 The problem is that the data objects do not get instantiated completely and
 in between(whlie it is still in constructor) the program stops giving the
 exceptions pasted at bottom.
 The program runs fine without mapreduce and does not require more than 2GB
 memory, but in mapreduce even after doing export HADOOP_HEAPSIZE=2500(I am
 working on machines with 16GB RAM), the program fails. I have also set
 HADOOP_OPTS=-server -XX:-UseGCOverheadLimit as sometimes I was getting GC
 Overhead Limit Exceeded exceptions also.

 Somebody, please help me with this problem: I have trying to debug it for
 the last 3 days, but unsuccessful. Thanks!

 java.lang.OutOfMemoryError: Java heap space
at
 sun.misc.FloatingDecimal.toJavaFormatString(FloatingDecimal.java:889)
at java.lang.Double.toString(Double.java:179)
at java.text.DigitList.set(DigitList.java:272)
at java.text.DecimalFormat.format(DecimalFormat.java:584)
at java.text.DecimalFormat.format(DecimalFormat.java:507)
at java.text.NumberFormat.format(NumberFormat.java:269)
at
 org.apache.hadoop.util.StringUtils.formatPercent(StringUtils.java:110)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1147)
at LbjTagger.NerTagger.main(NerTagger.java:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

 09/06/16 12:34:41 WARN mapred.LocalJobRunner: job_local_0001
 java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81)
at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
at
 org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
at
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:328)
at
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
 Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native
 Method)
at

 sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at

 sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:79)
... 5 more
 Caused by: java.lang.ThreadDeath
at java.lang.Thread.stop(Thread.java:715)
at
 org.apache.hadoop.mapred.LocalJobRunner.killJob(LocalJobRunner.java:310)
at
 org.apache.hadoop.mapred.JobClient$NetworkedJob.killJob(JobClient.java:315)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1224)
at LbjTagger.NerTagger.main(NerTagger.java:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

 --
 View this message in context:
 

Re: Nor OOM Java Heap Space neither GC OverHead Limit Exeeceded

2009-06-16 Thread jason hadoop
Something is happening inside of your
Parameters.readConfigAndLoadExternalData("Config/allLayer1.config")
call, and the framework is killing the job for not heartbeating for 600
seconds.
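
Two common ways around that, sketched below rather than taken from the thread: give the
task more slack than the default 600 seconds, or keep heartbeating while the load runs.

// Sketch: raise the per-task timeout (value is in milliseconds; 0 disables it).
conf.set("mapred.task.timeout", "1800000");
// Alternatively, do the slow Parameters.readConfigAndLoadExternalData() call lazily
// inside map(), with a small helper thread calling reporter.progress() every few
// seconds until it finishes, so the TaskTracker keeps seeing a live task.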

On Tue, Jun 16, 2009 at 8:32 PM, akhil1988 akhilan...@gmail.com wrote:


 One more thing, finally it terminates there (after some time) by giving the
 final Exception:

 java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
 at LbjTagger.NerTagger.main(NerTagger.java:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)


 akhil1988 wrote:
 
  Thank you Jason for your reply.
 
  My Map class is an inner class and it is a static class. Here is the
  structure of my code.
 
  public class NerTagger {
 
  public static class Map extends MapReduceBase implements
   Mapper<LongWritable, Text, Text, Text>{
  private Text word = new Text();
  private static NETaggerLevel1 tagger1 = new
  NETaggerLevel1();
  private static NETaggerLevel2 tagger2 = new
  NETaggerLevel2();
 
  Map(){
   System.out.println("HI2\n");
  
   Parameters.readConfigAndLoadExternalData("Config/allLayer1.config");
   System.out.println("HI3\n");
  
   Parameters.forceNewSentenceOnLineBreaks=Boolean.parseBoolean("true");
  
   System.out.println("loading the tagger");
  
   tagger1=(NETaggerLevel1)Classifier.binaryRead(Parameters.pathToModelFile+".level1");
   System.out.println("HI5\n");
  
   tagger2=(NETaggerLevel2)Classifier.binaryRead(Parameters.pathToModelFile+".level2");
   System.out.println("Done- loading the tagger");
  }
 
  public void map(LongWritable key, Text value,
   OutputCollector<Text, Text> output, Reporter reporter ) throws
 IOException
  {
  String inputline = value.toString();
 
  /* Processing of the input pair is done here */
  }
 
 
  public static void main(String [] args) throws Exception {
  JobConf conf = new JobConf(NerTagger.class);
  conf.setJobName(NerTagger);
 
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);
 
  conf.setMapperClass(Map.class);
  conf.setNumReduceTasks(0);
 
  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);
 
   conf.set("mapred.job.tracker", "local");
   conf.set("fs.default.name", "file:///");
  
   DistributedCache.addCacheFile(new
   URI("/home/akhil1988/Ner/OriginalNer/Data/"), conf);
   DistributedCache.addCacheFile(new
   URI("/home/akhil1988/Ner/OriginalNer/Config/"), conf);
  DistributedCache.createSymlink(conf);
 
 
   conf.set("mapred.child.java.opts", "-Xmx4096m");
 
  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
 
  System.out.println(HI1\n);
 
  JobClient.runJob(conf);
  }
 
   Jason, when the program executes, HI1 and HI2 are printed but it does not
   reach HI3. In the statement
  Parameters.readConfigAndLoadExternalData(Config/allLayer1.config); it
 is
  able to access Config/allLayer1.config file (as while executing this
  statement, it prints some messages like which data it is loading, etc.)
  but it gets stuck there(while loading some classifier) and never reaches
  HI3.
 
  This program runs fine when executed normally(without mapreduce).
 
  Thanks, Akhil
 
 
 
 
  jason hadoop wrote:
 
  Is it possible that your map class is an inner class and not static?
 
  On Tue, Jun 16, 2009 at 10:51 AM, akhil1988 akhilan...@gmail.com
 wrote:
 
 
  Hi All,
 
  I am running my mapred program in local mode by setting
  mapred.jobtracker.local to local mode so that I can debug my code.
  The mapred program is a direct porting of my original sequential code.
  There
  is no reduce phase.
  Basically, I have just put my program in the map class.
 
  My program takes around 1-2 min. in instantiating the data objects
 which
  are
  present in the constructor of Map class(it loads some

Re: java.lang.ClassNotFoundException

2009-06-15 Thread jason hadoop
Your class is not in your jar, or your jar is not avialable in the hadoop
class path.

On Mon, Jun 15, 2009 at 2:39 AM, bharath vissapragada 
bhara...@students.iiit.ac.in wrote:

 Hi all ,

 When i try to run my own progam (jar file)  i get the following error.

 java.lang.ClassNotFoundException : file-Name.class
at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:336)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68


 Can anyone tell me the reason(s) for this ..

 Thanks in advance




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: get number of values for a key

2009-06-14 Thread jason hadoop
It would be nice if there was an interface compliant way. Perhaps it becomes
available in the 0.20 and beyond api's.

On Sat, Jun 13, 2009 at 3:40 PM, Rares Vernica rvern...@gmail.com wrote:

 Hello,

 In Reduce, can I get the number of values for the current key without
 iterating over them? Does Hadoop has this number?

 Or, at least the total number of pairs that will be processed by the
 current Reduce instance. I am pretty sure that Hadoop already knows
 this number because it sorted them.

 BTW, the iterators given to Reduce are one-time use iterators, right?

 Thanks!
 Rares




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: 2009 Hadoop Summit West - was wonderful

2009-06-14 Thread jason hadoop
This is the best I have at present: http://www.cs.cmu.edu/~priya/

On Sat, Jun 13, 2009 at 11:05 AM, zsongbo zson...@gmail.com wrote:

 Hi Jason,
  Could you please post more information about Priya Narasimhan's toolset
 for automated fault detection in hadoop clusters?
 Such as url or others.

 Thanks.
 Schubert

 On Thu, Jun 11, 2009 at 11:26 AM, jason hadoop jason.had...@gmail.com
 wrote:

  I had a great time, schmoozing with people, and enjoyed a couple of the
 talks
 
  I would love to see more from Priya Narasimhan, hope their toolset for
  automated fault detection in hadoop clusters becomes generally available.
  Zookeeper rocks on!
 
  Hbase is starting to look really good, in 0.20 the master node and the
  single point of failure and configuration headache goes away and
 Zookeeper
  takes over.
 
  Owen O'Malley gave a solid presentation on the new Hadoop APIs and the
  reasons for the changes.
 
  It was good to hang with everyone, see you all next year!
 
  I even got to spend a little time chatting with Tom White, and a signed
  copy
  of his book, thanks Tom!
 
 
  --
  Pro Hadoop, a book to guide you from beginner to hadoop mastery,
  http://www.apress.com/book/view/9781430219422
  www.prohadoopbook.com a community for Hadoop Professionals
 




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: The behavior of HashPartitioner

2009-06-12 Thread jason hadoop
You can always write something simple to hand call the HashPartitioner.
Jython works for quick tests.
But the code in HashPartitioner is essentially
(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks.
Since nothing else is in play, I suspect there is an incorrect assumption
somewhere.
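
For example, a throwaway sketch against the 0.19 mapred API (IntWritable.hashCode() is
just the int value itself):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.HashPartitioner;

public class PartitionCheck {
  public static void main(String[] args) {
    HashPartitioner<IntWritable, Text> p = new HashPartitioner<IntWritable, Text>();
    int numReduces = 7;                       // whatever your job actually uses
    for (int key = 0; key < 20; key++) {
      // the value is unused by HashPartitioner, so null is fine here
      System.out.println(key + " -> reduce "
          + p.getPartition(new IntWritable(key), null, numReduces));
    }
  }
}

Comparing that output with the groups each reducer actually received should show where
the assumption breaks down.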


On Fri, Jun 12, 2009 at 11:25 AM, Zhengguo 'Mike' SUN zhengguo...@yahoo.com
 wrote:

 Hi,

 The intermediate key generated by my Mappers is IntWritable. I tested with
 different number of Reducers. When the number of Reducers is the same as the
 number of different keys of intermediate output. It partitions perfectly.
 Each Reducer receives one input group. When these two numbers are different,
 the partitioning function becomes difficult to understand. For example, when
 the number of keys is less than the number of Reducers, I am expecting that
 each Reducer at most receive one input group. But it turns out that many
 Reducers receive more than one input group. On the other hand, when the
 number of keys is larger than the number of Reducers, I am expecting that
 each Reducer at least receive one input group. But it turns out that some
 Reducers receive nothing to process. The expectation I had is from the
 implementation of HashPartitioner class, which just uses modulo operator
 with the number of Reducers to generate partitions.

 Anyone has any insights into this?








-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: HDFS data transfer!

2009-06-12 Thread jason hadoop
Also check the I/O wait time on your datanodes; if the I/O wait time is high,
you can't win.

On Fri, Jun 12, 2009 at 11:24 AM, Brian Bockelman bbock...@cse.unl.eduwrote:

 What's your replication factor?  What aggregate I/O rates do you see in
 Ganglia?  Is the I/O spikey, or has it plateaued?

 We can hit close to network rate (1Gbps) per node locally, and have pretty
 similar hardware.

 Brian


 On Jun 12, 2009, at 9:03 AM, Scott wrote:

  I ran the put command on 3 of the nodes simultaneously to copy files that
 were local on those machines into the hdfs.

 Brian Bockelman wrote:

 What'd you do for the tests?  Was it a single stream or a multiple stream
 test?

 Brian

 On Jun 12, 2009, at 6:48 AM, Scott wrote:

  So is ~ 1GB/minute transfer rate a reasonable performance benchmark?
  Our test cluster consists of 4 quad core xeon machines with 2 non-raided
 drives each.  My initial tests show a transfer rate of around 1GB/minute,
 and that was slower that I expected it to be.

 Thanks,
 Scott


 Brian Bockelman wrote:

 Hey Sugandha,

 Transfer rates depend on the quality/quantity of your hardware and the
 quality of your client disk that is generating the data.  I usually say 
 that
 you should expect near-hardware-bottleneck speeds for an otherwise idle
 cluster.

  There should be no "make it fast" required (though you should review
 the logs for errors if it's going slow).  I would expect a 5GB file to 
 take
 around 3-5 minutes to write on our cluster, but it's a well-tuned and
 operational cluster.

 As Todd (I think) mentioned before, we can't help any when you say I
 want to make it faster.  You need to provide diagnostic information - 
 logs,
 Ganglia plots, stack traces, something - that folks can look at.

 Brian

 On Jun 10, 2009, at 2:25 AM, Sugandha Naolekar wrote:

  But if I want to make it fast, then??? I want to place the data in
 HDFS and
 reoplicate it in fraction of seconds. Can that be possible. and How?

 On Wed, Jun 10, 2009 at 2:47 PM, kartik saxena kartik@gmail.com
 wrote:

  I would suppose about 2-3 hours. It took me some 2 days to load a 160
 Gb
 file.
 Secura

 On Wed, Jun 10, 2009 at 11:56 AM, Sugandha Naolekar
 sugandha@gmail.comwrote:It

  Hello!

 If I try to transfer a 5GB VDI file from a remote host(not a part of

 hadoop

 cluster) into HDFS, and get it back, how much time is it supposed to

 take?


 No map-reduce involved. Simply Writing files in and out from HDFS
 through

 a

 simple code of java (usage of API's).

 --
 Regards!
 Sugandha





 --
 Regards!
 Sugandha







-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Hadoop streaming - No room for reduce task error

2009-06-11 Thread jason hadoop
The reduce input will spill to disk during the sort unless the machine/JVM has a huge
allowed memory space, so if it is expected to be larger than the partition's free
space the task cannot run.
If I did my math correctly, you are trying to push ~2TB through the single
reduce.

as for the part- files, if you have the number of reduces set to zero,
you will get N part files, where N is the number of map tasks.

If you absolutely must have it all go to one reduce, you will need to
increase the free disk space. I think 19.1 preserves compression for the map
output, so you could try enabling compression for map output.

If you have many nodes, you can set the number of reduces to some number and
then use sort -M on the part files, to merge sort them, assuming your reduce
preserves ordering.

Try adding these parameters to your job line:
-D mapred.compress.map.output=true -D mapred.output.compression.type=BLOCK

BTW, /bin/cat works fine as an identity mapper or an identity reducer


On Wed, Jun 10, 2009 at 5:31 PM, Todd Lipcon t...@cloudera.com wrote:

 Hey Scott,
 It turns out that Alex's answer was mistaken - your error is actually
 coming
 from lack of disk space on the TT that has been assigned the reduce task.
 Specifically, there is not enough space in mapred.local.dir. You'll need to
 change your mapred.local.dir to point to a partition that has enough space
 to contain your reduce output.

 As for why this is the case, I hope someone will pipe up. It seems to me
 that reduce output can go directly to the target filesystem without using
 space on mapred.local.dir.

 Thanks
 -Todd

 On Wed, Jun 10, 2009 at 4:58 PM, Alex Loddengaard a...@cloudera.com
 wrote:

  What is mapred.child.ulimit set to?  This configuration options specifics
  how much memory child processes are allowed to have.  You may want to up
  this limit and see what happens.
 
  Let me know if that doesn't get you anywhere.
 
  Alex
 
  On Wed, Jun 10, 2009 at 9:40 AM, Scott skes...@weather.com wrote:
 
   Complete newby map/reduce question here.  I am using hadoop streaming
 as
  I
   come from a Perl background, and am trying to prototype/test a process
 to
   load/clean-up ad server log lines from multiple input files into one
  large
   file on the hdfs that can then be used as the source of a hive db
 table.
   I have a perl map script that reads an input line from stdin, does the
   needed cleanup/manipulation, and writes back to stdout.I don't
 really
   need a reduce step, as I don't care what order the lines are written
 in,
  and
   there is no summary data to produce.  When I run the job with -reducer
  NONE
   I get valid output, however I get multiple part-x files rather than
  one
   big file.
   So I wrote a trivial 'reduce' script that reads from stdin and simply
   splits the key/value, and writes the value back to stdout.
  
   I am executing the code as follows:
  
   ./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar -mapper
   /usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl -reducer
   /usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl -input
  /logs/*.log
   -output test9
  
   The code I have works when given a small set of input files.  However,
 I
   get the following error when attempting to run the code on a large set
 of
   input files:
  
   hadoop-hadoop-jobtracker-testdw0b00.log.2009-06-09:2009-06-09
  15:43:00,905
   WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task.
  Node
   tracker_testdw0b00:localhost.localdomain/127.0.0.1:53245 has
 2004049920
   bytes free; but we expect reduce input to take 22138478392
  
   I assume this is because the all the map output is being buffered in
  memory
   prior to running the reduce step?  If so, what can I change to stop the
   buffering?  I just need the map output to go directly to one large
 file.
  
   Thanks,
   Scott
  
  
 




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Large size Text file split

2009-06-10 Thread jason hadoop
There is always NLineInputFormat. You specify the number of lines per split.
The key is the position of the line start in the file, value is the line
itself.
The parameter mapred.line.input.format.linespermap controls the number of
lines per split
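
A rough sketch of the wiring on the 0.19 mapred API (the driver class, path and line
count below are placeholders):

JobConf conf = new JobConf(MyCsvJob.class);
conf.setInputFormat(NLineInputFormat.class);   // org.apache.hadoop.mapred.lib
conf.setInt("mapred.line.input.format.linespermap", 100000);
FileInputFormat.setInputPaths(conf, new Path("/data/big.csv"));
// Splits always end on line boundaries, so no CSV row is cut in half across maps.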

On Wed, Jun 10, 2009 at 5:27 AM, Harish Mallipeddi 
harish.mallipe...@gmail.com wrote:

 On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo wenrui@ericsson.com
 wrote:

  Hi, all
 
  I have a large csv file ( larger than 10 GB ), I'd like to use a certain
  InputFormat to split it into smaller part thus each Mapper can deal with
  piece of the csv file. However, as far as I know, FileInputFormat only
  cares about the byte size of the file; that is, the class can divide the csv
  file into many parts, and maybe some part is not a well-formed CSV file.
  For example, one line of the CSV file is not terminated with CRLF, or
  maybe some text is trimmed.
 
  How to ensure each FileSplit is a smaller valid CSV file using a proper
  InputFormat?
 
  BR/anderson
 

 If all you care about is the splits occurring at line boundaries, then
 TextInputFormat will work.

 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html

 If not I guess you can write your own InputFormat class.

 --
 Harish Mallipeddi
 http://blog.poundbang.in




-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


2009 Hadoop Summit West - was wonderful

2009-06-10 Thread jason hadoop
I had a great time, schmoozing with people, and enjoyed a couple of the talks

I would love to see more from Priya Narasimhan, hope their toolset for
automated fault detection in hadoop clusters becomes generally available.
Zookeeper rocks on!

Hbase is starting to look really good, in 0.20 the master node and the
single point of failure and configuration headache goes away and Zookeeper
takes over.

Owen O'Malley gave a solid presentation on the new Hadoop APIs and the
reasons for the changes.

It was good to hang with everyone, see you all next year!

I even got to spend a little time chatting with Tom White, and a signed copy
of his book, thanks Tom!


-- 
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: [ADV] Blatant marketing of the book Pro Hadoop. In honor of the 09 summit here is a 50% off coupon,

2009-06-09 Thread jason hadoop
I just sent a note to the publisher, hopefully they will fix it, especially
since I just printed up 100 flyers to give out at the hadoop summit!

On Tue, Jun 9, 2009 at 7:37 PM, Burt Beckwith b...@burtbeckwith.com wrote:

 Does this apply to the printed book or eBook? I tried it with the eBook but
 it didn't work: ERROR: The promotional code 'LUCKYYOU' does not exist.

 Burt

 On Tuesday 09 June 2009 10:15:24 pm jason hadoop wrote:
  In honor of the Hadoop Summit on June 10th(tomorrow), Apress has agreed
 to
  provide some conference swag, in the form of a 50% off coupon
 
  Purchase the book at http://eBookshop.apress.com and use code LUCKYYOU,
  for
  50% off the list price.
  The coupon has a short valid time so don't delay your purchase :)
 
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: [ADV] Blatant marketing of the book Pro Hadoop. In honor of the 09 summit here is a 50% off coupon,

2009-06-09 Thread jason hadoop
CORRECTED CODE, LUCKYOU

I misread the flyer.

On Tue, Jun 9, 2009 at 8:45 PM, jason hadoop jason.had...@gmail.com wrote:

 I just sent a note to the publisher, hopefully they will fix it, especially
 since I just printed up 100 flyers to give out at the hadoop summit!


 On Tue, Jun 9, 2009 at 7:37 PM, Burt Beckwith b...@burtbeckwith.comwrote:

 Does this apply to the printed book or eBook? I tried it with the eBook
 but it didn't work: ERROR: The promotional code 'LUCKYYOU' does not exist.

 Burt

 On Tuesday 09 June 2009 10:15:24 pm jason hadoop wrote:
  In honor of the Hadoop Summit on June 10th(tomorrow), Apress has agreed
 to
  provide some conference swag, in the form of a 50% off coupon
 
  Purchase the book at http://eBookshop.apress.com and use code LUCKYYOU,
  for
  50% off the list price.
  The coupon has a short valid time so don't delay your purchase :)
 
 




 --
 Alpha Chapters of my book on Hadoop are available
 http://www.apress.com/book/view/9781430219422
 www.prohadoopbook.com a community for Hadoop Professionals




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: What should I do to implements writable?

2009-06-09 Thread jason hadoop
A Writable basically needs to implement two methods:
  /**
   * Serialize the fields of this object to <code>out</code>.
   *
   * @param out <code>DataOutput</code> to serialize this object into.
   * @throws IOException
   */
  void write(DataOutput out) throws IOException;

  /**
   * Deserialize the fields of this object from <code>in</code>.
   *
   * <p>For efficiency, implementations should attempt to re-use storage in the
   * existing object where possible.</p>
   *
   * @param in <code>DataInput</code> to deserialize this object from.
   * @throws IOException
   */
  void readFields(DataInput in) throws IOException;

These use the serialization primitives to pack and unpack the object.

These are from the hadoop Text class in 0.19.1
  /** deserialize
   */
  public void readFields(DataInput in) throws IOException {
int newLength = WritableUtils.readVInt(in);
setCapacity(newLength, false);
in.readFully(bytes, 0, newLength);
length = newLength;
  }

  /** serialize
   * write this object to out
   * length uses zero-compressed encoding
   * @see Writable#write(DataOutput)
   */
  public void write(DataOutput out) throws IOException {
WritableUtils.writeVInt(out, length);
out.write(bytes, 0, length);
  }

If you look at the various implementors of Writable, you will find plenty of
examples in the source tree.
For very complex objects, the simplest thing to do is to serialize them to
an ObjectOutputStream that is backed by a ByteArrayOutputStream then use the
write/read byte array method on the DataOutput/DataInput

In my code I check to see if in/out implement the ObjectXStream and
use it directly, or use an intermediate byte array.
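
For the TreeSet case in the original question, a minimal sketch, assuming the set holds
Strings (swap the element read/write for whatever the set really contains):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.TreeSet;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableUtils;

public class TreeSetWritable implements Writable {
  private TreeSet<String> set = new TreeSet<String>();

  public TreeSet<String> get() { return set; }

  public void write(DataOutput out) throws IOException {
    WritableUtils.writeVInt(out, set.size());   // element count first
    for (String s : set) {
      Text.writeString(out, s);                 // then each element
    }
  }

  public void readFields(DataInput in) throws IOException {
    set.clear();                                // re-use the existing object
    int size = WritableUtils.readVInt(in);
    for (int i = 0; i < size; i++) {
      set.add(Text.readString(in));
    }
  }
}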





On Tue, Jun 9, 2009 at 9:23 AM, HRoger hanxianyongro...@163.com wrote:


 hello:
 I am writing a class containing a TreeSet that implements Writable; how should I
 implement the two methods?
 Any help would be appreciated!
 --
 View this message in context:
 http://www.nabble.com/What-should-I-do-to-implements-writable--tp23946397p23946397.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Map-Reduce!

2009-06-08 Thread jason hadoop
A very common one is processing large quantities of log files and producing
summary data.
Another use is simply as a way of distributing large jobs across multiple
computers.
In a previous job, we used Map/Reduce for distributed bulk web crawling, and
for distributed media file processing.

On Mon, Jun 8, 2009 at 3:54 AM, Sugandha Naolekar sugandha@gmail.comwrote:

 Hello!

 As far as I have read the forums of Map-reduce, it is basically used to
 process large amount of data speedily. right??

 But, can you please give me some instances or examples wherein, I can use
 map-reduce..???



 --
 Regards!
 Sugandha




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Is there any way to debug the hadoop job in eclipse

2009-06-06 Thread jason hadoop
The chapters are available for download now.

On Sat, Jun 6, 2009 at 3:33 AM, zhang jianfeng zjf...@gmail.com wrote:

 Is there any resource on internet that I can get as soon as possible ?



 On Fri, Jun 5, 2009 at 6:43 PM, jason hadoop jason.had...@gmail.com
 wrote:

   chapter 7 of my book goes into details of how to debug with eclipse
 
  On Fri, Jun 5, 2009 at 3:40 AM, zhang jianfeng zjf...@gmail.com wrote:
 
   Hi all,
  
   Some jobs I submit to hadoop failed, but I can not see what's the
  problem.
   So is there any way to debug the hadoop job in eclipse, such as the
  remote
   debug.
  
    or other ways to find the reason the job failed. I did not find enough
   information in the job tracker.
  
   Thank you.
  
   Jeff Zhang
  
 
 
 
  --
  Alpha Chapters of my book on Hadoop are available
  http://www.apress.com/book/view/9781430219422
  www.prohadoopbook.com a community for Hadoop Professionals
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Is there any way to debug the hadoop job in eclipse

2009-06-05 Thread jason hadoop
chapter 7 of my book goes into details of how to debug with eclipse

On Fri, Jun 5, 2009 at 3:40 AM, zhang jianfeng zjf...@gmail.com wrote:

 Hi all,

 Some jobs I submit to hadoop failed, but I can not see what's the problem.
 So is there any way to debug the hadoop job in eclipse, such as the remote
 debug.

 or other ways to find the reason the job failed. I did not find enough
 information in the job tracker.

 Thank you.

 Jeff Zhang




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Task files in _temporary not getting promoted out

2009-06-04 Thread jason hadoop
Are your tasks failing or completing successfully? Failed tasks have the
output directory wiped; only successfully completed tasks have their files
moved up.

I don't recall if the FileOutputCommitter class appeared in 0.18
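
For reference, the pattern that should survive promotion looks roughly like this sketch
(0.18 mapred API; indexDir is a field of the reducer and the naming is arbitrary).
Anything created under getWorkOutputPath for a successful attempt is moved up into the
job output directory, so if the directories are created and still vanish, that points at
the attempts being marked failed or killed:

public void configure(JobConf conf) {
  try {
    Path work = FileOutputFormat.getWorkOutputPath(conf);  // under .../_temporary
    FileSystem fs = work.getFileSystem(conf);
    indexDir = new Path(work, "index-" + conf.get("mapred.task.partition"));
    fs.mkdirs(indexDir);   // write the Lucene index into this directory
  } catch (IOException e) {
    throw new RuntimeException(e);
  }
}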


On Wed, Jun 3, 2009 at 6:43 PM, Ian Soboroff ian.sobor...@nist.gov wrote:

 Ok, help.  I am trying to create local task outputs in my reduce job, and
 they get created, then go poof when the job's done.

 My first take was to use FileOutputFormat.getWorkOutputPath, and create
 directories in there for my outputs (which are Lucene indexes).
  Exasperated, I then wrote a small OutputFormat/RecordWriter pair to write
 the indexes.  In each case, I can see directories being created in
 attempt_foo/_temporary, but when the task is over they're gone.

  I've stared at TextOutputFormat and I can't figure out why its files
 survive and mine don't.  Help!  Again, this is 0.18.3.

 Thanks,
 Ian




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: problem getting map input filename

2009-06-02 Thread jason hadoop
you can always dump the entire property space and work it out that way.
I haven't used the 0.20 api's yet so I can't speak to them

On Tue, Jun 2, 2009 at 10:52 AM, Rares Vernica rvern...@gmail.com wrote:

 On 6/2/09, randy...@comcast.net randy...@comcast.net wrote:
 
  Your Map class needs to have a configure method that can access the
 JobConf.
  Like this:

 In Hadoop 0.20 the new Mapper class does no longer have configure
 and JobConf has been replaced with Job. In the Mapper methods, you now
 get a Context object.

  public void configure(JobConf conf)
  {
  System.out.println(conf.get("map.input.file"));
  }




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Reduce() time takes ~4x Map()

2009-05-28 Thread jason hadoop
At the minimal level, enable map output compression (mapred.compress.map.output); it
may make some difference.
Sorting is very expensive when there are many keys and the values are large.
Are you quite certain your keys are unique?
Also, do you need them sorted by document id?
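
On the JobConf that is roughly the two lines below (GzipCodec is just one choice;
DefaultCodec, or LZO where it is installed, also work):

conf.setCompressMapOutput(true);                    // mapred.compress.map.output
conf.setMapOutputCompressorClass(GzipCodec.class);  // org.apache.hadoop.io.compress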


On Thu, May 28, 2009 at 8:51 PM, Jothi Padmanabhan joth...@yahoo-inc.comwrote:

 Hi David,

 If you go to JobTrackerHistory and then click on this job and then do
 Analyse This Job, you should be able to get the split up timings for the
 individual phases of the map and reduce tasks, including the average, best
 and worst times. Could you provide those numbers so that we can get a
 better
 idea of how the job progressed.

 Jothi


 On 5/28/09 10:11 PM, David Batista dsbati...@xldb.di.fc.ul.pt wrote:

  Hi everyone,
 
  I'm processing XML files, around 500MB each with several documents,
  for the map() function I pass a document from the XML file, which
  takes some time to process depending on the size - I'm applying NER to
  texts.
 
  Each document has a unique identifier, so I'm using that identifier as
  a key and the results of parsing the document in one string as the
  output:
 
  so at the end of  the map function():
  output.collect( new Text(identifier), new Text(outputString));
 
  usually the outputString is around 1k-5k size
 
  reduce():
   public void reduce(Text key, Iterator<Text> values,
   OutputCollector<Text, Text> output, Reporter reporter) {
  while (values.hasNext()) {
  Text text = values.next();
  try {
  output.collect(key, text);
  } catch (IOException e) {
  // TODO Auto-generated catch block
  e.printStackTrace();
  }
  }
  }
 
  I did a test using only 1 machine with 8 cores, and only 1 XML file,
  it took around 3 hours to process all maps and  ~12hours for the
  reduces!
 
  the XML file has 139 945 documents
 
  I set the jobconf for 1000 maps() and 200 reduces()
 
  I did take a look at the graphs on the web interface during the reduce
  phase, and indeed its the copy phase that's taking much of the time,
  the sort and reduce phase are done almost instantly.
 
  Why does the copy phase takes so long? I understand that the copies
  are made using HTTP, and the data was in really small chunks 1k-5k
  size, but even so, being everything in the same physical machine
  should have been faster, no?
 
  Any suggestions on what might be causing the copies in reduce to take so
 long?
  --
  ./david




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Efficient algorithm for many-to-many reduce-side join?

2009-05-28 Thread jason hadoop
Use the map-side join stuff; if I understand your problem, it provides a good
solution but requires getting over the learning hurdle.
Well described in chapter 8 of my book :)
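
The wiring is roughly the sketch below (not the book's example; the class and paths are
placeholders). Both inputs must already be sorted on the join key and partitioned
identically, i.e. produced with the same partitioner and the same number of reducers:

JobConf conf = new JobConf(ManyToManyJoin.class);
conf.setInputFormat(CompositeInputFormat.class);    // org.apache.hadoop.mapred.join
conf.set("mapred.join.expr", CompositeInputFormat.compose(
    "inner", KeyValueTextInputFormat.class,
    new Path("/data/tableA"), new Path("/data/tableB")));
conf.setNumReduceTasks(0);
// The mapper then receives the join key plus a TupleWritable holding the matching
// records from each source, with no reduce-side buffering required.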


On Thu, May 28, 2009 at 8:29 AM, Chris K Wensel ch...@wensel.net wrote:

 I believe PIG, and I know Cascading use a kind of 'spillable' list that can
 be re-iterated across. PIG's version is a bit more sophisticated last I
 looked.

 that said, if you were using either one of them, you wouldn't need to write
 your own many-to-many join.

 cheers,
 ckw


 On May 28, 2009, at 8:14 AM, Todd Lipcon wrote:

  One last possible trick to consider:

 If you were to subclass SequenceFileRecordReader, you'd have access to its
 seek method, allowing you to rewind the reducer input. You could then
 implement a block hash join with something like the following pseudocode:

  ahash = new HashMap<Key, Val>();
 while (i have ram available) {
  read a record
  if the record is from table B, break
  put the record into ahash
 }
 nextAPos = reader.getPos()

 while (current record is an A record) {
  skip to next record
 }
 firstBPos = reader.getPos()

 while (current record has current key) {
  read and join against ahash
  process joined result
 }

 if firstBPos  nextAPos {
  seek(nextAPos)
  go back to top
 }


 On Thu, May 28, 2009 at 8:05 AM, Todd Lipcon t...@cloudera.com wrote:

  Hi Stuart,

 It seems to me like you have a few options.

 Option 1: Just use a lot of RAM. Unless you really expect many millions
 of
 entries on both sides of the join, you might be able to get away with
 buffering despite its inefficiency.

 Option 2: Use LocalDirAllocator to find some local storage to spill all
 of
 the left table's records to disk in a MapFile format. Then as you iterate
 over the right table, do lookups in the MapFile. This is really the same
 as
 option 1, except that you're using disk as an extension of RAM.

 Option 3: Convert this to a map-side merge join. Basically what you need
 to
 do is sort both tables by the join key, and partition them with the same
 partitioner into the same number of columns. This way you have an equal
 number of part-N files for both tables, and within each part-N
 file
 they're ordered by join key. In each map task, you open both
 tableA/part-N
 and tableB/part-N and do a sequential merge to perform the join. I
 believe
 the CompositeInputFormat class helps with this, though I've never used
 it.

 Option 4: Perform the join in several passes. Whichever table is smaller,
 break into pieces that fit in RAM. Unless my relational algebra is off, A
 JOIN B = A JOIN (B1 UNION B2) = (A JOIN B1 UNION A JOIN B2) if B = B1
 UNION
 B2.

 Hope that helps
 -Todd


 On Thu, May 28, 2009 at 5:02 AM, Stuart White stuart.whi...@gmail.com
 wrote:

  I need to do a reduce-side join of two datasets.  It's a many-to-many
 join; that is, each dataset can can multiple records with any given
 key.

 Every description of a reduce-side join I've seen involves
 constructing your keys out of your mapper such that records from one
 dataset will be presented to the reducers before records from the
 second dataset.  I should hold on to the value from the one dataset
 and remember it as I iterate across the values from the second
 dataset.

 This seems like it only works well for one-to-many joins (when one of
 your datasets will only have a single record with any given key).
 This scales well because you're only remembering one value.

 In a many-to-many join, if you apply this same algorithm, you'll need
 to remember all values from one dataset, which of course will be
 problematic (and won't scale) when dealing with large datasets with
 large numbers of records with the same keys.

 Does an efficient algorithm exist for a many-to-many reduce-side join?




 --
 Chris K Wensel
 ch...@concurrentinc.com
 http://www.concurrentinc.com




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: avoid custom crawler getting blocked

2009-05-27 Thread jason hadoop
Random ordering helps, and per-thread delays based on domain recency also
help.

On Wed, May 27, 2009 at 6:47 AM, Ken Krugler kkrugler_li...@transpac.comwrote:

 My current project is to gather stats from a lot of different documents.
 We're are not indexing just getting quite specific stats for each
 document.
 We gather 12 different stats from each document.

 Our requirements have changed somewhat now, originally it was working on
 documents from our own servers but now it needs to fetch other ones from
 quite a large variety of sources.

 My approach up to now was to have the map function simply take each
 filepath
 (or now URL) in turn, fetch the document, calculate the stats and output
 those stats.

 My new problem is some of the locations we are now visiting don't like
 their
 IP being hit multiple times in a row.

 Is it possible to check a URL against a visited list of IPs and if
 recently
 visited either wait for a certain amount of time or push it back onto the
 input stack so it will be processed later in the queue?

 Or is there a better way?


 Your use case is very similar to what we've been doing with Bixo. See
 http://bixo.101tec.com, and also
 http://bixo.101tec.com/wp-content/uploads/2009/05/bixo-intro.pdf

 Short answer is that we group URLs by paid-level domain in a map (actually
 using a Cascading GroupBy operation), and use per-domain queues with
 multi-threaded fetchers to efficiently load pages in a reduce (a Cascading
 Buffer operation).

 -- Ken
 --
 Ken Krugler
 +1 530-210-6378




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Specifying NameNode externally to hadoop-site.xml

2009-05-25 Thread jason hadoop
if you launch your jobs via bin/hadoop jar jar_file [main class]  [options]

you can simply specify -fs hdfs://host:port before the jar_file

On Sun, May 24, 2009 at 3:02 PM, Stas Oskin stas.os...@gmail.com wrote:

 Hi.

 I'm looking to move the Hadoop NameNode URL outside the hadoop-site.xml
 file, so I could set it at the run-time.

 Any idea how to do it?

 Or perhaps there is another configuration that can be applied to the
 FileSystem object?

 Regards.




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Specifying NameNode externally to hadoop-site.xml

2009-05-25 Thread jason hadoop
conf.set("fs.default.name", "hdfs://host:port");
where conf is the JobConf object of your job, before you submit it.
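
Or, if it is only a FileSystem handle you need rather than a whole job, something like
this sketch (the host:port is a placeholder):

Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://namenode.example.com:9000");
FileSystem fs = FileSystem.get(conf);
// or leave fs.default.name alone and name the filesystem explicitly:
FileSystem fs2 = FileSystem.get(URI.create("hdfs://namenode.example.com:9000"), conf);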


On Mon, May 25, 2009 at 10:16 AM, Stas Oskin stas.os...@gmail.com wrote:

 Hi.

 Thanks for the tip, but is it possible to set this in dynamic way via code?

 Thanks.

 2009/5/25 jason hadoop jason.had...@gmail.com

  if you launch your jobs via bin/hadoop jar jar_file [main class]
  [options]
 
  you can simply specify -fs hdfs://host:port before the jar_file
 
  On Sun, May 24, 2009 at 3:02 PM, Stas Oskin stas.os...@gmail.com
 wrote:
 
   Hi.
  
   I'm looking to move the Hadoop NameNode URL outside the hadoop-site.xml
   file, so I could set it at the run-time.
  
   Any idea how to do it?
  
   Or perhaps there is another configuration that can be applied to the
   FileSystem object?
  
   Regards.
  
 
 
 
  --
  Alpha Chapters of my book on Hadoop are available
  http://www.apress.com/book/view/9781430219422
  www.prohadoopbook.com a community for Hadoop Professionals
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Circumventing Hadoop's data placement policy

2009-05-23 Thread jason hadoop
Can you give your machines multiple IP addresses, and bind the grid server
to a different IP than the datanode?
With Solaris you could put it in a different zone.

On Sat, May 23, 2009 at 10:13 AM, Brian Bockelman bbock...@math.unl.eduwrote:

 Hey all,

 Had a problem I wanted to ask advice on.  The Caltech site I work with
 currently have a few GridFTP servers which are on the same physical machines
 as the Hadoop datanodes, and a few that aren't.  The GridFTP server has a
 libhdfs backend which writes incoming network data into HDFS.

 They've found that the GridFTP servers which are co-located with HDFS
 datanode have poor performance because data is incoming at a much faster
 rate than the HDD can handle.  The standalone GridFTP servers, however, push
  data out to multiple nodes at once, and can handle the incoming data just
 fine (200MB/s).

 Is there any way to turn off the preference for the local node?  Can anyone
 think of a good workaround to trick HDFS into thinking the client isn't on
 the same node?

 Brian




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Could only be replicated to 0 nodes, instead of 1

2009-05-21 Thread jason hadoop
It does not appear that any datanodes have connected to your namenode.
on the datanode machines, look in the hadoop logs directory at the datanode
log files.
There should be some information there that helps you diagnose the problem.

chapter 4 of my book provides some detail on working with this problem

On Thu, May 21, 2009 at 4:29 AM, ashish pareek pareek...@gmail.com wrote:

 Hi ,

 I have two suggestions:

  i) Choose a right version (Hadoop 0.18 is good).
  ii) Replication should be 3 as you have 3 nodes. (Indirectly, see to it that
  your configuration is correct!)

  Hey, I am just suggesting this as I am also new to Hadoop.

 Ashish Pareek


 On Thu, May 21, 2009 at 2:41 PM, Stas Oskin stas.os...@gmail.com wrote:

  Hi.
 
  I'm testing Hadoop in our lab, and started getting the following message
  when trying to copy a file:
  Could only be replicated to 0 nodes, instead of 1
 
  I have the following setup:
 
  * 3 machines, 2 of them with only 80GB of space, and 1 with 1.5GB
  * Two clients are copying files all the time (one of them is the 1.5GB
  machine)
  * The replication is set on 2
  * I let the space on 2 smaller machines to end, to test the behavior
 
  Now, one of the clients (the one located on 1.5GB) works fine, and the
  other
  one - the external, unable to copy and displays the error + the exception
  below
 
  Any idea if this expected on my scenario? Or how it can be solved?
 
  Thanks in advance.
 
 
 
  09/05/21 10:51:03 WARN dfs.DFSClient: NotReplicatedYetException sleeping
  /test/test.bin retries left 1
 
  09/05/21 10:51:06 WARN dfs.DFSClient: DataStreamer Exception:
  org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
  /test/test.bin could only be replicated to 0 nodes, instead of 1
 
 at
 
 
 org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1123
  )
 
 at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:330)
 
 at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
 
 at
 
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25
  )
 
 at java.lang.reflect.Method.invoke(Method.java:597)
 
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
 
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:890)
 
 
 
 at org.apache.hadoop.ipc.Client.call(Client.java:716)
 
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
 
 at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
 
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 
 at
 
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39
  )
 
 at
 
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25
  )
 
 at java.lang.reflect.Method.invoke(Method.java:597)
 
 at
 
 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82
  )
 
 at
 
 
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59
  )
 
 at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
 
 at
 
 
 org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2450
  )
 
 at
 
 
 org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2333
  )
 
 at
 
 
 org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1745
  )
 
 at
 
 
 org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1922
  )
 
 
 
  09/05/21 10:51:06 WARN dfs.DFSClient: Error Recovery for block null bad
  datanode[0]
 
  java.io.IOException: Could not get block locations. Aborting...
 
 at
 
 
 org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2153
  )
 
 at
 
 
 org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1745
  )
 
 at
 
 
 org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1899
  )
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Multipleoutput file

2009-05-21 Thread jason hadoop
setInputPaths will take an array, or variable arguments.
Or you can simply provide the directory that the individual files reside in,
and the individual files will be added.

If there are other files in the directory, you may need to specify a custom
input path filter via FileInputFormat.setInputPathFilter.
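
If it helps, a minimal sketch of both variants in the old JobConf API; the
directory and MAPPINGOUTPUT-r-* names below are just the ones from this thread,
used as placeholders:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class InputPathSetup {
  public static void configure(JobConf job) {
    // Variant 1: list the individual files explicitly.
    FileInputFormat.setInputPaths(job,
        new Path("prevJobOutput/MAPPINGOUTPUT-r-00001"),
        new Path("prevJobOutput/MAPPINGOUTPUT-r-00002"));

    // Variant 2: point at the whole directory and keep only the files
    // you care about via a path filter.
    FileInputFormat.setInputPaths(job, new Path("prevJobOutput"));
    FileInputFormat.setInputPathFilter(job, MappingOutputFilter.class);
  }

  // Path filters need a no-argument constructor.
  public static class MappingOutputFilter implements PathFilter {
    public boolean accept(Path path) {
      return path.getName().startsWith("MAPPINGOUTPUT");
    }
  }
}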


2009/5/21 皮皮 pi.bingf...@gmail.com

 yes, but how can I get the commaSeperatedPaths? I can't specify them by
 hand.

 It's not practical to do it like this:

 commaSeperatedPaths_1 = MAPPINGOUTPUT-r-1;
 commaSeperatedPaths_2 = MAPPINGOUTPUT-r-2;

 FileInputFormat.setInputPaths(job, commaSeperatedPaths_1);
 FileInputFormat.setInputPaths(job, commaSeperatedPaths_2);



 2009/4/7 Brian MacKay brian.mac...@medecision.com

 
  Not sure about your question:  seems like you'd like to do this...?
 
  After you run job, your output may be like MAPPINGOUTPUT-r-1,
  MAPPINGOUTPUT-r-2, etc.
 
  You'd need to set them as multiple inputs.
 
  FileInputFormat.setInputPaths(job, commaSeperatedPaths);
 
 
  Brian
 
  -Original Message-
  From: 皮皮 [mailto:pi.bingf...@gmail.com]
  Sent: Tuesday, April 07, 2009 3:30 AM
  To: core-user@hadoop.apache.org
  Subject: Re: Multiple k,v pairs from a single map - possible?
 
  could any body tell me how to get one of the multipleoutput file in
 another
  jobconfig?
 
  2009/4/3 皮皮 pi.bingf...@gmail.com
 
   thank you very much . this is what i am looking for.
  
   2009/3/27 Brian MacKay brian.mac...@medecision.com
  
  
   Amandeep,
  
   Add this to your driver.
  
    MultipleOutputs.addNamedOutput(conf, "PHONE", TextOutputFormat.class,
        Text.class, Text.class);
   
    MultipleOutputs.addNamedOutput(conf, "NAME", TextOutputFormat.class,
        Text.class, Text.class);
   
   
   
    And in your reducer
   
    private MultipleOutputs mos;
   
    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter) {
   
        // namedOutPut = either "PHONE" or "NAME"
   
        while (values.hasNext()) {
            String value = values.next().toString();
            mos.getCollector(namedOutPut, reporter).collect(
                new Text(value), new Text(othervals));
        }
    }
   
    @Override
    public void configure(JobConf conf) {
        super.configure(conf);
        mos = new MultipleOutputs(conf);
    }
   
    public void close() throws IOException {
        mos.close();
    }
  
  
  
   By the way, have you had a change to post your Oracle fix to
   DBInputFormat ?
   If so, what is the Jira tag #?
  
   Brian
  
   -Original Message-
   From: Amandeep Khurana [mailto:ama...@gmail.com]
   Sent: Friday, March 27, 2009 5:46 AM
   To: core-user@hadoop.apache.org
   Subject: Multiple k,v pairs from a single map - possible?
  
   Is it possible to output multiple key value pairs from a single map
   function
   run?
  
   For example, the mapper outputing name,phone and name, address
   simultaneously...
  
   Can I write multiple output.collect(...) commands?
  
   Amandeep
  
   Amandeep Khurana
   Computer Science Graduate Student
   University of California, Santa Cruz
  
  
  
  
  
 
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Randomize input file?

2009-05-21 Thread jason hadoop
The last time I had to do something like this, in the map phase I made the
key an effectively random value (the MD5 of the original key) and
built a new value that had the real key embedded.

Then in the reduce phase I received the records in random order and could do
what I wanted.
By using a stable but differently sorting value for the key, my reduce still
grouped correctly but I received the calls to reduce in a random order
compared to the normal sort order of the data.
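
A rough sketch of that idea in the old JobConf API, applied to John's
line-shuffling case (here the MD5 of the line serves as the stable but
randomly sorting key and the whole line is the value; job wiring is omitted):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MD5Hash;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ShuffleLines {

  // Emit an effectively random key so the shuffle sort order has nothing
  // to do with the input order; keep the original line as the value.
  public static class RandomKeyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      out.collect(new Text(MD5Hash.digest(line.toString()).toString()), line);
    }
  }

  // Drop the synthetic key and write the lines back out in shuffled order.
  public static class DropKeyReducer extends MapReduceBase
      implements Reducer<Text, Text, NullWritable, Text> {
    public void reduce(Text key, Iterator<Text> values,
        OutputCollector<NullWritable, Text> out, Reporter reporter)
        throws IOException {
      while (values.hasNext()) {
        out.collect(NullWritable.get(), values.next());
      }
    }
  }
}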



On Thu, May 21, 2009 at 12:25 PM, Alex Loddengaard a...@cloudera.comwrote:

 Bhupesh,

 I forgot to say that the concatenation phase of my plan would concatenate
 randomly.  As I mentioned, this wouldn't be a good way to randomize, but
 it'd be pretty easy.

 Anyway, your solution is much more clever and does a better job
 randomizing.  Good thinking!

 Thanks,

 Alex

 On Thu, May 21, 2009 at 11:36 AM, Bhupesh Bansal bban...@linkedin.com
 wrote:

  Hmm ,
 
  IMHO running a mapper-only job will give you an output file
  with the same order. You should write a custom map-reduce job
  where the map emits (key: Integer.random(), value: line)
  and the reducer outputs (key: nothing, value: line).
 
  Reducer will sort on Integer.random() giving you a random ordering for
 your
  input file.
 
  Best
  Bhupesh
 
 
  On 5/21/09 11:15 AM, Alex Loddengaard a...@cloudera.com wrote:
 
   Hi John,
  
   I don't know of a built-in way to do this.  Depending on how well you
  want
   to randomize, you could just run a MapReduce job with at least one map
  (the
   more maps, the more random) and no reduces.  When you run a job with no
   reduces, the shuffle phase is skipped entirely, and the intermediate
  outputs
   from the mappers are stored directly to HDFS.  Though I think each
 mapper
   will create one HDFS file, so you'll have to concatenate all files into
 a
   single file.
  
   The above isn't a very good way to randomize, but it's fairly easy to
   implement and should run pretty quickly.
  
   Hope this helps.
  
   Alex
  
   On Thu, May 21, 2009 at 7:18 AM, John Clarke clarke...@gmail.com
  wrote:
  
   Hi,
  
   I have a need to randomize my input file before processing. I
 understand
  I
   can chain Hadoop jobs together so the first could take the input file
   randomize it and then the second could take the randomized file and do
  the
   processing.
  
   The input file has one entry per line and I want to mix up the lines
  before
   the main processing.
  
   Is there an inbuilt ability I have missed or will I have to try and
  write a
   Hadoop program to shuffle my input file?
  
   Cheers,
   John
  
 
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Optimal Filesystem (and Settings) for HDFS

2009-05-19 Thread jason hadoop
I always disable atime and its ilk.
The deadline scheduler helps with the (non-xfs-hanging) du datanode timeout
issues, but not much.

Ultimately that is a caching failure in the kernel, due to the Hadoop I/O
patterns.

Anshu, any luck getting off the PAE kernels? Is this the xfs lockup, or just
the du taking too long?

At one point, Sagar and I talked about replacing the du call with a script
that used df as a rapid and close proxy, to get rid of the du calls; the
block report was another problem.

On Tue, May 19, 2009 at 3:59 PM, Anshuman Sachdeva asachd...@attributor.com
 wrote:

 Hi Brian,
 thanks for the mail. I have an issue when we use xfs. Hadoop runs
 du -sk every 10 minutes on my cluster and sometimes it goes into a loop
 and the machine hangs. Have you seen this issue, or is it only me?

 I'd really appreciate it if someone could shed some light on this.


 Anshuman
 - Original Message -
 From: Bryan Duxbury br...@rapleaf.com
 To: core-user@hadoop.apache.org
 Sent: Tuesday, May 19, 2009 2:50:57 PM GMT -08:00 US/Canada Pacific
 Subject: Re: Optimal Filesystem (and Settings) for HDFS

 We use XFS for our data drives, and we've had somewhat mixed results.
 One of the biggest pros is that XFS has more free space than ext3,
 even with the reserved space settings turned all the way to 0.
 Another is that you can format a 1TB drive as XFS in about 0 seconds,
 versus minutes for ext3. This makes it really fast to kickstart our
 worker nodes.

 We have seen some weird stuff happen though when machines run out of
 memory, apparently because the XFS driver does something odd with
 kernel memory. When this happens, we end up having to do some fscking
 before we can get that node back online.

 As far as outright performance, I actually *did* do some tests of xfs
 vs ext3 performance on our cluster. If you just look at a single
 machine's local disk speed, you can write and read noticeably faster
 when using XFS instead of ext3. However, the reality is that this
 extra disk performance won't have much of an effect on your overall
 job completion performance, since you will find yourself network
 bottlenecked well in advance of even ext3's performance.

 The long and short of it is that we use XFS to speed up our new
 machine deployment, and that's it.

 -Bryan

 On May 18, 2009, at 10:31 AM, Alex Loddengaard wrote:

  I believe Yahoo! uses ext3, though I know other people have said
  that XFS
  has performed better in various benchmarks.  We use ext3, though we
  haven't
  done any benchmarks to prove its worth.
 
  This question has come up a lot, so I think it'd be worth doing a
  benchmark
  and writing up the results.  I haven't been able to find a detailed
  analysis
  / benchmark writeup comparing various filesystems, unfortunately.
 
  Hope this helps,
 
  Alex
 
  On Mon, May 18, 2009 at 8:54 AM, Bob Schulze
  b.schu...@ecircle.com wrote:
 
  We are currently rebuilding our cluster - has anybody
  recommendations on
  the underlaying file system? Just standard Ext3?
 
  I could imagine that the block size could be larger than its
  default...
 
  Thx for any tips,
 
 Bob
 
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: FSDataOutputStream flush() not working?

2009-05-17 Thread jason hadoop
When you create a file you have these options, including bufferSize and blockSize:
  /**
   * Opens an FSDataOutputStream at the indicated Path with write-progress
   * reporting.
   * @param f the file name to open
   * @param permission
   * @param overwrite if a file with this name already exists, then if true,
   *   the file will be overwritten, and if false an error will be thrown.
   * @param bufferSize the size of the buffer to be used.
   * @param replication required block replication for the file.
   * @param blockSize
   * @param progress
   * @throws IOException
   * @see #setPermission(Path, FsPermission)
   */
  public abstract FSDataOutputStream create(Path f,
  FsPermission permission,
  boolean overwrite,
  int bufferSize,
  short replication,
  long blockSize,
  Progressable progress) throws IOException;
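
For reference, a hedged usage sketch of that overload; the path, replication
factor and 32 MB block size below are arbitrary example values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class CreateWithBlockSize {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(
        new Path("/tmp/example.dat"),
        FsPermission.getDefault(),
        true,                                     // overwrite if it exists
        conf.getInt("io.file.buffer.size", 4096), // buffer size
        (short) 3,                                // replication
        32L * 1024 * 1024,                        // 32 MB block size
        null);                                    // no progress callback
    out.writeBytes("hello\n");
    out.close();  // close() flushes the remaining data to the datanodes
  }
}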

On Fri, May 15, 2009 at 12:44 PM, Sasha Dolgy sdo...@gmail.com wrote:

 Hi Todd,
 Reading through the JIRA, my impression is that data will be written out to
 hdfs only once it has reached a certain size in the buffer.  Is it possible
 to define the size of that buffer?  Or is this a future enhancement?

 -sasha

 On Fri, May 15, 2009 at 6:14 PM, Todd Lipcon t...@cloudera.com wrote:

  Hi Sasha,
 
  What version are you running? Up until very recent versions, sync() was
 not
  implemented. Even in the newest releases, sync isn't completely finished,
  and you may find unreliable behavior.
 
  For now, if you need this kind of behavior, your best bet is to close
 each
  file and then open the next every N minutes. For example, if you're
  processing logs every 5 minutes, simply close log file log.00223 and
 round
  robin to log.00224 right before you need the data to be available to
  readers. If you're collecting data at a low rate, these files may end up
  being rather small, and you should probably look into doing merges on the
  hour/day/etc to avoid small-file proliferation.
 
  If you want to track the work being done around append and sync, check
 out
  HADOOP-5744 and the issues referenced therein:
 
  http://issues.apache.org/jira/browse/HADOOP-5744
 
  Hope that helps,
  -Todd
 
  On Fri, May 15, 2009 at 6:35 AM, Sasha Dolgy sdo...@gmail.com wrote:
 
   Hi there, forgive the repost:
  
   Right now data is received in parallel and is written to a queue, then
 a
   single thread reads the queue and writes those messages to a
   FSDataOutputStream which is kept open, but the messages never get
  flushed.
Tried flush() and sync() with no joy.
  
   1.
   outputStream.writeBytes(rawMessage.toString());
  
   2.
  
   log.debug("Flushing stream, size = " + s.getOutputStream().size());
   s.getOutputStream().sync();
   log.debug("Flushed stream, size = " + s.getOutputStream().size());
  
   or
  
   log.debug("Flushing stream, size = " + s.getOutputStream().size());
   s.getOutputStream().flush();
   log.debug("Flushed stream, size = " + s.getOutputStream().size());
  
   The size() remains the same after performing this action.
  
   2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:28)
   hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
   2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:49)
   hdfs.HdfsQueueConsumer: Re-using existing stream
   2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:63)
   hdfs.HdfsQueueConsumer: Flushing stream, size = 1986
   2009-05-12 12:42:17,472 DEBUG [Thread-7] (DFSClient.java:3013)
   hdfs.DFSClient: DFSClient flush() : saveOffset 1613 bytesCurBlock 1986
   lastFlushOffset 1731
   2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:66)
   hdfs.HdfsQueueConsumer: Flushed stream, size = 1986
   2009-05-12 12:42:19,586 DEBUG [Thread-7] (HdfsQueueConsumer.java:39)
   hdfs.HdfsQueueConsumer: Consumer writing event
   2009-05-12 12:42:19,587 DEBUG [Thread-7] (FSStreamManager.java:28)
   hdfs.HdfsQueueConsumer: Thread 19 getting an output stream
   2009-05-12 12:42:19,588 DEBUG [Thread-7] (FSStreamManager.java:49)
   hdfs.HdfsQueueConsumer: Re-using existing stream
   2009-05-12 12:42:19,589 DEBUG [Thread-7] (FSStreamManager.java:63)
   hdfs.HdfsQueueConsumer: Flushing stream, size = 2235
   2009-05-12 12:42:19,589 DEBUG [Thread-7] (DFSClient.java:3013)
   hdfs.DFSClient: DFSClient flush() : saveOffset 2125 bytesCurBlock 2235
   lastFlushOffset 1986
   2009-05-12 12:42:19,590 DEBUG [Thread-7] (FSStreamManager.java:66)
   hdfs.HdfsQueueConsumer: Flushed stream, size = 2235
  
   So although the Offset is changing as expected, the output stream isn't
   being flushed or cleared out and isn't being written to file unless the
   stream is closed() ... is this the expected behaviour?
  
   -sd
  
 



 --
 Sasha Dolgy
 sasha.do...@gmail.com




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Datanodes fail to start

2009-05-15 Thread jason hadoop
 got processed in 11 msecs
 2009-05-14 13:36:15,454 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
 82 blocks got processed in 12 msecs
 2009-05-14 14:36:13,662 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
 82 blocks got processed in 9 msecs
 2009-05-14 15:36:14,930 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
 82 blocks got processed in 13 msecs
 2009-05-14 16:36:16,151 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
 82 blocks got processed in 12 msecs
 2009-05-14 17:36:14,407 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
 82 blocks got processed in 9 msecs
 2009-05-14 18:36:15,659 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
 82 blocks got processed in 10 msecs
 2009-05-14 19:27:02,188 WARN org.apache.hadoop.dfs.DataNode:
 java.io.IOException: Call to
 hadoopmaster.utdallas.edu/10.110.95.61:9000failed on local except$
at org.apache.hadoop.ipc.Client.wrapException(Client.java:751)
at org.apache.hadoop.ipc.Client.call(Client.java:719)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at org.apache.hadoop.dfs.$Proxy4.sendHeartbeat(Unknown Source)
at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:690)
at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2967)
at java.lang.Thread.run(Thread.java:619)
 Caused by: java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at
 org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:500)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:442)

 2009-05-14 19:27:06,198 INFO org.apache.hadoop.ipc.Client: Retrying connect
 to server: hadoopmaster.utdallas.edu/10.110.95.61:9000. Already tried 0
 time(s).
 2009-05-14 19:27:06,436 INFO org.apache.hadoop.dfs.DataNode: SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down DataNode at Slave1/127.0.1.1
 /
 2009-05-14 19:27:21,737 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
 /
 STARTUP_MSG: Starting DataNode
 STARTUP_MSG:   host = Slave1/127.0.1.1


 On Thu, May 14, 2009 at 11:43 PM, jason hadoop jason.had...@gmail.com
 wrote:

  The data node logs are on the datanode machines in the log directory.
  You may wish to buy my book and read chapter 4 on hdfs management.
 
  On Thu, May 14, 2009 at 9:39 PM, Pankil Doshi forpan...@gmail.com
 wrote:
 
   Can u guide me where can I find datanode log files? As I cannot find it
  in
   $hadoop/logs and so.
  
   I can only find  following files in logs folder :-
  
   hadoop-hadoop-namenode-hadoopmaster.log
  hadoop-hadoop-namenode-hadoopmaster.out
   hadoop-hadoop-namenode-hadoopmaster.out.1
 hadoop-hadoop-secondarynamenode-hadoopmaster.log
   hadoop-hadoop-secondarynamenode-hadoopmaster.out
   hadoop-hadoop-secondarynamenode-hadoopmaster.out.1
  history
  
  
   Thanks
   Pankil
  
   On Thu, May 14, 2009 at 11:27 PM, jason hadoop jason.had...@gmail.com
   wrote:
  
You have to examine the datanode log files
the namenode does not start the datanodes, the start script does.
The name node passively waits for the datanodes to connect to it.
   
On Thu, May 14, 2009 at 6:43 PM, Pankil Doshi forpan...@gmail.com
   wrote:
   
 Hello Everyone,

 Actually I had a cluster which was up.

 But i stopped the cluster as i  wanted to format it.But cant start
 it
back.

 1)when i give start-dfs.sh I get following on screen

 starting namenode, logging to

  
 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-namenode-hadoopmaster.out
 slave1.local: starting datanode, logging to
 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave1.out
 slave3.local: starting datanode, logging to
 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave3.out
 slave4.local: starting datanode, logging to
 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave4.out
 slave2.local: starting datanode, logging to
 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave2.out
 slave5.local: starting datanode, logging to
 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave5.out
 slave6.local: starting datanode, logging to
 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave6.out
 slave9.local: starting datanode, logging to
 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave9.out
 slave8.local: starting datanode, logging to
 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave8.out
 slave7.local: starting datanode, logging to
 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave7.out
 slave10.local: starting datanode, logging to

 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave10.out
 hadoopmaster.local: starting

Re: hadoop streaming binary input / image processing

2009-05-15 Thread jason hadoop
My apologies Piotr, I was referring to the streaming case and then pulling
the file out of a shared file system, not using an input split that
contains the image data as you suggest.

On Thu, May 14, 2009 at 11:50 PM, Piotr Praczyk piotr.prac...@gmail.comwrote:

 It depends on what API you use. When writing an InputSplit implementation, it
 is possible to specify on which nodes the data resides. I am new to
 Hadoop, but as far as I know, doing this
 should enable support for data locality. Moreover, implementing a
 subclass of TextInputFormat and adding some encoding on the fly should not
 impact any locality properties.


 Piotr


 2009/5/15 jason hadoop jason.had...@gmail.com

  A  downside of this approach is that you will not likely have data
 locality
  for the data on shared file systems, compared with data coming from an
  input
  split.
  That being said,
  from your script, *hadoop dfs -get FILE -* will write the file to
 standard
  out.
 
  On Thu, May 14, 2009 at 10:01 AM, Piotr Praczyk piotr.prac...@gmail.com
  wrote:
 
   just in addition to my previous post...
  
   You don't have to store the encoded files in a file system of course, since
   you can write your own InputFormat which will do this on the fly... the
   overhead should not be that big.
  
   Piotr
  
   2009/5/14 Piotr Praczyk piotr.prac...@gmail.com
  
Hi
   
 If you want to read the files from HDFS and cannot pass the binary data,
 you can do some encoding of it (base 64 for example, but you can think about
 something more efficient, since the range of characters acceptable in the
 input string is wider than that used by Base64). It should solve the problem
 until some kind of binary input is supported (is it going to happen?).
   
Piotr
   
2009/5/14 openresearch qiming...@openresearchinc.com
   
   
All,
   
I have read some recommendation regarding image (binary input)
   processing
using Hadoop-streaming which only accept text out-of-box for now.
http://hadoop.apache.org/core/docs/current/streaming.html
https://issues.apache.org/jira/browse/HADOOP-1722
http://markmail.org/message/24woaqie2a6mrboc
   
However, I have not got any straight answer.
   
One recommendation is to put image data on HDFS, but we have to do
hadoop dfs -get for each file/dir and process it locally, which is very
expensive.
   
Another recommendation is to "...put them in a centralized place where all
the hadoop nodes can access them (via e.g. an NFS mount)..." Obviously, IO
will become a bottleneck and it defeats the purpose of distributed
processing.
   
I also notice some enhancement ticket is open for hadoop-core. Is it
committed to any svn (0.21) branch? can somebody show me an example
  how
   to
take *.jpg files (from HDFS), and process files in a distributed
  fashion
using streaming?
   
Many thanks
   
-Qiming
--
View this message in context:
   
  
 
 http://www.nabble.com/hadoop-streaming-binary-input---image-processing-tp23544344p23544344.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.
   
   
   
  
 
 
 
  --
  Alpha Chapters of my book on Hadoop are available
  http://www.apress.com/book/view/9781430219422
  www.prohadoopbook.com a community for Hadoop Professionals
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Is Mapper's map method thread safe?

2009-05-14 Thread jason hadoop
Ultimately it depends on how you write the Mapper.map method.
The framework supports a MultithreadedMapRunner which lets you set the
number of threads running your map method simultaneously.
Chapter 5 of my book covers this.
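
A minimal configuration sketch; the thread-count property name below is the
one the 0.18/0.19 MultithreadedMapRunner reads, but treat it as an assumption
and check it against your version:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

public class MultithreadedSetup {
  public static void configure(JobConf conf) {
    // Run each map task's Mapper.map() from a pool of threads; your
    // Mapper.map() implementation must then be thread safe.
    conf.setMapRunnerClass(MultithreadedMapRunner.class);
    // Assumed property name: number of threads per map task (default 10).
    conf.setInt("mapred.map.multithreadedrunner.threads", 10);
  }
}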


On Wed, May 13, 2009 at 11:10 PM, Shengkai Zhu geniusj...@gmail.com wrote:

 Each mapper instance will be executed in a separate JVM.

 On Thu, May 14, 2009 at 2:04 PM, imcaptor imcap...@gmail.com wrote:

  Dear all:
 
  Does anyone know whether Mapper's map method is thread safe?
  Thank you!
 
  imcaptor
 



 --

 朱盛凯

 Jash Zhu

 复旦大学软件学院

 Software School, Fudan University




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Selective output based on keys

2009-05-14 Thread jason hadoop
The customary practice is to have your Reducer.reduce method handle the
filtering if you are reducing your output,
or in the Mapper.map method if you are not.
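
For example, a small sketch with a made-up predicate on the key; only matching
records are collected, everything else is silently dropped:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class FilteringReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    if (!key.toString().startsWith("keep:")) {  // hypothetical predicate
      return;  // never collected, so never written
    }
    while (values.hasNext()) {
      output.collect(key, values.next());
    }
  }
}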

On Wed, May 13, 2009 at 1:57 PM, Asim linka...@gmail.com wrote:

 Hi,

 I wish to output only selective records to the output files based on
 keys. Is it possible to selectively write keys by setting
 setOutputKeyClass. Kindly let me know.

 Regards,
 Asim




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Setting up another machine as secondary node

2009-05-14 Thread jason hadoop
Any machine put in the conf/masters file becomes a secondary namenode.

At some point there was confusion about whether more than one secondary was
safe; I believe that was settled, and multiple secondaries are safe.

The secondary namenode takes a snapshot at 5-minute (configurable)
intervals, rebuilds the fsimage and sends that back to the namenode.
There is some performance advantage to having it on the local machine, and
some safety advantage to having it on an alternate machine.
Could someone who remembers speak up on single vs. multiple secondary
namenodes?


On Thu, May 14, 2009 at 6:07 AM, David Ritch david.ri...@gmail.com wrote:

 First of all, the secondary namenode is not what you might think a
 secondary is - it's not a failover device.  It does make a copy of the
 filesystem metadata periodically, and it integrates the edits into the
 image.  It does *not* provide failover.

 Second, you specify its IP address in hadoop-site.xml.  This is where you
 can override the defaults set in hadoop-default.xml.

 dbr

 On Thu, May 14, 2009 at 9:03 AM, Rakhi Khatwani rakhi.khatw...@gmail.com
 wrote:

  Hi,
  I want to set up a cluster of 5 nodes in such a way that
  node1 - master
  node2 - secondary namenode
  node3 - slave
  node4 - slave
  node5 - slave
 
 
  How do we go about that?
  There is no property in hadoop-env where I can set the IP address for the
  secondary namenode.
 
  If I set node-1 and node-2 in masters, then when we start dfs both machines
  run the namenode and secondary namenode processes, but I think
  only node1 is active,
  and my namenode failover operation fails.
 
  Any suggestions?
 
  Regards,
  Rakhi
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Map-side join: Sort order preserved?

2009-05-14 Thread jason hadoop
Sort order is preserved if your Mapper doesn't change the key ordering in
output. Partition name is not preserved.

What I have done is to manually work out what the partition number of the
output file should be for each map task, by calling the partitioner on an
input key, and then renaming the output in the close method.

Conceptually the place for this dance is in the OutputCommitter, but I
haven't used them in production code, and my mapside join examples come from
before they were available.

the Hadoop join framework handles setting the split size to Long.MAX_VALUE
for you.

If you put up a discussion question on www.prohadoopbook.com, I will fill in
the example on how to do this.
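
In the meantime, a rough sketch of just the "work out the partition number"
step; it assumes the TotalOrderPartitioner partition file that produced the
inputs is visible to the task (for example via the DistributedCache), and it
leaves the actual rename in close() out:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

public class PartitionProbe {
  // Feed one key from this map task's input split to the same partitioner
  // that partitioned the inputs; the returned number is the part-NNNNN
  // suffix to use when renaming this task's output in Mapper.close().
  public static int partitionFor(Text firstInputKey, JobConf conf,
      int numPartitions) {
    TotalOrderPartitioner<Text, Text> partitioner =
        new TotalOrderPartitioner<Text, Text>();
    partitioner.configure(conf);  // loads the partition file
    return partitioner.getPartition(firstInputKey, null, numPartitions);
  }
}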

On Thu, May 14, 2009 at 8:04 AM, Stuart White stuart.whi...@gmail.comwrote:

 I'm implementing a map-side join as described in chapter 8 of Pro
 Hadoop.  I have two files that have been partitioned using the
 TotalOrderPartitioner on the same key into the same number of
 partitions.  I've set mapred.min.split.size to Long.MAX_VALUE so that
 one Mapper will handle an entire partition.

 I want the output to be written in the same partitioned, total sort
 order.  If possible, I want to accomplish this by setting my
 NumReducers to 0 and having the output of my Mappers written directly
 to HDFS, thereby skipping the partition/sort step.

 My question is this: Am I guaranteed that the Mapper that processes
 part-0 will have its output written to the output file named
 part-0, the Mapper that processes part-1 will have its output
 written to part-1, etc... ?

 If so, then I can preserve the partitioning/sort order of my input
 files without re-partitioning and re-sorting.

 Thanks.




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: How to replace the storage on a datanode without formatting the namenode?

2009-05-14 Thread jason hadoop
You can decommission the datanode, and then un-decommission it.

On Thu, May 14, 2009 at 7:44 AM, Alexandra Alecu
alexandra.al...@gmail.comwrote:


 Hi,

 I want to test how Hadoop and HBase are performing. I have a cluster with 1
 namenode and 4 datanodes. I use Hadoop 0.19.1 and HBase 0.19.2.

 I first ran a few tests where the 4 datanodes used local storage specified in
 dfs.data.dir.
 Now, I want to see what is the tradeoff if I switch from local storage to
 network mounted storage (I know it sounds like a crazy idea but
 unfortunately I have to explore this possibility).

 I would like to be able to change the dfs.data.dir and maybe in two steps
 be
 able to switch to the network mounted storage.

 What I had in mind was the following steps :

 0. Assume initial status is a working cluster with local storage, e.g.
 dfs.data.dir set to local_storage_path.
 1. Stop cluster: bin/stop-dfs
 2. Change dfs.data.dir by adding the network_storage_path to the local
 storage_path.
 3. Start cluster: bin/start-dfs (this will format the new network
 locations,
 which is nice)
 4. Perform some sort of directed balancing of all the data towards the
 network storage location
 5. Stop cluster: bin/stop-dfs
 6. Change dfs.data.dir parameter to only contain local_storage_path
 7.  Start cluster and live happily ever after :-).

 The problem is, I don't know if there is a command or an option to achieve
 step 4.
 Do you have any suggestions?

 I found some info on how to add datanodes, but there is not much info on
 how
 to remove safely (without losing data etc) datanodes or storage locations
 on
 a particular node.
 Is this possible?

 Many thanks,
 Alexandra.





 --
 View this message in context:
 http://www.nabble.com/How-to-replace-the-storage-on-a-datanode-without-formatting-the-namenode--tp23542127p23542127.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Map-side join: Sort order preserved?

2009-05-14 Thread jason hadoop
In the map-side join, the input file name is not visible, as the input is
actually a composite of a large number of files.

I have started answering in www.prohadoopbook.com

On Thu, May 14, 2009 at 1:19 PM, Stuart White stuart.whi...@gmail.comwrote:

 On Thu, May 14, 2009 at 10:25 AM, jason hadoop jason.had...@gmail.com
 wrote:
  If you put up a discussion question on www.prohadoopbook.com, I will
 fill in
  the example on how to do this.

 Done.  Thanks!

 http://www.prohadoopbook.com/forum/topics/preserving-partition-file




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: How to replace the storage on a datanode without formatting the namenode?

2009-05-14 Thread jason hadoop
You can have separate configuration files for the different datanodes.

If you are willing to deal with the complexity you can manually start them
with altered properties from the command line.

rsync or other means of sharing  identical configs is simple and common.

Raghu, your technique will only work well if you can complete steps 1-4 in
less than the datanode timeout interval, which may be feasible for Alexandra.
I believe the timeout is 10 minutes.
If you pass the timeout interval the namenode will start to re-replicate the
blocks, and when the datanode comes back it will delete all of the blocks it
has re-replicated.

On Thu, May 14, 2009 at 11:35 AM, Raghu Angadi rang...@yahoo-inc.comwrote:


 Along these lines, even simpler approach I would think is :

 1) set data.dir to local and create the data.
 2) stop the datanode
 3) rsync local_dir network_dir
 4) start datanode with data.dir with network_dir

 There is no need to format or rebalance.

 This way you can switch between local and network multiple times (without
 needing to rsync data, if there are no changes made in the tests)

 Raghu.


 Alexandra Alecu wrote:

 Another possibility I am thinking about now, which is suitable for me as I
 do
 not actually have much data stored in the cluster when I want to perform
 this switch is to set the replication level really high and then simply
 remove the local storage locations and restart the cluster. With a bit of
 luck the high level of replication will allow a full recovery of the
 cluster
 on restart.

 Is this something that you would advice?

 Many thanks,
 Alexandra.





-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: hadoop streaming binary input / image processing

2009-05-14 Thread jason hadoop
A  downside of this approach is that you will not likely have data locality
for the data on shared file systems, compared with data coming from an input
split.
That being said,
from your script, *hadoop dfs -get FILE -* will write the file to standard
out.
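
For completeness, a rough Java equivalent of that command (the path comes in
as an argument here):

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Streams a file stored in HDFS to standard out, the same effect as
// "hadoop dfs -get FILE -" but callable from application code.
public class CatHdfsFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    InputStream in = fs.open(new Path(args[0]));
    try {
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}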

On Thu, May 14, 2009 at 10:01 AM, Piotr Praczyk piotr.prac...@gmail.comwrote:

 just in addition to my previous post...

 You don't have to store the encoded files in a file system of course, since
 you can write your own InputFormat which will do this on the fly... the
 overhead should not be that big.

 Piotr

 2009/5/14 Piotr Praczyk piotr.prac...@gmail.com

  Hi
 
  If you want to read the files from HDFS and cannot pass the binary data,
  you can do some encoding of it (base 64 for example, but you can think about
  something more efficient, since the range of characters acceptable in the
  input string is wider than that used by Base64). It should solve the problem
  until some kind of binary input is supported (is it going to happen?).
 
  Piotr
 
  2009/5/14 openresearch qiming...@openresearchinc.com
 
 
  All,
 
  I have read some recommendation regarding image (binary input)
 processing
  using Hadoop-streaming which only accept text out-of-box for now.
  http://hadoop.apache.org/core/docs/current/streaming.html
  https://issues.apache.org/jira/browse/HADOOP-1722
  http://markmail.org/message/24woaqie2a6mrboc
 
  However, I have not got any straight answer.
 
  One recommendation is to put image data on HDFS, but we have to do hadoop
  dfs -get for each file/dir and process it locally, which is very expensive.
 
  Another recommendation is to "...put them in a centralized place where all
  the hadoop nodes can access them (via e.g. an NFS mount)..." Obviously, IO
  will become a bottleneck and it defeats the purpose of distributed
  processing.
 
  I also notice some enhancement ticket is open for hadoop-core. Is it
  committed to any svn (0.21) branch? can somebody show me an example how
 to
  take *.jpg files (from HDFS), and process files in a distributed fashion
  using streaming?
 
  Many thanks
 
  -Qiming
  --
  View this message in context:
 
 http://www.nabble.com/hadoop-streaming-binary-input---image-processing-tp23544344p23544344.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Datanodes fail to start

2009-05-14 Thread jason hadoop
You have to examine the datanode log files
the namenode does not start the datanodes, the start script does.
The name node passively waits for the datanodes to connect to it.

On Thu, May 14, 2009 at 6:43 PM, Pankil Doshi forpan...@gmail.com wrote:

 Hello Everyone,

 Actually I had a cluster which was up.

 But I stopped the cluster as I wanted to format it, but I can't start it back up.

 1)when i give start-dfs.sh I get following on screen

 starting namenode, logging to
 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-namenode-hadoopmaster.out
 slave1.local: starting datanode, logging to
 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave1.out
 slave3.local: starting datanode, logging to
 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave3.out
 slave4.local: starting datanode, logging to
 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave4.out
 slave2.local: starting datanode, logging to
 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave2.out
 slave5.local: starting datanode, logging to
 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave5.out
 slave6.local: starting datanode, logging to
 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave6.out
 slave9.local: starting datanode, logging to
 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave9.out
 slave8.local: starting datanode, logging to
 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave8.out
 slave7.local: starting datanode, logging to
 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave7.out
 slave10.local: starting datanode, logging to
 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave10.out
 hadoopmaster.local: starting secondarynamenode, logging to

 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-secondarynamenode-hadoopmaster.out


 2) from log file named hadoop-hadoop-namenode-hadoopmaster.log I get
 following



 2009-05-14 20:28:23,515 INFO org.apache.hadoop.dfs.NameNode: STARTUP_MSG:
 /
 STARTUP_MSG: Starting NameNode
 STARTUP_MSG:   host = hadoopmaster/127.0.0.1
 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 0.18.3
 STARTUP_MSG:   build =
 https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r
 736250;
 compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009
 /
 2009-05-14 20:28:23,717 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
 Initializing RPC Metrics with hostName=NameNode, port=9000
 2009-05-14 20:28:23,728 INFO org.apache.hadoop.dfs.NameNode: Namenode up
 at:
 hadoopmaster.local/192.168.0.1:9000
 2009-05-14 20:28:23,733 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
 Initializing JVM Metrics with processName=NameNode, sessionId=null
 2009-05-14 20:28:23,743 INFO org.apache.hadoop.dfs.NameNodeMetrics:
 Initializing NameNodeMeterics using context
 object:org.apache.hadoop.metrics.spi.NullContext
 2009-05-14 20:28:23,856 INFO org.apache.hadoop.fs.FSNamesystem:

 fsOwner=hadoop,hadoop,adm,dialout,fax,cdrom,floppy,tape,audio,dip,video,plugdev,fuse,lpadmin,admin,sambashare
 2009-05-14 20:28:23,856 INFO org.apache.hadoop.fs.FSNamesystem:
 supergroup=supergroup
 2009-05-14 20:28:23,856 INFO org.apache.hadoop.fs.FSNamesystem:
 isPermissionEnabled=true
 2009-05-14 20:28:23,883 INFO org.apache.hadoop.dfs.FSNamesystemMetrics:
 Initializing FSNamesystemMeterics using context
 object:org.apache.hadoop.metrics.spi.NullContext
 2009-05-14 20:28:23,885 INFO org.apache.hadoop.fs.FSNamesystem: Registered
 FSNamesystemStatusMBean
 2009-05-14 20:28:23,964 INFO org.apache.hadoop.dfs.Storage: Number of files
 = 1
 2009-05-14 20:28:23,971 INFO org.apache.hadoop.dfs.Storage: Number of files
 under construction = 0
 2009-05-14 20:28:23,971 INFO org.apache.hadoop.dfs.Storage: Image file of
 size 80 loaded in 0 seconds.
 2009-05-14 20:28:23,972 INFO org.apache.hadoop.dfs.Storage: Edits file
 edits
 of size 4 edits # 0 loaded in 0 seconds.
 2009-05-14 20:28:23,974 INFO org.apache.hadoop.fs.FSNamesystem: Finished
 loading FSImage in 155 msecs
 2009-05-14 20:28:23,976 INFO org.apache.hadoop.fs.FSNamesystem: Total
 number
 of blocks = 0
 2009-05-14 20:28:23,988 INFO org.apache.hadoop.fs.FSNamesystem: Number of
 invalid blocks = 0
 2009-05-14 20:28:23,988 INFO org.apache.hadoop.fs.FSNamesystem: Number of
 under-replicated blocks = 0
 2009-05-14 20:28:23,988 INFO org.apache.hadoop.fs.FSNamesystem: Number of
 over-replicated blocks = 0
 2009-05-14 20:28:23,988 INFO org.apache.hadoop.dfs.StateChange: STATE*
 Leaving safe mode after 0 secs.
 *2009-05-14 20:28:23,989 INFO org.apache.hadoop.dfs.StateChange: STATE*
 Network topology has 0 racks and 0 datanodes*
 2009-05-14 20:28:23,989 INFO org.apache.hadoop.dfs.StateChange: STATE*
 UnderReplicatedBlocks has 0 blocks
 2009-05-14 20:28:29,128 INFO org.mortbay.util.Credential: Checking Resource
 aliases
 2009-05-14 20:28:29,243 INFO org.mortbay.http.HttpServer: Version
 

Re: Datanodes fail to start

2009-05-14 Thread jason hadoop
The data node logs are on the datanode machines in the log directory.
You may wish to buy my book and read chapter 4 on hdfs management.

On Thu, May 14, 2009 at 9:39 PM, Pankil Doshi forpan...@gmail.com wrote:

 Can you guide me to where I can find the datanode log files? I cannot find
 them in $hadoop/logs.

 I can only find  following files in logs folder :-

 hadoop-hadoop-namenode-hadoopmaster.log
hadoop-hadoop-namenode-hadoopmaster.out
 hadoop-hadoop-namenode-hadoopmaster.out.1
   hadoop-hadoop-secondarynamenode-hadoopmaster.log
 hadoop-hadoop-secondarynamenode-hadoopmaster.out
 hadoop-hadoop-secondarynamenode-hadoopmaster.out.1
history


 Thanks
 Pankil

 On Thu, May 14, 2009 at 11:27 PM, jason hadoop jason.had...@gmail.com
 wrote:

  You have to examine the datanode log files
  the namenode does not start the datanodes, the start script does.
  The name node passively waits for the datanodes to connect to it.
 
  On Thu, May 14, 2009 at 6:43 PM, Pankil Doshi forpan...@gmail.com
 wrote:
 
   Hello Everyone,
  
   Actually I had a cluster which was up.
  
   But i stopped the cluster as i  wanted to format it.But cant start it
  back.
  
   1)when i give start-dfs.sh I get following on screen
  
   starting namenode, logging to
  
 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-namenode-hadoopmaster.out
   slave1.local: starting datanode, logging to
   /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave1.out
   slave3.local: starting datanode, logging to
   /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave3.out
   slave4.local: starting datanode, logging to
   /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave4.out
   slave2.local: starting datanode, logging to
   /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave2.out
   slave5.local: starting datanode, logging to
   /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave5.out
   slave6.local: starting datanode, logging to
   /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave6.out
   slave9.local: starting datanode, logging to
   /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave9.out
   slave8.local: starting datanode, logging to
   /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave8.out
   slave7.local: starting datanode, logging to
   /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave7.out
   slave10.local: starting datanode, logging to
   /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave10.out
   hadoopmaster.local: starting secondarynamenode, logging to
  
  
 
 /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-secondarynamenode-hadoopmaster.out
  
  
   2) from log file named hadoop-hadoop-namenode-hadoopmaster.log I get
   following
  
  
  
   2009-05-14 20:28:23,515 INFO org.apache.hadoop.dfs.NameNode:
 STARTUP_MSG:
   /
   STARTUP_MSG: Starting NameNode
   STARTUP_MSG:   host = hadoopmaster/127.0.0.1
   STARTUP_MSG:   args = []
   STARTUP_MSG:   version = 0.18.3
   STARTUP_MSG:   build =
   https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r
   736250;
   compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009
   /
   2009-05-14 20:28:23,717 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
   Initializing RPC Metrics with hostName=NameNode, port=9000
   2009-05-14 20:28:23,728 INFO org.apache.hadoop.dfs.NameNode: Namenode
 up
   at:
   hadoopmaster.local/192.168.0.1:9000
   2009-05-14 20:28:23,733 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
   Initializing JVM Metrics with processName=NameNode, sessionId=null
   2009-05-14 20:28:23,743 INFO org.apache.hadoop.dfs.NameNodeMetrics:
   Initializing NameNodeMeterics using context
   object:org.apache.hadoop.metrics.spi.NullContext
   2009-05-14 20:28:23,856 INFO org.apache.hadoop.fs.FSNamesystem:
  
  
 
 fsOwner=hadoop,hadoop,adm,dialout,fax,cdrom,floppy,tape,audio,dip,video,plugdev,fuse,lpadmin,admin,sambashare
   2009-05-14 20:28:23,856 INFO org.apache.hadoop.fs.FSNamesystem:
   supergroup=supergroup
   2009-05-14 20:28:23,856 INFO org.apache.hadoop.fs.FSNamesystem:
   isPermissionEnabled=true
   2009-05-14 20:28:23,883 INFO org.apache.hadoop.dfs.FSNamesystemMetrics:
   Initializing FSNamesystemMeterics using context
   object:org.apache.hadoop.metrics.spi.NullContext
   2009-05-14 20:28:23,885 INFO org.apache.hadoop.fs.FSNamesystem:
  Registered
   FSNamesystemStatusMBean
   2009-05-14 20:28:23,964 INFO org.apache.hadoop.dfs.Storage: Number of
  files
   = 1
   2009-05-14 20:28:23,971 INFO org.apache.hadoop.dfs.Storage: Number of
  files
   under construction = 0
   2009-05-14 20:28:23,971 INFO org.apache.hadoop.dfs.Storage: Image file
 of
   size 80 loaded in 0 seconds.
   2009-05-14 20:28:23,972 INFO org.apache.hadoop.dfs.Storage: Edits file
   edits
   of size 4 edits # 0 loaded in 0 seconds.
   2009-05-14 20:28:23,974

Re: hadoop streaming reducer values

2009-05-13 Thread jason hadoop
You may wish to set the separator to the string comma space ', ' for your
example.
Chapter 7 of my book goes into this in some detail, and I posted a graphic
that visually depicts the process and the values
about a month ago.
The original post was titled 'Changing key/value separator in hadoop
streaming'
and I have attached the graphic.


On Tue, May 12, 2009 at 7:55 PM, Alan Drew drewsk...@yahoo.com wrote:


 Hi,

 I have a question about the key, values that the reducer gets in Hadoop
 Streaming.

 I wrote a simple mapper.sh, reducer.sh script files:

 mapper.sh :

 #!/bin/bash

 while read data
 do
  # tokenize the data and output <word, 1> pairs
  echo "$data" | awk '{token=0; while(++token<=NF) print $token"\t1"}'
 done

 reducer.sh :

 #!/bin/bash

 while read data
 do
  echo -e $data
 done

 The mapper tokenizes a line of input and outputs word, 1 pairs to
 standard
 output.  The reducer just outputs what it gets from standard input.

 I have a simple input file:

 cat in the hat
 ate my mat the

 I was expecting the final output to be something like:

 the 1 1 1
 cat 1

 etc.

 but instead each word has its own line, which makes me think that
 <key, value> is being given to the reducer and not <key, values>, which is
 the default for normal Hadoop (in Java), right?

 the 1
 the 1
 the 1
 cat 1

 Is there any way to get <key, values> for the reducer and not a bunch of
 <key, value> pairs?  I looked into the -reducer aggregate option, but there
 doesn't seem to be a way to customize what the reducer does with the <key,
 values> other than max/min functions.

 Thanks.
 --
 View this message in context:
 http://www.nabble.com/hadoop-streaming-reducer-values-tp23514523p23514523.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: How can I get the actual time for one write operation in HDFS?

2009-05-13 Thread jason hadoop
Close the file after you write one block, the close is synchronous.
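
A hedged sketch of measuring it that way from client code; the 64 MB buffer
below just assumes the default block size:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Times one block's worth of data from the first write through close();
// close() blocks until the pipeline has acknowledged the data, so the
// elapsed time covers the replicas being written.
public class TimedBlockWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    byte[] oneBlock = new byte[64 * 1024 * 1024];

    long start = System.currentTimeMillis();
    FSDataOutputStream out = fs.create(new Path("/tmp/timing-test.dat"));
    out.write(oneBlock);
    out.close();
    long elapsed = System.currentTimeMillis() - start;

    System.out.println("write + close took " + elapsed + " ms");
  }
}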

On Tue, May 12, 2009 at 11:50 PM, Xie, Tao xietao1...@gmail.com wrote:


 DFSOutputStream.writeChunk() enqueues packets into the data queue and after
 that
 it returns. So the write is asynchronous.

 I want to know the total actual time HDFS takes to execute the write
 operation (from the start of writeChunk() to the time that each replica is
 written to disk). How can I get that time?

 Thanks.

 --
 View this message in context:
 http://www.nabble.com/How-can-I-get-the-actual-time-for-one-write-operation-in-HDFS--tp23516363p23516363.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: hadoop streaming reducer values

2009-05-13 Thread jason hadoop
Thanks Chuck, I didn't read the post closely and focused on the commas.


On Wed, May 13, 2009 at 2:38 PM, Chuck Lam chuck@gmail.com wrote:

 The behavior you saw in Streaming (a list of <key, value> instead of <key,
 list
 of values>) is indeed intentional, and it's part of the design differences
 between Streaming and Hadoop Java. That is, in Streaming your reducer is
 responsible for grouping values of the same key, whereas in Java the
 grouping is done for you.

 However, the input to your reducer is still sorted (and partitioned) on the
 key, so all key/value pairs of the same key will arrive at your reducer in
 one contiguous chunk. Your reducer can keep a last_key variable to track
 whether all records of the same key have been read in. In Python a reducer
 that sums up all values of a key is like this:

 #!/usr/bin/env python

 import sys

 (last_key, sum) = (None, 0.0)

 for line in sys.stdin:
     (key, val) = line.split("\t")

     if last_key and last_key != key:
         print last_key + "\t" + str(sum)
         sum = 0.0

     last_key = key
     sum += float(val)

 print last_key + "\t" + str(sum)


 Streaming is covered in all 3 upcoming Hadoop books. The above is an
 example
 from mine ;)  http://www.manning.com/lam/ . Tom White has the definitive
 guide
 from O'Reilly - http://www.hadoopbook.com/ . Jason has
 http://www.apress.com/book/view/9781430219422





 On Tue, May 12, 2009 at 7:55 PM, Alan Drew drewsk...@yahoo.com wrote:

 
  Hi,
 
  I have a question about the key, values that the reducer gets in Hadoop
  Streaming.
 
  I wrote a simple mapper.sh, reducer.sh script files:
 
  mapper.sh :
 
  #!/bin/bash
 
  while read data
  do
   #tokenize the data and output the values word, 1
   echo "$data" | awk '{token=0; while(++token<=NF) print $token"\t1"}'
  done
 
  reducer.sh :
 
  #!/bin/bash
 
  while read data
  do
   echo -e $data
  done
 
  The mapper tokenizes a line of input and outputs word, 1 pairs to
  standard
  output.  The reducer just outputs what it gets from standard input.
 
  I have a simple input file:
 
  cat in the hat
  ate my mat the
 
  I was expecting the final output to be something like:
 
  the 1 1 1
  cat 1
 
  etc.
 
  but instead each word has its own line, which makes me think that
  key,value is being given to the reducer and not key, values which is
  default for normal Hadoop (in Java) right?
 
  the 1
  the 1
  the 1
  cat 1
 
  Is there any way to get key, values for the reducer and not a bunch of
  key, value pairs?  I looked into the -reducer aggregate option, but
 there
  doesn't seem to be a way to customize what the reducer does with the
 key,
  values other than max,min functions.
 
  Thanks.
  --
  View this message in context:
 
 http://www.nabble.com/hadoop-streaming-reducer-values-tp23514523p23514523.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: sub 60 second performance

2009-05-11 Thread jason hadoop
You would need to read the data, and store it in an internal data structure,
or copy the file to a local file system file and mmap it if you didn't want
to store it in java heap space.

Your map then has to deal with the fact that the data isn't being passed in
directly.
This is not straightforward to do.
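
A rough sketch of the pinned-static-variable idea mentioned earlier in this
thread; the cached path is a placeholder, and it only pays off when JVM reuse
is enabled (setNumTasksToExecutePerJvm / mapred.job.reuse.jvm.num.tasks):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CachedDataMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Survives from one task to the next only while the task JVM is reused.
  private static volatile byte[] cachedData;

  @Override
  public void configure(JobConf conf) {
    if (cachedData == null) {
      synchronized (CachedDataMapper.class) {
        if (cachedData == null) {
          try {
            FileSystem fs = FileSystem.get(conf);
            InputStream in = fs.open(new Path("/data/small-dataset.bin")); // placeholder
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            IOUtils.copyBytes(in, buf, 4096, true);  // copies and closes
            cachedData = buf.toByteArray();
          } catch (IOException e) {
            throw new RuntimeException("could not load cached dataset", e);
          }
        }
      }
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // ... use cachedData here instead of re-reading it from HDFS ...
  }
}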


On Sun, May 10, 2009 at 4:53 PM, Matt Bowyer mattbowy...@googlemail.comwrote:

 Thanks Jason, how can I get access to the particular block?

 Do you mean create a static map inside the task (add the values) and
 check
 whether it is populated on the next run?

 Or is there a more elegant / tried-and-tested solution?

 thanks again

 On Mon, May 11, 2009 at 12:41 AM, jason hadoop jason.had...@gmail.com
 wrote:

  You can cache the block in your task, in a pinned static variable, when
 you
  are reusing the jvms.
 
  On Sun, May 10, 2009 at 2:30 PM, Matt Bowyer mattbowy...@googlemail.com
  wrote:
 
   Hi,
  
   I am trying to do 'on demand map reduce' - something which will return
 in
   reasonable time (a few seconds).
  
   My dataset is relatively small and can fit into my datanode's memory.
 Is
  it
   possible to keep a block in the datanode's memory so on the next job
 the
   response will be much quicker? The majority of the time spent during
 the
   job
   run appears to be during the 'HDFS_BYTES_READ' part of the job. I have
   tried
   using the setNumTasksToExecutePerJvm but the block still seems to be
   cleared
   from memory after the job.
  
   thanks!
  
 
 
 
  --
  Alpha Chapters of my book on Hadoop are available
  http://www.apress.com/book/view/9781430219422
  www.prohadoopbook.com a community for Hadoop Professionals
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Re-Addressing a cluster

2009-05-11 Thread jason hadoop
Now that I think about it, the reverse lookups in my clusters work.

On Mon, May 11, 2009 at 3:07 AM, Steve Loughran ste...@apache.org wrote:

 jason hadoop wrote:

 You should be able to relocate the cluster's IP space by stopping the
 cluster, modifying the configuration files, resetting the dns and starting
 the cluster.
 It is best to verify connectivity with the new IP addresses before starting
 the
 cluster.

 To the best of my knowledge the namenode doesn't care about the IP
 addresses
 of the datanodes, only what blocks they report as having. The namenode
 does
 care about losing contact with a connected datanode, and will re-replicate
 the blocks that are now under-replicated.

 I prefer IP addresses in my configuration files but that is a personal
 preference not a requirement.



 I do deployments onto virtual clusters without fully functional reverse
 DNS, and things do work badly in that situation. Hadoop assumes that if a
 machine looks up its hostname, it can pass that to peers and they can
 resolve it - the well-managed network infrastructure assumption.




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: sub 60 second performance

2009-05-10 Thread jason hadoop
You can cache the block in your task, in a pinned static variable, when you
are reusing the jvms.

On Sun, May 10, 2009 at 2:30 PM, Matt Bowyer mattbowy...@googlemail.comwrote:

 Hi,

 I am trying to do 'on demand map reduce' - something which will return in
 reasonable time (a few seconds).

 My dataset is relatively small and can fit into my datanode's memory. Is it
 possible to keep a block in the datanode's memory so on the next job the
 response will be much quicker? The majority of the time spent during the
 job
 run appears to be during the 'HDFS_BYTES_READ' part of the job. I have
 tried
 using the setNumTasksToExecutePerJvm but the block still seems to be
 cleared
 from memory after the job.

 thanks!




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: large files vs many files

2009-05-09 Thread jason hadoop
You must create unique file names; I don't believe (but I do not know) that
the append code will allow multiple writers.

Are you writing from within a task, or as an external application writing
into Hadoop?

You may try using UUID,
http://java.sun.com/j2se/1.5.0/docs/api/java/util/UUID.html, as part of your
filename.
Without knowing more about your goals, environment and constraints it is
hard to offer any more detailed suggestions.
You could also have an application aggregate the streams and write out
chunks, with one or more writers, one per output file.
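
A small sketch of the unique-file-name idea (the directory and suffix are
placeholders):

import java.util.UUID;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Each writer creates its own file, so no two streams ever contend for the
// same lease; a later job can merge the resulting small files.
public class UniqueStreamWriter {
  public static FSDataOutputStream openUniqueFile(Configuration conf, String dir)
      throws Exception {
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(dir, UUID.randomUUID().toString() + ".log");
    return fs.create(path, false);  // false: fail instead of overwriting
  }
}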


On Sat, May 9, 2009 at 12:15 AM, Sasha Dolgy sdo...@gmail.com wrote:

 yes, that is the problem.  two or hundreds...data streams in very quickly.

 On Fri, May 8, 2009 at 8:42 PM, jason hadoop jason.had...@gmail.com
 wrote:

  Is it possible that two tasks are trying to write to the same file path?
 
 
  On Fri, May 8, 2009 at 11:46 AM, Sasha Dolgy sdo...@gmail.com wrote:
 
   Hi Tom (or anyone else),
   Will SequenceFile allow me to avoid problems with concurrent writes to
  the
    file?  I still continue to get the following exceptions/errors in hdfs:
  
   org.apache.hadoop.ipc.RemoteException:
   org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
   failed to create file /foo/bar/aaa.bbb.ccc.ddd.xxx for
   DFSClient_-1821265528
   on client 127.0.0.1 because current leaseholder is trying to recreate
  file.
  
   Only happens when two processes are trying to write at the same time.
   Now
   ideally I don't want to buffer the data that's coming in and i want to
  get
   it out and into the file asap to avoid any data loss...am i missing
   something here?  is there some sort of factory i can implement to help
 in
   writing a lot of simultaneous data streams?
  
   thanks in advance for any suggestions
   -sasha
  
   On Wed, May 6, 2009 at 9:40 AM, Tom White t...@cloudera.com wrote:
  
Hi Sasha,
   
As you say, HDFS appends are not yet working reliably enough to be
suitable for production use. On the other hand, having lots of little
files is bad for the namenode, and inefficient for MapReduce (see
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/),
 so
it's best to avoid this too.
   
I would recommend using SequenceFile as a storage container for lots
of small pieces of data. Each key-value pair would represent one of
your little files (you can have a null key, if you only need to store
the contents of the file). You can also enable compression (use block
compression), and SequenceFiles are designed to work well with
MapReduce.
   
Cheers,
   
Tom
   
On Wed, May 6, 2009 at 12:34 AM, Sasha Dolgy sasha.do...@gmail.com
wrote:
 hi there,
 working through a concept at the moment and was attempting to write
   lots
of
 data to few files as opposed to writing lots of data to lots of
  little
 files.  what are the thoughts on this?

 When I try and implement outputStream = hdfs.append(path); there
   doesn't
 seem to be any locking mechanism in place ... or there is and it
   doesn't
 work well enough for many writes per second?

 i have read and seen that the property dfs.support.append is not
   meant
for
 production use.  still, if millions of little files are as good or
   better
 --- or no difference -- to a few massive files then i suppose
 append
isn't
 something i really need.

 I do see a lot of stack traces with messages like:

 org.apache.hadoop.ipc.RemoteException:
 org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
 failed
  to
 create file /foo/bar/aaa.bbb.ccc.ddd.xxx for DFSClient_-1821265528
 on
client
 127.0.0.1 because current leaseholder is trying to recreate file.

 i hope this make sense.  still a little bit confused.

 thanks in advance
 -sd

 --
 Sasha Dolgy
 sasha.do...@gmail.com
   
  
 
 
 
  --
  Alpha Chapters of my book on Hadoop are available
  http://www.apress.com/book/view/9781430219422
  www.prohadoopbook.com a community for Hadoop Professionals
 



 --
 Sasha Dolgy
 sasha.do...@gmail.com




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: ClassNotFoundException

2009-05-09 Thread jason hadoop
rel is shorthand for the Hadoop release you are using: 0.18.x, 0.19.x, 0.20.x,
etc.
You must make all of the required jars available to all of your tasks. You
can either install them on all the tasktracker machines and set up the
tasktracker classpath to include them, or distribute them via the
distributed cache.

chapter 5 of my book goes into this in some detail, and is available now as
a download. http://www.apress.com/book/view/9781430219422
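
If you are driving the job from code rather than with -libjars on the command
line, something like the following sketch pushes a supporting jar through the
distributed cache and onto the task classpath; the jar path is a placeholder,
and the jar must already be in HDFS.

import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitWithExtraJars {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(SubmitWithExtraJars.class);
    // Adds the jar to every task's classpath via the distributed cache.
    DistributedCache.addFileToClassPath(new Path("/libs/hadoop-0.19.0-examples.jar"), conf);
    // ... set mapper, reducer, input and output paths here, then:
    JobClient.runJob(conf);
  }
}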

On Fri, May 8, 2009 at 4:22 PM, georgep p09...@gmail.com wrote:


 Sorry, I misspell you name, Jason

 George

 georgep wrote:
 
  Hi Joe,
 
  Thank you for the reply, but do I need to include every supporting jar
  file to the application path? What is the -rel-?
 
  George
 
 
  jason hadoop wrote:
 
  1) when running under windows, include the cygwin bin directory in your
  windows path environment variable
  2) eclipse is not so good at submitting supporting jar files, in your
   application launch path add a -libjars path/hadoop-rel-examples.jar.
 
 
  --
  Alpha Chapters of my book on Hadoop are available
  http://www.apress.com/book/view/9781430219422
  www.prohadoopbook.com a community for Hadoop Professionals
 
 
 
 

 --
 View this message in context:
 http://www.nabble.com/ClassNotFoundException-tp23441528p23455206.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Error when start hadoop cluster.

2009-05-09 Thread jason hadoop
Looks like you have different versions of the jars, or perhaps someone has
run ant in one of your installation directories.

On Fri, May 8, 2009 at 7:54 PM, nguyenhuynh.mr nguyenhuynh...@gmail.comwrote:

 Hi all!


 I cannot start hdfs successful. I checked log file and found following
 message:

 2009-05-09 08:17:55,026 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG:
 /
 STARTUP_MSG: Starting DataNode
 STARTUP_MSG:   host = haris1.asnet.local/192.168.1.180
 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 0.18.2
 STARTUP_MSG:   build =
 https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r
 709042; compiled by 'ndaley' on Thu Oct 30 01:07:18 UTC 2008
 /
 2009-05-09 08:17:55,302 ERROR org.apache.hadoop.dfs.DataNode:
 java.io.IOException: Incompatible namespaceIDs in
 /tmp/hadoop-root/dfs/data: namenode namespaceID = 880518114; datanode
 namespaceID = 461026751
at
 org.apache.hadoop.dfs.DataStorage.doTransition(DataStorage.java:226)
at

 org.apache.hadoop.dfs.DataStorage.recoverTransitionRead(DataStorage.java:141)
at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:306)
at org.apache.hadoop.dfs.DataNode.init(DataNode.java:223)
at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:3031)
at
 org.apache.hadoop.dfs.DataNode.instantiateDataNode(DataNode.java:2986)
at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:2994)
at org.apache.hadoop.dfs.DataNode.main(DataNode.java:3116)

 2009-05-09 08:17:55,303 INFO org.apache.hadoop.dfs.DataNode: SHUTDOWN_MSG:
 /
 SHUTDOWN_MSG: Shutting down DataNode at haris1.asnet.local/192.168.1.180
 /

 Please help me!


 P/s: I's using Hadoop-0.18.2 and Hbase 0.18.1


 Thanks,

 Best.

 Nguyen




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Most efficient way to support shared content among all mappers

2009-05-09 Thread jason hadoop
Thanks Jeff!

On Sat, May 9, 2009 at 1:31 PM, Jeff Hammerbacher ham...@cloudera.comwrote:

 Hey,

 For a more detailed discussion of how to use memcached for this purpose,
 see
 the paper Low-Latency, High-Throughput Access to Static Global Resources
 within the Hadoop Framework:
  http://www.umiacs.umd.edu/~jimmylin/publications/Lin_etal_TR2009.pdf
 .

 Regards,
 Jeff

 On Fri, May 8, 2009 at 2:49 PM, jason hadoop jason.had...@gmail.com
 wrote:

  Most of the people with this need are using some variant of memcached, or
  other distributed hash table.
 
  On Fri, May 8, 2009 at 10:07 AM, Joe joe_...@yahoo.com wrote:
 
  
   Hi,
   As a newcomer to Hadoop, I wonder any efficient way to support shared
    content among all mappers. For example, to implement a neural network
   algorithm, I want the NN data structure accessible by all mappers.
   Thanks for your comments!
   - Joe
  
  
  
  
 
 
  --
  Alpha Chapters of my book on Hadoop are available
  http://www.apress.com/book/view/9781430219422
  www.prohadoopbook.com a community for Hadoop Professionals
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Re-Addressing a cluster

2009-05-09 Thread jason hadoop
You should be able to relocate the cluster's IP space by stopping the
cluster, modifying the configuration files, resetting the DNS, and starting
the cluster.
It's best to verify connectivity with the new IP addresses before starting the
cluster.

To the best of my knowledge the namenode doesn't care about the IP addresses
of the datanodes, only what blocks they report as having. The namenode does
care about losing contact with a connected datanode, and will re-replicate the
blocks that are now under-replicated.

I prefer IP addresses in my configuration files, but that is a personal
preference, not a requirement.

On Sat, May 9, 2009 at 11:51 AM, John Kane john.k...@kane.net wrote:

 I have a situation that I have not been able to find in the mail archives.

 I have an active cluster that was built on a private switch with private IP
 address space (192.168.*.*)

 I need to relocate the cluster into real address space.

 Assuming that I have working DNS, is there an issue? Do I just need to be
 sure that I utilize hostnames for everything and then be fat, dumb and
 happy? Or are IP Addresses tracked by the namenode, etc?

 Thanks




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: large files vs many files

2009-05-08 Thread jason hadoop
Is it possible that two tasks are trying to write to the same file path?


On Fri, May 8, 2009 at 11:46 AM, Sasha Dolgy sdo...@gmail.com wrote:

 Hi Tom (or anyone else),
 Will SequenceFile allow me to avoid problems with concurrent writes to the
  file?  I still continue to get the following exceptions/errors in hdfs:

 org.apache.hadoop.ipc.RemoteException:
 org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:
 failed to create file /foo/bar/aaa.bbb.ccc.ddd.xxx for
 DFSClient_-1821265528
 on client 127.0.0.1 because current leaseholder is trying to recreate file.

 Only happens when two processes are trying to write at the same time.  Now
 ideally I don't want to buffer the data that's coming in and i want to get
 it out and into the file asap to avoid any data loss...am i missing
 something here?  is there some sort of factory i can implement to help in
 writing a lot of simultaneous data streams?

 thanks in advance for any suggestions
 -sasha

 On Wed, May 6, 2009 at 9:40 AM, Tom White t...@cloudera.com wrote:

  Hi Sasha,
 
  As you say, HDFS appends are not yet working reliably enough to be
  suitable for production use. On the other hand, having lots of little
  files is bad for the namenode, and inefficient for MapReduce (see
  http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/), so
  it's best to avoid this too.
 
  I would recommend using SequenceFile as a storage container for lots
  of small pieces of data. Each key-value pair would represent one of
  your little files (you can have a null key, if you only need to store
  the contents of the file). You can also enable compression (use block
  compression), and SequenceFiles are designed to work well with
  MapReduce.
 
  Cheers,
 
  Tom
 
  On Wed, May 6, 2009 at 12:34 AM, Sasha Dolgy sasha.do...@gmail.com
  wrote:
   hi there,
   working through a concept at the moment and was attempting to write
 lots
  of
   data to few files as opposed to writing lots of data to lots of little
   files.  what are the thoughts on this?
  
   When I try and implement outputStream = hdfs.append(path); there
 doesn't
   seem to be any locking mechanism in place ... or there is and it
 doesn't
   work well enough for many writes per second?
  
   i have read and seen that the property dfs.support.append is not
 meant
  for
   production use.  still, if millions of little files are as good or
 better
   --- or no difference -- to a few massive files then i suppose append
  isn't
   something i really need.
  
   I do see a lot of stack traces with messages like:
  
   org.apache.hadoop.ipc.RemoteException:
   org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to
   create file /foo/bar/aaa.bbb.ccc.ddd.xxx for DFSClient_-1821265528 on
  client
   127.0.0.1 because current leaseholder is trying to recreate file.
  
   i hope this make sense.  still a little bit confused.
  
   thanks in advance
   -sd
  
   --
   Sasha Dolgy
   sasha.do...@gmail.com
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Setting thread stack size for child JVM

2009-05-08 Thread jason hadoop
You can set mapred.child.java.opts on a per-job basis,
either via -D mapred.child.java.opts=<java options> on the command line or via
conf.set("mapred.child.java.opts", "<java options>") in the driver.

Note: the conf.set must be done before the job is submitted.
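
For example, here is a sketch of constraining the per-thread stack for one job
only (the heap and stack values are illustrative, not recommendations):

import org.apache.hadoop.mapred.JobConf;

public class StackSizeExample {
  public static void configure(JobConf conf) {
    // Applies only to this job's map/reduce child JVMs.
    conf.set("mapred.child.java.opts", "-Xmx200m -Xss256k");
  }
}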

On Fri, May 8, 2009 at 11:57 AM, Philip Zeyliger phi...@cloudera.comwrote:

 You could add -Xssn to the mapred.child.java.opts configuration
 setting.  That's controlling the Java stack size, which I think is the
 relevant bit for you.

 Cheers,

 -- Philip


 property
  namemapred.child.java.opts/name
  value-Xmx200m/value
  descriptionJava opts for the task tracker child processes.
  The following symbol, if present, will be interpolated: @taskid@ is
 replaced
  by current TaskID. Any other occurrences of '@' will go unchanged.
  For example, to enable verbose gc logging to a file named for the taskid
 in
  /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of:
-Xmx1024m -verbose:gc -Xloggc:/tmp/@tas...@.gc

  The configuration variable mapred.child.ulimit can be used to control the
  maximum virtual memory of the child processes.
  /description
 /property


 On Fri, May 8, 2009 at 11:16 AM, Ken Krugler kkrugler_li...@transpac.com
 wrote:

  Hi there,
 
  For a very specific type of reduce task, we currently need to use a large
  number of threads.
 
  To avoid running out of memory, I'd like to constrain the Linux stack
 size
  via a ulimit -s xxx shell script command before starting up the JVM. I
  could do this for the entire system at boot time, but it would be better
 to
  have it for just the Hadoop JVM(s).
 
  Any suggestions for how best to handle this?
 
  Thanks,
 
  -- Ken
  --
  Ken Krugler
  +1 530-210-6378
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Most efficient way to support shared content among all mappers

2009-05-08 Thread jason hadoop
Most of the people with this need are using some variant of memcached, or
other distributed hash table.
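
For a rough idea of the pattern, here is a sketch of a mapper doing lookups
against memcached from configure()/map(), assuming the spymemcached client
library is on the classpath; the host, port, and key scheme are placeholders
rather than anything from this thread.

import java.io.IOException;
import java.net.InetSocketAddress;
import net.spy.memcached.MemcachedClient;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SharedLookupMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private MemcachedClient client;

  public void configure(JobConf conf) {
    try {
      // One connection per task attempt; host and port are placeholders.
      client = new MemcachedClient(new InetSocketAddress("memcached-host", 11211));
    } catch (IOException e) {
      throw new RuntimeException("could not reach the shared store", e);
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    Object shared = client.get("model:" + value.toString());  // shared, read-only lookup
    if (shared != null) {
      output.collect(value, new Text(shared.toString()));
    }
  }

  public void close() {
    client.shutdown();
  }
}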

On Fri, May 8, 2009 at 10:07 AM, Joe joe_...@yahoo.com wrote:


 Hi,
 As a newcomer to Hadoop, I wonder any efficient way to support shared
 content among all mappers. For example, to implement an neural network
 algorithm, I want the NN data structure accessible by all mappers.
 Thanks for your comments!
 - Joe






-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: All keys went to single reducer in WordCount program

2009-05-07 Thread jason hadoop
Most likely the 3rd mapper ran as a speculative execution, and it is
possible that all of your keys hashed to a single partition. Also, if you
don't specify otherwise, the default is to run a single reduce task.

From JobConf:

  /**
   * Get configured the number of reduce tasks for this job. Defaults to
   * <code>1</code>.
   *
   * @return the number of reduce tasks for this job.
   */
  public int getNumReduceTasks() { return getInt("mapred.reduce.tasks", 1); }
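
With a single reduce task there is only one partition, so the even/odd hash
codes never matter. To spread keys across both slaves, set the reduce count
explicitly; the default HashPartitioner then sends each key to
(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. A minimal sketch:

import org.apache.hadoop.mapred.JobConf;

public class TwoReducerJob {
  public static void configure(JobConf conf) {
    // Without this, mapred.reduce.tasks defaults to 1 and every key lands
    // on the single reducer regardless of its hash code.
    conf.setNumReduceTasks(2);
  }
}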


On Thu, May 7, 2009 at 3:54 AM, Miles Osborne mi...@inf.ed.ac.uk wrote:

 with such a small data set who knows what will happen:  you are
 probably hitting minimal limits of some kind

 repeat this with more data

 Miles

 2009/5/7 Foss User foss...@gmail.com:
  I have two reducers running on two different machines. I ran the
  example word count program with some of my own System.out.println()
  statements to see what is going on.
 
  There were 2 slaves each running datanode as well as tasktracker.
  There was one namenode and one jobtracker. I know there is a very
  elaborate setup for such a small cluster but I did it only to learn.
 
  I gave two input files, a.txt and b.txt with a few lines of english
  text. Now, here are my questions.
 
  (1) I found that three mapper tasks ran, all in the first slave. The
  first task processed the first file. The second task processed the
  second file. The third task didn't process anything. Why is it that
  the third task did not process anything? Why was this task created in
  the first place?
 
  (2) I found only one reducer task, on the second slave. It processed
  all the values for keys. keys were words in this case of Text type. I
  tried printing out the key.hashCode() for each key and some of them
  were even and some of them were odd. I was expecting the keys with
  even hashcodes to go to one slave and the others to go to another
  slave. Why didn't this happen?
 



 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: how to improve the Hadoop's capability of dealing with small files

2009-05-07 Thread jason hadoop
The way I typically address that is to pack the small files into a zip file
using the standard zip utilities, commonly for output.
HDFS is not optimized for low latency, but for high throughput on bulk
operations.
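
As a sketch of that approach (the output path is a placeholder, and the small
files are represented here as in-memory byte arrays), many small files can be
bundled into a single zip stored in HDFS:

import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallFileBundler {
  public static void bundle(Configuration conf, String[] names, byte[][] contents)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    FSDataOutputStream out = fs.create(new Path("/bundles/small-files.zip"));
    ZipOutputStream zip = new ZipOutputStream(out);
    for (int i = 0; i < names.length; i++) {
      zip.putNextEntry(new ZipEntry(names[i]));  // one entry per small file
      zip.write(contents[i]);
      zip.closeEntry();
    }
    zip.close();  // also closes the underlying HDFS stream
  }
}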

2009/5/7 Edward Capriolo edlinuxg...@gmail.com

 2009/5/7 Jeff Hammerbacher ham...@cloudera.com:
  Hey,
 
  You can read more about why small files are difficult for HDFS at
  http://www.cloudera.com/blog/2009/02/02/the-small-files-problem.
 
  Regards,
  Jeff
 
  2009/5/7 Piotr Praczyk piotr.prac...@gmail.com
 
  If You want to use many small files, they are probably having the same
  purpose and struc?
  Why not use HBase instead of a raw HDFS ? Many small files would be
 packed
  together and the problem would disappear.
 
  cheers
  Piotr
 
  2009/5/7 Jonathan Cao jonath...@rockyou.com
 
   There are at least two design choices in Hadoop that have implications
  for
   your scenario.
   1. All the HDFS meta data is stored in name node memory -- the memory
  size
   is one limitation on how many small files you can have
  
   2. The efficiency of map/reduce paradigm dictates that each
  mapper/reducer
   job has enough work to offset the overhead of spawning the job.  It
  relies
   on each task reading contiguous chuck of data (typically 64MB), your
  small
   file situation will change those efficient sequential reads to larger
   number
   of inefficient random reads.
  
   Of course, small is a relative term?
  
   Jonathan
  
   2009/5/6 陈桂芬 chenguifen...@163.com
  
Hi:
   
In my application, there are many small files. But the hadoop is
  designed
to deal with many large files.
   
I want to know why hadoop doesn't support small files very well and
  where
is the bottleneck. And what can I do to improve the Hadoop's
 capability
   of
dealing with small files.
   
Thanks.
   
   
  
 
 
 When the small file problem comes up most of the talk centers around
 the inode table being in memory. The cloudera blog points out
 something:

 Furthermore, HDFS is not geared up to efficiently accessing small
 files: it is primarily designed for streaming access of large files.
 Reading through small files normally causes lots of seeks and lots of
 hopping from datanode to datanode to retrieve each small file, all of
 which is an inefficient data access pattern.

 My application attempted to load 9000 6Kb files using a single
 threaded application and the FSOutpustStream objects to write directly
 to hadoop files. My plan was to have hadoop merge these files in the
 next step. I had to abandon this plan because this process was taking
 hours. I knew HDFS had a small file problem but I never realized
 that I could not do this problem the 'old fashioned way'. I merged the
 files locally and uploading a few small files gave great throughput.
 Small files is not just a permanent storage issue it is a serious
 optimization.




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Large number of map output keys and performance issues.

2009-05-07 Thread jason hadoop
It may simply be that your JVMs are spending their time doing garbage
collection instead of running your tasks.
My book, in chapter 6, has a section on how to tune your jobs and how to
determine what to tune. That chapter is available now as an alpha.

On Wed, May 6, 2009 at 1:29 PM, Todd Lipcon t...@cloudera.com wrote:

 Hi Tiago,

 Here are a couple thoughts:

 1) How much data are you outputting? Obviously there is a certain amount of
 IO involved in actually outputting data versus not ;-)

 2) Are you using a reduce phase in this job? If so, since you're cutting
 off
 the data at map output time, you're also avoiding a whole sort computation
 which involves significant network IO, etc.

 3) What version of Hadoop are you running?

 Thanks
 -Todd

 On Wed, May 6, 2009 at 12:23 PM, Tiago Macambira macamb...@gmail.com
 wrote:

  I am developing a MR application w/ hadoop that is generating during it's
  map phase a really large number of output keys and it is having an
 abysmal
  performance.
 
  While just reading the said data takes 20 minutes and processing it but
 not
  outputting anything from the map takes around 30 min, running the full
  application takes around 4 hours. Is this a known or expected issue?
 
  Cheers.
  Tiago Alves Macambira
  --
  I may be drunk, but in the morning I will be sober, while you will
  still be stupid and ugly. -Winston Churchill
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Using multiple FileSystems in hadoop input

2009-05-07 Thread jason hadoop
I have used multiple file systems in jobs, but have not used Har as one of
them. It worked for me in 0.18.

On Wed, May 6, 2009 at 4:07 AM, Tom White t...@cloudera.com wrote:

 Hi Ivan,

 I haven't tried this combination, but I think it should work. If it
 doesn't it should be treated as a bug.

 Tom

 On Wed, May 6, 2009 at 11:46 AM, Ivan Balashov ibalas...@iponweb.net
 wrote:
  Greetings to all,
 
  Could anyone suggest if Paths from different FileSystems can be used as
  input of Hadoop job?
 
  Particularly I'd like to find out whether Paths from HarFileSystem can be
  mixed with ones from DistributedFileSystem.
 
  Thanks,
 
 
  --
  Kind regards,
  Ivan
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: multi-line records and file splits

2009-05-06 Thread jason hadoop
Hey Tom, I had no luck using the StreamingXmlRecordReader for non-XML files.
Are there any parameters that you need to add in? I was testing with 0.19.0.

On Wed, May 6, 2009 at 5:25 AM, Sharad Agarwal shara...@yahoo-inc.comwrote:


 The split doesn't need to be at the record boundary. If a mapper gets
 a partial record, it will seek to another split to get the full record.

 - Sharad




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: specific block size for a file

2009-05-05 Thread jason hadoop
Please try  -D dfs.block.size=4096000
The specification must be in bytes.
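
The block size can also be set per file programmatically at create time; the
sketch below is illustrative (buffer size, replication factor, and the 4 MB
block size are placeholders to adapt).

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallBlockWriter {
  public static FSDataOutputStream create(Configuration conf, Path dest) throws IOException {
    FileSystem fs = dest.getFileSystem(conf);
    // create(path, overwrite, bufferSize, replication, blockSize)
    return fs.create(dest, true, 64 * 1024, (short) 3, 4L * 1024 * 1024);
  }
}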


On Tue, May 5, 2009 at 4:47 AM, Christian Ulrik Søttrup soett...@nbi.dkwrote:

 Hi all,

 I have a job that creates very big local files so i need to split it to as
 many mappers as possible. Now the DFS block size I'm
 using means that this job is only split to 3 mappers. I don't want to
 change the hdfs wide block size because it works for my other jobs.

 Is there a way to give a specific file a different block size. The
 documentation says it is, but does not explain how.
 I've tried:
 hadoop dfs -D dfs.block.size=4M -put file  /dest/

 But that does not work.

 any help would be apreciated.

 Cheers,
 Chrulle




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Sorting on several columns using KeyFieldSeparator and Paritioner

2009-05-05 Thread jason hadoop
You must have only 3 fields in your keys.
Try this - it is my best guess based on your code. Appendix A of my book has
a detailed discussion of these fields and the gotchas, and the example code
has test classes that allow you to try different keys with different input
to see how the parts are actually broken out of the keys.

jobConf.set("mapred.text.key.partitioner.options", "-k2.2 -k1.1 -k3.3");
jobConf.set("mapred.text.key.comparator.options", "-k2.2r -k1.1rn -k3.3n");


On Tue, May 5, 2009 at 2:05 AM, Min Zhou coderp...@gmail.com wrote:

 I came across the same failure.  Anyone can solve this problem?

 On Sun, Jan 18, 2009 at 9:06 AM, Saptarshi Guha saptarshi.g...@gmail.com
 wrote:

  Hello,
  I have  a file with n columns, some which are text and some numeric.
  Given a sequence of indices, i would like to sort on those indices i.e
  first on Index1, then within Index2 and so on.
  In the example code below, i have 3 columns, numeric, text, numeric,
  space separated.
  Sort on 2(reverse), then 1(reverse,numeric) and lastly 3
 
  Though my code runs (and gives wrong results,col 2 is sorted in
  reverse, and within that col3 which is treated as tex and then col1 )
  on the local, when distributed I get a merge error - my guess is
  fixing the latter fixes the former.
 
  This is the error:
  java.io.IOException: Final merge failed
 at
 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier.createKVIterator(ReduceTask.java:2093)
 at
 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier.access$400(ReduceTask.java:457)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:380)
 at org.apache.hadoop.mapred.Child.main(Child.java:155)
  Caused by: java.lang.ArrayIndexOutOfBoundsException: 562
 at
 
 org.apache.hadoop.io.WritableComparator.compareBytes(WritableComparator.java:128)
 at
 
 org.apache.hadoop.mapred.lib.KeyFieldBasedComparator.compareByteSequence(KeyFieldBasedComparator.java:109)
 at
 
 org.apache.hadoop.mapred.lib.KeyFieldBasedComparator.compare(KeyFieldBasedComparator.java:85)
 at
  org.apache.hadoop.mapred.Merger$MergeQueue.lessThan(Merger.java:308)
 at
  org.apache.hadoop.util.PriorityQueue.downHeap(PriorityQueue.java:144)
 at
  org.apache.hadoop.util.PriorityQueue.adjustTop(PriorityQueue.java:103)
 at
 
 org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:270)
 at
 org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:285)
 at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:108)
 at
 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier.createKVIterator(ReduceTask.java:2087)
 ... 3 more
 
 
  Thanks for your time
 
  And the code (not too big) is
  ==CODE==
 
  public class RMRSort extends Configured implements Tool {
 
    static class RMRSortMap extends MapReduceBase implements
   Mapper<LongWritable, Text, Text, Text> {
  
  public void map(LongWritable key, Text value, OutputCollector<Text,
   Text> output, Reporter reporter)  throws IOException {
  output.collect(value,value);
  }
    }
  
  static class RMRSortReduce extends MapReduceBase implements
   Reducer<Text, Text, NullWritable, Text> {
  
  public void reduce(Text key, Iterator<Text>
   values, OutputCollector<NullWritable, Text> output, Reporter reporter)
   throws IOException {
 NullWritable n = NullWritable.get();
 while(values.hasNext())
 output.collect(n,values.next() );
 }
 }
 
 
 static JobConf createConf(String rserveport,String uid,String
  infolder, String outfolder)
 Configuration defaults = new Configuration();
 JobConf jobConf = new JobConf(defaults, RMRSort.class);
 jobConf.setJobName(Sorter: +uid);
 jobConf.addResource(new
  Path(System.getenv(HADOOP_CONF_DIR)+/hadoop-site.xml));
  //  jobConf.set(mapred.job.tracker, local);
 jobConf.setMapperClass(RMRSortMap.class);
  jobConf.setReducerClass(RMRSortReduce.class);
  jobConf.set("map.output.key.field.separator", fsep);
  jobConf.setPartitionerClass(KeyFieldBasedPartitioner.class);
  jobConf.set("mapred.text.key.partitioner.options", "-k2,2 -k1,1 -k3,3");
  
   jobConf.setOutputKeyComparatorClass(KeyFieldBasedComparator.class);
  jobConf.set("mapred.text.key.comparator.options", "-k2r,2r -k1rn,1rn -k3n,3n");
  //infolder, outfolder information removed
 jobConf.setMapOutputKeyClass(Text.class);
 jobConf.setMapOutputValueClass(Text.class);
 jobConf.setOutputKeyClass(NullWritable.class);
 return(jobConf);
 }
 public int run(String[] args) throws Exception {
 return(1);
 }
 
  }
 
 
 
 
  --
  Saptarshi Guha - saptarshi.g...@gmail.com
 



 --
 My research interests are distributed systems, parallel computing and
 bytecode based virtual machine.

 My profile:
 http://www.linkedin.com/in/coderplay
 My blog:
 http://coderplay.javaeye.com





Re: specific block size for a file

2009-05-05 Thread jason hadoop
It is a trade-off between HDFS efficiency and data locality.


On Tue, May 5, 2009 at 9:37 AM, Arun C Murthy a...@yahoo-inc.com wrote:


 On May 5, 2009, at 4:47 AM, Christian Ulrik Søttrup wrote:

  Hi all,

 I have a job that creates very big local files so i need to split it to as
 many mappers as possible. Now the DFS block size I'm
 using means that this job is only split to 3 mappers. I don't want to
 change the hdfs wide block size because it works for my other jobs.


 I would rather keep the big files on HDFS and use -Dmapred.min.split.size
 to get more maps to process your data

 http://hadoop.apache.org/core/docs/r0.20.0/mapred_tutorial.html#Job+Input

 Arun




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals


Re: Global object for a map task

2009-05-03 Thread jason hadoop
If it is relatively small you can pass it via the JobConf object, storing a
serialized version of your dataset.
If it is larger you can pass a serialized version via the distributed cache.
Your map task will need to deserialize the object in the configure method.

None of the above methods give you an object that is write-shared between
map tasks.

Please remember that the map tasks execute in separate JVMs on distinct
machines in the normal MapReduce environment.



On Sat, May 2, 2009 at 10:59 PM, Amandeep Khurana ama...@gmail.com wrote:

 How can I create a global variable for each node running my map task. For
 example, a common ArrayList that my map function can access for every k,v
 pair it works on. It doesnt really need to create the ArrayList everytime.

 If I create it in the main function of the job, the map function gets a
 null
 pointer exception. Where else can this be created?

 Amandeep


 Amandeep Khurana
 Computer Science Graduate Student
 University of California, Santa Cruz




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: Global object for a map task

2009-05-03 Thread jason hadoop
Define some key of yours, say my.app.array, and
serialize the object to a string, say myObjectString.

Then conf.set("my.app.array", myObjectString).

Then, in your Mapper.configure() method:
String myObjectString = conf.get("my.app.array");
and deserialize your object from it.

A Google search on "java object serialization to string" will provide you
with many examples, such as
http://www.velocityreviews.com/forums/showpost.php?p=3185744&postcount=10.
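
Here is a minimal sketch of that round trip, assuming plain Java serialization
plus the commons-codec Base64 class found in Hadoop's lib directory (any
Base64 helper will do); the my.app.array key is just an example name.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import org.apache.commons.codec.binary.Base64;
import org.apache.hadoop.mapred.JobConf;

public class ConfObjectPasser {
  public static void store(JobConf conf, Serializable obj) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bytes);
    out.writeObject(obj);
    out.close();
    // Base64 keeps the serialized bytes safe inside the XML-backed configuration.
    conf.set("my.app.array", new String(Base64.encodeBase64(bytes.toByteArray()), "UTF-8"));
  }

  public static Object load(JobConf conf) throws IOException, ClassNotFoundException {
    byte[] raw = Base64.decodeBase64(conf.get("my.app.array").getBytes("UTF-8"));
    ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw));
    try {
      return in.readObject();   // call this from Mapper.configure()
    } finally {
      in.close();
    }
  }
}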



On Sat, May 2, 2009 at 11:56 PM, Amandeep Khurana ama...@gmail.com wrote:

 Thanks Jason.

 My object is relatively small. But how do I pass it via the JobConf object?
 Can you elaborate a bit...



 Amandeep Khurana
 Computer Science Graduate Student
 University of California, Santa Cruz


 On Sat, May 2, 2009 at 11:53 PM, jason hadoop jason.had...@gmail.com
 wrote:

  If it is relatively small you can pass it via the JobConf object, storing
 a
  serialized version of your dataset.
  If it is larger you can pass a serialized version via the distributed
  cache.
  Your map task will need to deserialize the object in the configure
 method.
 
  None of the above methods give you an object that is write shared between
  map tasks.
 
  Please remember that the map tasks execute in separate JVM's on distinct
  machines in the normal MapReduce environment.
 
 
 
  On Sat, May 2, 2009 at 10:59 PM, Amandeep Khurana ama...@gmail.com
  wrote:
 
   How can I create a global variable for each node running my map task.
 For
   example, a common ArrayList that my map function can access for every
 k,v
   pair it works on. It doesnt really need to create the ArrayList
  everytime.
  
   If I create it in the main function of the job, the map function gets a
   null
   pointer exception. Where else can this be created?
  
   Amandeep
  
  
   Amandeep Khurana
   Computer Science Graduate Student
   University of California, Santa Cruz
  
 
 
 
  --
  Alpha Chapters of my book on Hadoop are available
  http://www.apress.com/book/view/9781430219422
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: Sequence of Streaming Jobs

2009-05-02 Thread jason hadoop
If you are using sh or bash, the variable $? holds the exit status of
the last command to execute.

hadoop jar streaming.jar ...
if [ $? -ne 0 ]; then
echo My job failed 2>&1
exit 1
fi

Caution: $? is the exit status of the very last command to execute. It is easy to
run another command before testing and then test the wrong command's exit
status.



On Sat, May 2, 2009 at 11:56 AM, Mayuran Yogarajah 
mayuran.yogara...@casalemedia.com wrote:

 Billy Pearson wrote:

 I done this with and array of commands for the jobs in a php script
 checking
 the return of the job to tell if it failed or not.

 Billy



 I have this same issue.. How do you check if a job failed or not? You
 mentioned checking
 the return code? How are you doing that ?

 thanks

  Dan Milstein dmilst...@hubteam.com wrote in
 message news:58d66a11-b59c-49f8-b72f-7507482c3...@hubteam.com...


 If I've got a sequence of streaming jobs, each of which depends on the
 output of the previous one, is there a good way to launch that  sequence?
 Meaning, I want step B to only start once step A has  finished.

 From within Java JobClient code, I can do submitJob/runJob, but is  there
 any sort of clean way to do this for a sequence of streaming jobs?

 Thanks,
 -Dan Milstein











-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: cannot open an hdfs file in O_RDWR mode

2009-05-01 Thread jason hadoop
In Hadoop 0.19.1 (and 0.19.0), libhdfs (which is used by the fuse package for
HDFS access) explicitly denies open requests that pass O_RDWR.

If you have binary applications that pass the flag, but would work correctly
given the limitations of HDFS, you may alter the code in
src/c++/libhdfs/hdfs.c to allow it, or build a shared library that you
preload that changes the flags passed to the real open. Hacking hdfs.c is
much simpler.

Line 407 of hdfs.c

jobject jFS = (jobject)fs;

if (flags  O_RDWR) {
  fprintf(stderr, ERROR: cannot open an hdfs file in O_RDWR mode\n);
  errno = ENOTSUP;
  return NULL;
}




On Fri, May 1, 2009 at 6:34 PM, Philip Zeyliger phi...@cloudera.com wrote:

 HDFS does not allow you to overwrite bytes of a file that have already been
 written.  The only operations it supports are read (an existing file),
 write
 (a new file), and (in newer versions, not always enabled) append (to an
 existing file).

 -- Philip

 On Fri, May 1, 2009 at 5:56 PM, Robert Engel enge...@ligo.caltech.edu
 wrote:

  -BEGIN PGP SIGNED MESSAGE-
  Hash: SHA1
 
  Hello,
 
 I am using Hadoop on a small storage cluster (x86_64, CentOS 5.3,
  Hadoop-0.19.1). The hdfs is mounted using fuse and everything seemed
  to work just fine so far. However, I noticed that I cannot:
 
  1) use svn to check out files on the mounted hdfs partition
  2) request that stdout and stderr of Globus jobs is written to the
  hdfs partition
 
  In both cases I see following error message in /var/log/messages:
 
  fuse_dfs: ERROR: could not connect open file fuse_dfs.c:1364
 
  When I run fuse_dfs in debugging mode I get:
 
  ERROR: cannot open an hdfs file in O_RDWR mode
  unique: 169, error: -5 (Input/output error), outsize: 16
 
  My question is if this is a general limitation of Hadoop or if this
  operation is just not currently supported? I searched Google and JIRA
  but could not find an answer.
 
  Thanks,
  Robert
 
  -BEGIN PGP SIGNATURE-
  Version: GnuPG v1.4.9 (GNU/Linux)
  Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org
 
  iEYEARECAAYFAkn7mksACgkQrxCAtr5BXdMx5wCeICTHQbOwjZoGpVTO6ayd7l7t
  LXoAn0WBwfo6ZYdJX1sh2eO2owAR0HLm
  =PUCc
  -END PGP SIGNATURE-
 
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: unable to see anything in stdout

2009-05-01 Thread jason hadoop
The local runner does less work: it skips setting up the input splits, distributing the job jar
files, scheduling the map tasks on the tasktrackers, collecting the task
status results, then starting all the reduce tasks, collecting all the
results, sorting them, feeding them to the reduce tasks, and then writing them
to HDFS.
There is no JVM forking, etc.
The overhead of coordination is not small, but it pays off when the jobs are
large and the number of machines is large.

Local mode works great for jobs that really only need the resources of a single
machine and only require a single reduce task.

On Fri, May 1, 2009 at 6:09 PM, Asim linka...@gmail.com wrote:

 Thanks Aaron. That worked! However, when i run everything as local, I
 see everything executing much faster on local as compared to a single
 node. Is there any reason for the same?

 -Asim

 On Thu, Apr 30, 2009 at 9:23 AM, Aaron Kimball aa...@cloudera.com wrote:
  First thing I would do is to run the job in the local jobrunner (as a
 single
  process on your local machine without involving the cluster):
 
  JobConf conf = .
  // set other params, mapper, etc. here
  conf.set(mapred.job.tracker, local); // use localjobrunner
  conf.set(fs.default.name, file:///); // read from local hard disk
  instead of hdfs
 
  JobClient.runJob(conf);
 
 
  This will actually print stdout, stderr, etc. to your local terminal. Try
  this on a single input file. This will let you confirm that it does, in
  fact, write to stdout.
 
  - Aaron
 
  On Thu, Apr 30, 2009 at 9:00 AM, Asim linka...@gmail.com wrote:
 
  Hi,
 
  I am not able to see any job output in userlogs/task_id/stdout. It
  remains empty even though I have many println statements. Are there
  any steps to debug this problem?
 
  Regards,
  Asim
 
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: What's the local heap size of Hadoop? How to increase it?

2009-04-29 Thread jason hadoop
You can also specify it on the command line of your hadoop job:
hadoop jar jarfile [main class] -D mapred.child.java.opts=-Xmx800M other
arguments
Note: there is a space between the -D and mapred, and the -D has to come
after the main class specification.

This parameter may also be specified via
conf.set("mapred.child.java.opts", "-Xmx800m"); before submitting the job.

On Wed, Apr 29, 2009 at 2:59 PM, Bhupesh Bansal bban...@linkedin.comwrote:

 Hey ,

 Try adding

  property
namemapred.child.java.opts/name
value-Xmx800M -server/value
  /property

 With the right JVM size in your hadoop-site.xml , you will have to copy
 this
 to all mapred nodes and restart the cluster.

 Best
 Bhupesh



 On 4/29/09 2:03 PM, Jasmine (Xuanjing) Huang xjhu...@cs.umass.edu
 wrote:

  Hi, there,
 
  What's the local heap size of Hadoop? I have tried to load a local cache
  file which is composed of 500,000 short phrase, but the task failed. The
  output of Hadoop looks like(com.aliasi.dict.ExactDictionaryChunker is a
  third-party jar package, and the records is organized as a trie
 struction):
 
  java.lang.OutOfMemoryError: Java heap space
  at java.util.HashMap.addEntry(HashMap.java:753)
  at java.util.HashMap.put(HashMap.java:385)
  at
  com.aliasi.dict.ExactDictionaryChunker$TrieNode.getOrCreateDaughter(Ex
  actDictionaryChunker.java:476)
  at
  com.aliasi.dict.ExactDictionaryChunker$TrieNode.add(ExactDictionaryChu
  nker.java:484)
 
  When I reduce the total record number to 30,000. My mapreduce job became
  succeed. So, I have a question, What's the local heap size of Hadoop's
 Java
  Virtual Machine? How to increase it?
 
  Best,
  Jasmine
 




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: Load .so library error when Hadoop calls JNI interfaces

2009-04-29 Thread jason hadoop
You need to make sure that the shared library is available on the
tasktracker nodes, either by installing it or by pushing it out via the
distributed cache.
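
A sketch of the distributed-cache route is below; the HDFS path and library
name are placeholders, and it assumes the task working directory (where the
symlink lands) ends up on java.library.path, otherwise fall back to
System.load() with the absolute path of the symlinked file.

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class NativeLibShipper {
  public static void ship(JobConf conf) throws IOException {
    // The library must already be in HDFS; the fragment after '#' names the
    // symlink created in each task's working directory.
    DistributedCache.addCacheFile(URI.create("/libs/libmyclass.so#libmyclass.so"), conf);
    DistributedCache.createSymlink(conf);
  }
}

In the task itself, System.loadLibrary("myclass") should then resolve against
the symlinked copy on each tasktracker node.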



On Wed, Apr 29, 2009 at 8:19 PM, Ian jonhson jonhson@gmail.com wrote:

 Dear all,

 I wrote a plugin codes for Hadoop, which calls the interfaces
 in Cpp-built .so library. The plugin codes are written in java,
 so I prepared a JNI class to encapsulate the C interfaces.

 The java codes can be executed successfully when I compiled
 it and run it standalone. However, it does not work when I embedded
 in Hadoop. The exception shown out is (found in Hadoop logs):


   screen dump  -

 # grep myClass logs/* -r
 logs/hadoop-hadoop-tasktracker-testbed0.container.org.out:Exception in
 thread JVM Runner jvm_200904261632_0001_m_-1217897050 spawned.
 java.lang.UnsatisfiedLinkError:
 org.apache.hadoop.mapred.myClass.myClassfsMount(Ljava/lang/String;)I
 logs/hadoop-hadoop-tasktracker-testbed0.container.org.out:  at
 org.apache.hadoop.mapred.myClass.myClassfsMount(Native Method)
 logs/hadoop-hadoop-tasktracker-testbed0.container.org.out:Exception in
 thread JVM Runner jvm_200904261632_0001_m_-1887898624 spawned.
 java.lang.UnsatisfiedLinkError:
 org.apache.hadoop.mapred.myClass.myClassfsMount(Ljava/lang/String;)I
 logs/hadoop-hadoop-tasktracker-testbed0.container.org.out:  at
 org.apache.hadoop.mapred.myClass.myClassfsMount(Native Method)
 ...

 

 It seems the library can not be loaded in Hadoop. My codes
 (myClass.java) is like:


 ---  myClass.java  --

 public class myClass
 {

public static final Log LOG =
LogFactory.getLog(org.apache.hadoop.mapred.myClass);


public myClass()   {

try {
//System.setProperty(java.library.path,
 /usr/local/lib);

/* The above line does not work, so I have to
 do something
 * like following line.
 */
addDir(new String(/usr/local/lib));
System.loadLibrary(myclass);
}
catch(UnsatisfiedLinkError e) {
LOG.info( Cannot load library:\n  +
e.toString() );
}
catch(IOException ioe) {
LOG.info( IO error:\n  +
ioe.toString() );
}

}

/* Since the System.setProperty() does not work, I have to add the
 following
 * function to force the path is added in java.library.path
 */
public static void addDir(String s) throws IOException {

try {
Field field =
 ClassLoader.class.getDeclaredField(usr_paths);
 field.setAccessible(true);
String[] paths = (String[])field.get(null);
for (int i = 0; i  paths.length; i++) {
if (s.equals(paths[i])) {
return;
}
}
String[] tmp = new String[paths.length+1];
System.arraycopy(paths,0,tmp,0,paths.length);
tmp[paths.length] = s;

field.set(null,tmp);
} catch (IllegalAccessException e) {
throw new IOException(Failed to get
 permissions to set library path);
} catch (NoSuchFieldException e) {
throw new IOException(Failed to get field
 handle to set library path);
}
}

public native int myClassfsMount(String subsys);
public native int myClassfsUmount(String subsys);


 }

 


 I don't know what missed in my codes and am wondering whether there are any
 rules in Hadoop I should obey if I want to  achieve my target.

 FYI, the myClassfsMount() and myClassfsUmount() will open a socket to call
 services from a daemon. I would better if this design did not cause the
 fail in
 my codes.


 Any comments?


 Thanks in advance,

 Ian




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: streaming but no sorting

2009-04-28 Thread jason hadoop
It may be simpler to just have a post-processing step that uses something
like multi-file input to aggregate the results.

As a complete sideways thinking solution, I suspect you have far more map
tasks than you have physical machines,
instead of writing your output via output.collect, your tasks could open a
'side effect file' and append to it; since these are in the local file
system, you actually have the ability to append to them. You will need to
play some interesting games with the OutputCommitter, though.

An alternative would be to write N output records, where N is the number of
reduces, where each of the N keys is guaranteed to to a unique reduce task,
and the value of the record is the local file name and the host name.
The side effect files would need to be written into the job working area or
some public area on the node., rather than the task output area, or the
output committer could place them in the proper place (that way failed tasks
are handled correctly).

The reduce then reads the keys it has, opens and concatenates the files that
are on its machine, and very, very little sorting happens.




2009/4/28 Dmitry Pushkarev u...@stanford.edu

 Hi.



 I'm writing streaming based tasks that involves running thousands of
 mappers, after that I want to put all these outputs into small number (say
 30) output files mainly so that disk space will be used more efficiently,
 the way I'm doing it right now is using /bin/cat as reducer and setting
 number of reducers to desired. This involves two highly ineffective (for
 the
 task) steps - sorting and fetching.  Is there a way to get around that?

 Ideally I'd want all mapper outputs to be written to one file, one record
 per line.



 Thanks.



 ---

 Dmitry Pushkarev

 +1-650-644-8988






-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: streaming but no sorting

2009-04-28 Thread jason hadoop
There has to be a simpler way :)


On Tue, Apr 28, 2009 at 9:22 PM, jason hadoop jason.had...@gmail.comwrote:

 It may be simpler to just have a post-processing step that uses something
 like multi-file input to aggregate the results.

 As a complete sideways thinking solution, I suspect you have far more map
 tasks than you have physical machines,
 instead of writing your output via output.collect, your tasks could open a
 'side effect file' and append to it; since these are in the local file
 system, you actually have the ability to append to them. You will need to
 play some interesting games with the OutputCommitter, though.

 An alternative would be to write N output records, where N is the number of
 reduces, where each of the N keys is guaranteed to to a unique reduce task,
 and the value of the record is the local file name and the host name.
 The side effect files would need to be written into the job working area or
 some public area on the node., rather than the task output area, or the
 output committer could place them in the proper place (that way failed tasks
 are handled correctly).

 The reduce then reads the keys it has, opens and concatenates the files that
 are on its machine, and very, very little sorting happens.




 2009/4/28 Dmitry Pushkarev u...@stanford.edu

 Hi.



 I'm writing streaming based tasks that involves running thousands of
 mappers, after that I want to put all these outputs into small number (say
 30) output files mainly so that disk space will be used more efficiently,
 the way I'm doing it right now is using /bin/cat as reducer and setting
 number of reducers to desired. This involves two highly ineffective (for
 the
 task) steps - sorting and fetching.  Is there a way to get around that?

 Ideally I'd want all mapper outputs to be written to one file, one record
 per line.



 Thanks.



 ---

 Dmitry Pushkarev

 +1-650-644-8988






 --
 Alpha Chapters of my book on Hadoop are available
 http://www.apress.com/book/view/9781430219422




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: How to write large string to file in HDFS

2009-04-28 Thread jason hadoop
How about wrapping the HDFS output stream in an OutputStreamWriter, e.g.
new OutputStreamWriter(out, "UTF-8"), and writing the string out in chunks
instead of converting the whole thing to a byte array first?
replace UTF-8 with an appropriate charset.
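
A sketch of that idea (the destination path is up to you): copy the
StringBuffer out through a small reusable char buffer so no full-size byte
array or extra String copy is ever built.

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LargeStringWriter {
  public static void write(Configuration conf, StringBuffer sb, Path dest) throws IOException {
    FileSystem fs = dest.getFileSystem(conf);
    Writer w = new BufferedWriter(new OutputStreamWriter(fs.create(dest), "UTF-8"));
    char[] buf = new char[64 * 1024];
    // Stream the buffer out in chunks; only buf-sized pieces are ever duplicated.
    for (int i = 0; i < sb.length(); i += buf.length) {
      int n = Math.min(buf.length, sb.length() - i);
      sb.getChars(i, i + n, buf, 0);
      w.write(buf, 0, n);
    }
    w.close();
  }
}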


On Tue, Apr 28, 2009 at 7:47 PM, nguyenhuynh.mr nguyenhuynh...@gmail.comwrote:

 Hi all!


 I have the large String and I want to write it into the file in HDFS.

 (The large string has 100.000 lines.)


 Current, I use method copyBytes of class org.apache.hadoop.io.IOUtils.
 But the copyBytes request the InputStream of content. Therefore, I have
 to convert the String to InputStream, some things like:



InputStream in=new ByteArrayInputStream(sb.toString().getBytes());

The sb is a StringBuffer.


 It not work with the command line above. :(

 There is the error:

 Exception in thread main java.lang.OutOfMemoryError: Java heap space
at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:232)
at java.lang.StringCoding.encode(StringCoding.java:272)
at java.lang.String.getBytes(String.java:947)
at asnet.haris.mapred.jobs.Test.main(Test.java:32)



 Please give me the good solution!


 Thanks,


 Best regards,

 Nguyen,






-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: Storing data-node content to other machine

2009-04-27 Thread jason hadoop
There is no requirement that your hdfs and mapred clusters share an
installation directory, it is just done that way because it is simple and
most people have a datanode and tasktracker on each slave node.

Simply have 2 configuration directories on your cluster machines, and us the
bin/start-dfs.sh script in one, and the bin/start-mapred.sh script in the
other, and maintain different slaves files in the two directories.

You will loose the benefit of data locality for your tasktrackers which do
not reside on the datanode machines.

On Sun, Apr 26, 2009 at 10:06 PM, Vishal Ghawate 
vishal_ghaw...@persistent.co.in wrote:

 Hi,
 I want to store the contents of all the client machine(datanode)of hadoop
 cluster to centralized machine
  with high storage capacity.so that tasktracker will be on the client
 machine but the contents are stored on the
 centralized machine.
Can anybody help me on this please.

 DISCLAIMER
 ==
 This e-mail may contain privileged and confidential information which is
 the property of Persistent Systems Ltd. It is intended only for the use of
 the individual or entity to which it is addressed. If you are not the
 intended recipient, you are not authorized to read, retain, copy, print,
 distribute or use this message. If you have received this communication in
 error, please notify the sender and delete all copies of this message.
 Persistent Systems Ltd. does not accept any liability for virus infected
 mails.




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


Re: IO Exception in Map Tasks

2009-04-27 Thread jason hadoop
The jvm had a hard failure and crashed


On Sun, Apr 26, 2009 at 11:34 PM, Rakhi Khatwani
rakhi.khatw...@gmail.comwrote:

 Hi,

  In one of the map tasks, i get the following exception:
  java.io.IOException: Task process exit with nonzero status of 255.
 at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:424)

 java.io.IOException: Task process exit with nonzero status of 255.
 at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:424)

 what could be the reason?

 Thanks,
 Raakhi




-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422


  1   2   >