Re: HDFS is not loading evenly across all nodes.

2009-06-18 Thread Aaron Kimball
Did you run the dfs put commands from the master node?  If you're inserting
into HDFS from a machine running a DataNode, the local datanode will always
be chosen as one of the three replica targets. For more balanced loading,
you should use an off-cluster machine as the point of origin.

If you experience uneven block distribution, you should also rebalance your
cluster periodically by running bin/start-balancer.sh. It will work in the
background to move blocks from heavily laden nodes to underutilized ones.
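
For example, from the Hadoop install directory (the optional -threshold
argument is, as I recall, the target per-node utilization spread in percent;
treat that as an assumption and check the balancer docs for your version):

bin/start-balancer.sh -threshold 10   # kick off rebalancing in the background
bin/stop-balancer.sh                  # stop it early if needed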

- Aaron

On Thu, Jun 18, 2009 at 12:57 PM, openresearch 
qiming...@openresearchinc.com wrote:


 Hi all

 I dfs put a large dataset onto a 10-node cluster.

 When I observe the Hadoop progress (via web:50070) and each local file
 system (via df -k),
 I notice that my master node is hit 5-10 times harder than the others, so its
 hard drive fills up quicker than the others. During last night's load, it
 actually crashed when the hard drive was full.

 To my understanding, data should wrap around all nodes evenly (in a
 round-robin fashion using 64M as a unit).

 Is it expected behavior of Hadoop? Can anyone suggest a good
 troubleshooting
 way?

 Thanks


 --
 View this message in context:
 http://www.nabble.com/HDFS-is-not-loading-evenly-across-all-nodes.-tp24099585p24099585.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: HDFS is not loading evenly across all nodes.

2009-06-18 Thread Aaron Kimball
As an addendum, running a DataNode on the same machine as a NameNode is
generally considered a bad idea because it hurts the NameNode's ability to
maintain high throughput.

- Aaron

On Thu, Jun 18, 2009 at 1:26 PM, Aaron Kimball aa...@cloudera.com wrote:

 Did you run the dfs put commands from the master node?  If you're inserting
 into HDFS from a machine running a DataNode, the local datanode will always
 be chosen as one of the three replica targets. For more balanced loading,
 you should use an off-cluster machine as the point of origin.

 If you experience uneven block distribution, you should also rebalance your
 cluster periodically by running bin/start-balancer.sh. It will work in the
 background to move blocks from heavily laden nodes to underutilized ones.

 - Aaron


 On Thu, Jun 18, 2009 at 12:57 PM, openresearch 
 qiming...@openresearchinc.com wrote:


 Hi all

 I dfs put a large dataset onto a 10-node cluster.

 When I observe the Hadoop progress (via web:50070) and each local file
 system (via df -k),
 I notice that my master node is hit 5-10 times harder than the others, so its
 hard drive fills up quicker than the others. During last night's load, it
 actually crashed when the hard drive was full.

 To my understanding, data should wrap around all nodes evenly (in a
 round-robin fashion using 64M as a unit).

 Is it expected behavior of Hadoop? Can anyone suggest a good
 troubleshooting
 way?

 Thanks


 --
 View this message in context:
 http://www.nabble.com/HDFS-is-not-loading-evenly-across-all-nodes.-tp24099585p24099585.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.





Re: Trying to setup Cluster

2009-06-18 Thread Aaron Kimball
Are you encountering specific problems?

I don't think that hadoop's config files will evaluate environment
variables. So $HADOOP_HOME won't be interpreted correctly.

For passwordless ssh, see
http://rcsg-gsir.imsb-dsgi.nrc-cnrc.gc.ca/documents/internet/node31.html or
just check the manpage for ssh-keygen.
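
The short version is something like the following (a typical recipe; the
user and host names are placeholders):

ssh-keygen -t rsa -P ""                 # accept the default key location
cat ~/.ssh/id_rsa.pub | ssh user@slave 'cat >> ~/.ssh/authorized_keys'
ssh user@slave                          # should now log in without a password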

- Aaron

On Wed, Jun 17, 2009 at 9:30 AM, Divij Durve divij.t...@gmail.com wrote:

 I'm trying to set up a cluster with 3 different machines running Fedora. I
 can't get them to log into localhost without a password, but that's the
 least of my worries at the moment.

 I am posting my config files and the master and slave files; let me know if
 anyone can spot a problem with the configs...


 Hadoop-site.xml
 <?xml version="1.0"?>
 <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

 <!-- Put site-specific property overrides in this file. -->

 <configuration>
 <property>
   <name>dfs.data.dir</name>
   <value>$HADOOP_HOME/dfs-data</value>
   <final>true</final>
 </property>

 <property>
   <name>dfs.name.dir</name>
   <value>$HADOOP_HOME/dfs-name</value>
   <final>true</final>
 </property>

 <property>
   <name>hadoop.tmp.dir</name>
   <value>$HADOOP_HOME/hadoop-tmp</value>
   <description>A base for other temporary directories.</description>
 </property>

 <property>
   <name>fs.default.name</name>
   <value>hdfs://gobi.something.something:54310</value>
   <description>The name of the default file system.  A URI whose
   scheme and authority determine the FileSystem implementation.  The
   uri's scheme determines the config property (fs.SCHEME.impl) naming
   the FileSystem implementation class.  The uri's authority is used to
   determine the host, port, etc. for a FileSystem.</description>
 </property>

 <property>
   <name>mapred.job.tracker</name>
   <value>kalahari.something.something:54311</value>
   <description>The host and port that the MapReduce job tracker runs
   at.  If "local", then jobs are run in-process as a single map
   and reduce task.
   </description>
 </property>

 <property>
   <name>mapred.system.dir</name>
   <value>$HADOOP_HOME/mapred-system</value>
   <final>true</final>
 </property>

 <property>
   <name>dfs.replication</name>
   <value>1</value>
   <description>Default block replication.
   The actual number of replications can be specified when the file is created.
   The default is used if replication is not specified in create time.
   </description>
 </property>

 <property>
   <name>mapred.local.dir</name>
   <value>$HADOOP_HOME/mapred-local</value>
   <name>dfs.replication</name>
   <value>1</value>
 </property>

 </configuration>


 Slave:
 kongur.something.something

 master:
 kalahari.something.something

 I execute the dfs-start.sh command from gobi.something.something.

 Is there any other info that I should provide in order to help? Also, Kongur
 is where I'm running the data node; the master file on Kongur should have
 localhost in it, right? Thanks for the help.

 Divij



Re: Could I collect results from Map-Reduce then output myself ?

2009-06-15 Thread Aaron Kimball
If you can make the decision locally, then it should just be performed in
the reducer itself:

if (guard) {
  output.collect(k, v);
}


If you need to know what results will be generated by other calls to
reduce() on that same machine, then you'll need to be a bit more clever. If
you know that for all jobs you'll run, your results will always fit in a
buffer in RAM, then you can put your values in an ArrayList or something and
then override Reducer.close() to dump your values into the output collector.
Then call super.close().
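
A rough sketch of that buffering approach with the old (org.apache.hadoop.mapred)
API is below; the Text key/value types and the emit-everything "filter" are
placeholders you would replace with your own logic:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class BufferingReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  // Buffered results; this assumes they always fit in RAM.
  private List<Text[]> buffered = new ArrayList<Text[]>();
  private OutputCollector<Text, Text> out;

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    this.out = output;  // remember the collector for use in close()
    while (values.hasNext()) {
      // Copy key and value; Hadoop reuses the underlying objects.
      buffered.add(new Text[] { new Text(key), new Text(values.next()) });
    }
  }

  public void close() throws IOException {
    // Decide here which buffered results you actually want, then emit them.
    for (Text[] kv : buffered) {
      out.collect(kv[0], kv[1]);
    }
    super.close();
  }
}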

If you may need to generate more data than will fit in RAM, or you need the
results from multiple nodes to be considered together, then you almost
certainly want a second MapReduce pass. Your first pass should collect() all
the results it generates. Then in a second pass, use an identity mapper that
causes the shuffler to sort the data along some axis so that the most
desirable data comes first. Then output.collect() this data a second time in
the second reducer, discarding the data that doesn't meet your criterion.
The input path to your second MR is the output path from the first one.

- Aaron

On Sun, Jun 14, 2009 at 4:02 PM, Kunsheng Chen ke...@yahoo.com wrote:


 Hi everyone,

 I am doing a map-reduce program, and it is working well.

 Now I am thinking of inserting my own algorithm to pick the output results
 after 'Reduce', rather than simply using 'output.collect()' in Reduce to output
 all results.

 The only thing I can think of is to read the output file after the JobClient
 finishes and post-process it with some Java program, but I am not sure whether
 there is a more efficient method provided by Hadoop to handle that.


 Any idea is well appreciated,

 -Kun






Re: how to transfer data from one reduce to another map

2009-06-15 Thread Aaron Kimball
You can add multiple paths to the same job -- FileInputFormat.addInputPath()
can be called multiple times.
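
For example, with the old JobConf-based API (MyJob and the paths are made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MyJob.class);
FileInputFormat.addInputPath(conf, new Path("/user/me/job1-output"));
FileInputFormat.addInputPath(conf, new Path("/user/me/other-data"));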

- Aaron

On Mon, Jun 15, 2009 at 6:56 AM, bharath vissapragada 
bharathvissapragada1...@gmail.com wrote:

 if your doubt is related to chaining of mapreduce jobs .. then this link
 might be useful ...

 http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining

 On Mon, Jun 15, 2009 at 7:22 PM, HRoger hanxianiyongro...@163.com wrote:

 
  I'm sorry for my confusing description. Job2 has to use two input sources:
  one from job1's output and another from elsewhere.
 
  TimRobertson100 wrote:
  
   Hi
  
   I am not sure I understand the question correctly.  If you mean you
   want to use the output of Job1 as the input of Job2, then you can set
   the input path to the second job as the output path (e.g. output
   directory) from the first job.
  
   Cheers
  
   Tim
  
  
   On Mon, Jun 15, 2009 at 3:30 PM, HRogerhanxianyongro...@163.com
 wrote:
  
   Hi!
   I am writing an application which has two jobs: the second job uses the same
   input data source as the first job, plus the output (some objects) of the
   first job. Can I transfer some objects from one job to another job, or make
   a job take two input sources?
   --
   View this message in context:
  
 
 http://www.nabble.com/how-to-transfer-data-from-one-reduce-to-another-map-tp24034706p24034706.html
   Sent from the Hadoop core-user mailing list archive at Nabble.com.
  
  
  
  
 
  --
  View this message in context:
 
 http://www.nabble.com/how-to-transfer-data-from-one-reduce-to-another-map-tp24034706p24035057.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 



Re: Debugging Map-Reduce programs

2009-06-15 Thread Aaron Kimball
On Mon, Jun 15, 2009 at 10:01 AM, bharath vissapragada 
bhara...@students.iiit.ac.in wrote:

 Hi all ,

 When running hadoop in local mode .. can we use print statements to print
 something to the terminal ...


Yes. In distributed mode, each task will write its stdout/stderr to files
which you can access through the web-based interface.
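
For example, a print statement like the one in this minimal mapper (old API;
the types are arbitrary) shows up on your terminal in local mode and in the
task logs in distributed mode:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class DebugMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Console in local mode; per-task stdout/stderr logs in distributed mode.
    System.err.println("map() called at offset " + key);
    output.collect(new Text("line"), value);
  }
}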



 Also I am not sure whether the program is reading my input files ... If I
 keep print statements, they aren't displaying anything ... can anyone tell me
 how to solve this problem?


Is it generating exceptions? Are the files present? If you're running in
local mode, you can use a debugger; set a breakpoint in your map() method
and see if it gets there. How are you configuring the input files for your
job?




 Thanks in adance,



Re: Help - ClassNotFoundException from tasktrackers

2009-06-10 Thread Aaron Kimball
mapred.local.dir can definitely work with multiple paths, but it's a bit
strict about the format. You should have one or more paths, separated by
commas and no whitespace. e.g. /disk1/mapred,/disk2/mapred. If you've got
them on new lines, etc, then it might try to interpret that as part of one
of the paths. Could this be your issue? If you paste your hadoop-site.xml in
to an email we could take a look.
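
For reference, the format I mean looks like this (the mount points are made up):

  <property>
    <name>mapred.local.dir</name>
    <value>/disk1/mapred/local,/disk2/mapred/local</value>
  </property>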

- Aaron

On Wed, Jun 10, 2009 at 2:41 AM, Harish Mallipeddi 
harish.mallipe...@gmail.com wrote:

 Hi,

 After some amount of debugging, I narrowed down the problem. I'd set
 multiple paths (multiple disks mounted at different endpoints) for the *
 mapred.local.dir* setting. When I changed this to just one path, everything
 started working - previously the job jar and the job xml weren't being
 created under the local/taskTracker/jobCache folder.

 Any idea why? Am I doing something wrong? I've read somewhere else that
 specifying multiple disks for mapred.local.dir is important to increase
 disk
 bandwidth so that the map outputs get written faster to the local disks on
 the tasktracker nodes.

 Cheers,
 Harish

 On Tue, Jun 9, 2009 at 5:21 PM, Harish Mallipeddi 
 harish.mallipe...@gmail.com wrote:

  I setup a new cluster (1 namenode + 2 datanodes). I'm trying to run the
  GenericMRLoadGenerator program from hadoop-0.20.0-test.jar. But I keep
  getting the ClassNotFoundException. Any reason why this would be
 happening?
  It seems to me like the tasktrackers cannot find the class files from the
  hadoop program but when you do hadoop jar, it will automatically ship
 the
  job jar file to all the tasktracker nodes and put them in the classpath?
 
  $ bin/hadoop jar hadoop-0.20.0-test.jar loadgen
  -Dtest.randomtextwrite.bytes_per_map=1048576
  -Dhadoop.sort.reduce.keep.percent=50.0 -outKey org.apache.hadoop.io.Text
  -outValue org.apache.hadoop.io.Text
 
  No input path; ignoring InputFormat
  Job started: Tue Jun 09 03:16:30 PDT 2009
  09/06/09 03:16:30 INFO mapred.JobClient: Running job:
 job_200906090316_0001
  09/06/09 03:16:31 INFO mapred.JobClient:  map 0% reduce 0%
  09/06/09 03:16:40 INFO mapred.JobClient: Task Id :
  attempt_200906090316_0001_m_00_0, Status : FAILED
  java.io.IOException: Split class
 
 org.apache.hadoop.mapred.GenericMRLoadGenerator$IndirectInputFormat$IndirectSplit
  not found
  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:324)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
  at org.apache.hadoop.mapred.Child.main(Child.java:170)
  *Caused by: java.lang.ClassNotFoundException:
 
 org.apache.hadoop.mapred.GenericMRLoadGenerator$IndirectInputFormat$IndirectSplit
  *
  at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
  at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
  at java.lang.Class.forName0(Native Method)
  at java.lang.Class.forName(Class.java:247)
  at
 
 org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:761)
  at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:321)
  ... 2 more
  [more of the same exception from different map tasks...I've removed them]
 
  Actually the same thing happens even if I run the PiEstimator program
 from
  hadoop-0.20.0-examples.jar.
 
  Thanks,
 
  --
  Harish Mallipeddi
  http://blog.poundbang.in
 



 --
 Harish Mallipeddi
 http://blog.poundbang.in



Re: Hadoop benchmarking

2009-06-10 Thread Aaron Kimball
Hi Stephen,

That will set the maximum heap allowable, but doesn't tell Hadoop's internal
systems necessarily to take advantage of it. There's a number of other
settings that adjust performance. At Cloudera we have a config tool that
generates Hadoop configurations with reasonable first-approximation values
for your cluster -- check out http://my.cloudera.com and look at the
hadoop-site.xml it generates. If you start from there you might find a
better parameter space to explore. Please share back your findings -- we'd
love to tweak the tool even more with some external feedback :)

- Aaron


On Wed, Jun 10, 2009 at 7:39 AM, stephen mulcahy
stephen.mulc...@deri.org wrote:

 Hi,

 I'm currently doing some testing of different configurations using the
 Hadoop Sort as follows,

 bin/hadoop jar hadoop-*-examples.jar randomwriter
 -Dtest.randomwrite.total_bytes=107374182400 /benchmark100

 bin/hadoop jar hadoop-*-examples.jar sort /benchmark100 rand-sort

 The only changes I've made from the standard config are the following in
 conf/mapred-site.xml

   <property>
     <name>mapred.child.java.opts</name>
     <value>-Xmx1024M</value>
   </property>

   <property>
     <name>mapred.tasktracker.map.tasks.maximum</name>
     <value>8</value>
   </property>

   <property>
     <name>mapred.tasktracker.reduce.tasks.maximum</name>
     <value>4</value>
   </property>

 I'm running this on 4 systems, each with 8 processor cores and 4 separate
 disks.

 Is there anything else I should change to stress memory more? The systems
 in questions have 16GB of memory but the most thats getting used during a
 run of this benchmark is about 2GB (and most of that seems to be os
 caching).

 Thanks,

 -stephen

 --
 Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
 NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
 http://di2.deri.iehttp://webstar.deri.iehttp://sindice.com



Re: Large size Text file split

2009-06-10 Thread Aaron Kimball
The FileSplit boundaries are rough edges -- the mapper responsible for the
previous split will continue until it finds a full record, and the next
mapper will read ahead and only start on the first record boundary after the
byte offset.
- Aaron

On Wed, Jun 10, 2009 at 7:53 PM, Wenrui Guo wenrui@ericsson.com wrote:

 I think the default TextInputFormat can meet my requirement. However,
 even though the JavaDoc of TextInputFormat says that TextInputFormat
 divides the input file into text lines ending with CRLF, I'd like to
 know what will eventually happen if the FileSplit size is not N times
 the line length.

 BR/anderson

 -Original Message-
 From: jason hadoop [mailto:jason.had...@gmail.com]
 Sent: Wednesday, June 10, 2009 8:39 PM
 To: core-user@hadoop.apache.org
 Subject: Re: Large size Text file split

 There is always NLineInputFormat. You specify the number of lines per
 split.
 The key is the position of the line start in the file, value is the line
 itself.
 The parameter mapred.line.input.format.linespermap controls the number
 of lines per split

 On Wed, Jun 10, 2009 at 5:27 AM, Harish Mallipeddi 
 harish.mallipe...@gmail.com wrote:

  On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo wenrui@ericsson.com
  wrote:
 
   Hi, all
  
    I have a large CSV file (larger than 10 GB). I'd like to use a
    certain InputFormat to split it into smaller parts so each Mapper
    can deal with a piece of the CSV file. However, as far as I know,
    FileInputFormat only cares about the byte size of the file; that is, the
    class may divide the CSV file into many parts, and maybe some part is
    not a well-formed CSV file.
    For example, one line of the CSV file is not terminated with CRLF,
    or maybe some text is trimmed.
  
   How to ensure each FileSplit is a smaller valid CSV file using a
   proper InputFormat?
  
   BR/anderson
  
 
   If all you care about is the splits occurring at line boundaries, then
   TextInputFormat will work.
 
  http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapre
  d/TextInputFormat.html
 
  If not I guess you can write your own InputFormat class.
 
  --
  Harish Mallipeddi
  http://blog.poundbang.in
 



 --
 Pro Hadoop, a book to guide you from beginner to hadoop mastery,
 http://www.apress.com/book/view/9781430219422
 www.prohadoopbook.com a community for Hadoop Professionals



Re: Hadoop scheduling question

2009-06-05 Thread Aaron Kimball
Hi Kristi :)

JobControl, as I understand it, runs on the client's machine and sends Job
objects to the JobTracker when their dependencies are all met. So once the
dependencies are met, they go into the JobTracker's normal scheduler queue.

There are three different scheduler implementations that you can choose
from. All of these support multiple concurrent jobs, if there are more task
slots available than tasks. So if a job contains 50 map tasks, and there are
64 map task slots (e.g., 8 machines with mapred.tasktracker.map.tasks.maximum == 8)
available, then the whole job will be scheduled, then the first 14 tasks
from the next job can run simultaneously.

The schedulers differ in what happens when a cluster is already loaded up
with work.

The default scheduler is the FIFO scheduler. Jobs are in a queue and all
tasks from job 1 will be satisfied before any tasks from job 2 are
dispatched to task trackers.

There's another scheduler called the FairScheduler. The user configures a
set of pools on the cluster. Each job, when created, is bound to a
specific pool. Each pool is guaranteed a minimum share of the scheduler.
e.g., my pool may guarantee me a minimum of 3 map task slots and 1 reduce
slot. If the whole cluster is loaded up, then the scheduler will prioritise
my tasks up in its dispatch line until I'm getting my 3 task minimum. If
there are multiple jobs which are tagged with different pools, then the fair
scheduler's algorithm balances between them. The aaron pool may have a 3
task minimum. The bob pool may have a 6 task minimum. There might be many
more task slots available than 9. If aaron and bob both submit jobs at once,
then aaron will get 3/9 of the available task slots. bob will get 6/9.  But
all tasks from a given user are usually put in the same pool, so they will
run in the same shared set of resources. Multiple tasks in a pool can run in
parallel if the pool has slots available.
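
For reference, a hedged sketch of what a fair scheduler allocation file (the
file named by mapred.fairscheduler.allocation.file) might look like; treat
the exact element names as an assumption and check the contrib fair scheduler
docs for your version:

<?xml version="1.0"?>
<allocations>
  <!-- Guarantee the "aaron" pool 3 map slots and 1 reduce slot. -->
  <pool name="aaron">
    <minMaps>3</minMaps>
    <minReduces>1</minReduces>
  </pool>
  <!-- Guarantee the "bob" pool 6 map slots and 2 reduce slots. -->
  <pool name="bob">
    <minMaps>6</minMaps>
    <minReduces>2</minReduces>
  </pool>
</allocations>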

Finally, there's a third scheduler called the Capacity scheduler. It's
similar to the fair scheduler, in that it allows guarantees of minimum
availability for different pools. I don't know how it apportions additional
extra resources though -- this is the one I'm least familiar with. Someone
else will have to chime in here.

- Aaron

On Fri, Jun 5, 2009 at 12:19 PM, Scott Carey sc...@richrelevance.com wrote:

 Even more general context:
 Cascading does something similar, but I am not sure if it uses Hadoop's
 JobControl or manages dependencies itself.  It definitely runs multiple
 jobs
 in parallel when the dependencies allow it.



 On 6/5/09 11:44 AM, Alan Gates ga...@yahoo-inc.com wrote:

  To add a little context, Pig uses Hadoop's JobControl to schedule it's
  jobs.  Pig defines the dependencies between jobs in JobControl, and
  then submits the entire graph of jobs.  So, using JobControl, does
  Hadoop schedule jobs serially or in parallel (assuming no dependencies)?
 
  Alan.
 
  On Jun 5, 2009, at 10:50 AM, Kristi Morton wrote:
 
  Hi Pankil,
 
  Sorry about having to send my question email twice to the list...
  the first time I sent it I had forgotten to subscribe to the list.
  I resent it after subscribing, and your response to the first email
  I sent did not make it into my inbox.  I saw your response on the
  archives list.
 
  So, to recap, you said:
 
  We are not able to carry out all joins in a single job..we also
  tried our hadoop code using
  Pig scripts and found that for each join in PIG script new job is
  used.So
  basically what i think its a sequential process to handle typesof
  join where
  output of one job is required s an input to other one.
 
 
  I, too, have seen this sequential behavior with joins.  However, it
  seems like it could be possible for there to be two jobs executing
  in parallel whose output is the input to the subsequent job.  Is
  this possible or are all jobs scheduled sequentially?
 
  Thanks,
  Kristi
 
 
 




Re: Command-line jobConf options in 0.18.3

2009-06-04 Thread Aaron Kimball
Can you give an example of the exact arguments you're sending on the command
line?
- Aaron

On Wed, Jun 3, 2009 at 5:46 PM, Ian Soboroff ian.sobor...@nist.gov wrote:

 If after I call getConf to get the conf object, I manually add the
 key/value pair, it's there when I need it.  So it feels like ToolRunner
 isn't parsing my args for some reason.

 Ian


 On Jun 3, 2009, at 8:45 PM, Ian Soboroff wrote:

  Yes, and I get the JobConf via 'JobConf job = new JobConf(conf,
 the.class)'.  The conf is the Configuration object that comes from getConf.
  Pretty much copied from the WordCount example (which this program used to
 be a long while back...)

 thanks,
 Ian

 On Jun 3, 2009, at 7:09 PM, Aaron Kimball wrote:

  Are you running your program via ToolRunner.run()? How do you instantiate
 the JobConf object?
 - Aaron

 On Wed, Jun 3, 2009 at 10:19 AM, Ian Soboroff ian.sobor...@nist.gov
 wrote:
 I'm backporting some code I wrote for 0.19.1 to 0.18.3 (long story), and
 I'm finding that when I run a job and try to pass options with -D on the
 command line, that the option values aren't showing up in my JobConf.  I
 logged all the key/value pairs in the JobConf, and the option I passed
 through with -D isn't there.

 This worked in 0.19.1... did something change with command-line options
 from 18 to 19?

 Thanks,
 Ian







Re: How do I convert DataInput and ResultSet to array of String?

2009-06-04 Thread Aaron Kimball
e.g. for readFields(),

List<String> myItems = new ArrayList<String>();
int numItems = dataInput.readInt();
for (int i = 0; i < numItems; i++) {
  myItems.add(Text.readString(dataInput));
}

then on the serialization (write) side, send:

dataOutput.writeInt(myItems.size());
for (int i = 0; i < myItems.size(); i++) {
  Text.writeString(dataOutput, myItems.get(i));
}

You should look at the source code for ArrayWritable, IntWritable, and Text
-- specifically their write() and readFields() methods -- to get a feel for
how to write such methods for your own types.

- Aaron


On Wed, Jun 3, 2009 at 4:19 PM, dealmaker vin...@gmail.com wrote:


 Would you provide a code sample of it?  I don't know how to do serializer
 in
 hadoop.

 I am using the following class as type of my value object:

  private static class StringArrayWritable extends ArrayWritable {
private StringArrayWritable (String [] aSString) {
  super (aSString);
}
  }

 Thanks.


 Aaron Kimball-3 wrote:
 
  The text serializer will pull out an entire string by using a null
  terminator at the end.
 
  If you need to know the number of string objects, though, you'll have to
  serialize that before the strings, then use a for loop to decode the rest
  of
  them.
  - Aaron
 
  On Tue, Jun 2, 2009 at 6:01 PM, dealmaker vin...@gmail.com wrote:
 
 
  Thanks.  The number of elements in this array of String is unknown until
  run
  time.  If datainput treats it as a byte array, I still have to know the
  size
  of each String.  How do I do that?   Would you suggest some code samples
  or
  links that deal with similar situation like this?  The only examples I
  got
  are the ones about counting number of words which deal with integers.
  Thanks.
 
 
  Aaron Kimball-3 wrote:
  
   Hi,
  
   You can't just turn either of these two types into arrays of strings
   automatically, because they are interfaces to underlying streams of
  data.
   You are required to know what protocol you are implementing -- i.e.,
  how
   many fields you are transmitting -- and manually read through that
 many
   fields yourself. For example, a DataInput object is effectively a
  pointer
   into a byte array. There may be many records in that byte array, but
  you
   only want to read the fields of the first record out.
  
   For DataInput / DataOutput, you can UTF8-decode the next field by
  calling
   Text.readString(dataInput) and Text.writeString(dataOutput).
   For ResultSet, you want resultSet.getString(fieldNum)
  
   As (yet another) shameless plug ( :smile: ), check out the tool we
 just
   released, which automates database import tasks. It auto-generates the
   classes necessary for your tables, too.
   http://www.cloudera.com/blog/2009/06/01/introducing-sqoop/
  
   At the very least, you might want to play with it a bit and read its
   source
   code so you have a better idea of how to implement your own class
  (since
   you're doing some more creative stuff like building up associative
  arrays
   for each field).
  
   Cheers,
   - Aaron
  
   On Mon, Jun 1, 2009 at 9:53 PM, dealmaker vin...@gmail.com wrote:
  
  
   bump.  Does anyone know?
  
   I am using the following class of arraywritable:
  
private static class StringArrayWritable extends ArrayWritable {
  private StringArrayWritable (String [] aSString) {
super (aSString);
   }
}
  
  
   dealmaker wrote:
   
Hi,
  How do I convert DataInput to array of String?
  How do I convert ResultSet to array of String?
Thanks.  Following is the code:
   
  static class Record implements Writable, DBWritable {
    String [] aSAssoc;

    public void write(DataOutput arg0) throws IOException {
      throw new UnsupportedOperationException("Not supported yet.");
    }

    public void readFields(DataInput in) throws IOException {
      this.aSAssoc = // How to convert DataInput to String Array?
    }

    public void write(PreparedStatement arg0) throws SQLException {
      throw new UnsupportedOperationException("Not supported yet.");
    }

    public void readFields(ResultSet rs) throws SQLException {
      this.aSAssoc = // How to convert ResultSet to String Array?
    }
  }
   
   
  
   --
   View this message in context:
  
 
 http://www.nabble.com/How-do-I-convert-DataInput-and-ResultSet-to-array-of-String--tp23770747p23826464.html
   Sent from the Hadoop core-user mailing list archive at Nabble.com.
  
  
  
  
 
  --
  View this message in context:
 
 http://www.nabble.com/How-do-I-convert-DataInput-and-ResultSet-to-array-of-String--tp23770747p23843679.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 
 
 

 --
 View this message in context:
 http://www.nabble.com/How-do-I-convert-DataInput-and-ResultSet-to-array-of-String--tp23770747p23861270.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: Do I need to implement Readfields and Write Functions If I have Only One Field?

2009-06-04 Thread Aaron Kimball
If you don't add any member fields, then no, I don't think you need to
change anything.
- Aaron

On Wed, Jun 3, 2009 at 4:11 PM, dealmaker vin...@gmail.com wrote:


 I have the following as my type of my value object.  Do I need to
 implement
 readfields and write functions?

  private static class StringArrayWritable extends ArrayWritable {
private StringArrayWritable (String [] aSString) {
  super (aSString);
 }
  }


 Aaron Kimball-3 wrote:
 
   If you can use an existing serializable type to hold that field (e.g.,
 if
  it's an integer, then use IntWritable) then you can just get away with
  that.
  If you are specifying your own class for a key or value class, then yes,
  the
  class must implement readFields() and write().
 
  There's no concept of introspection or other magic to determine how to
  serialize types by decomposing them into more primitive types - you must
  manually insert the logic for this in every class you use.
 
  - Aaron
 
  On Wed, Jun 3, 2009 at 2:41 PM, dealmaker vin...@gmail.com wrote:
 
 
  Hi,
   I have only one field for the record.  I wonder if I even need to
 define
  a
  member variable in class Record.  Do I even need to implement readfields
  and
  write functions if I have only one field?  Can I just use the value
  object
  directly instead of a member variable of value object?
  Thanks.
  --
  View this message in context:
 
 http://www.nabble.com/Do-I-need-to-implement-Readfields-and-Write-Functions-If-I-have-Only-One-Field--tp23860009p23860009.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 
 
 

 --
 View this message in context:
 http://www.nabble.com/Do-I-need-to-implement-Readfields-and-Write-Functions-If-I-have-Only-One-Field--tp23860009p23861181.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: Subscription

2009-06-04 Thread Aaron Kimball
You need to send a message to core-user-subscr...@hadoop.apache.org from the
address you want registered.
See http://hadoop.apache.org/core/mailing_lists.html

- Aaron

On Thu, Jun 4, 2009 at 12:10 PM, Akhil langer akhilan...@gmail.com wrote:

 Please, add me to the hadoop-core user mailing list.

 email address: *akhilan...@gmail.com*

 Thank You!
 Akhil



Re: How do I convert DataInput and ResultSet to array of String?

2009-06-03 Thread Aaron Kimball
The text serializer will pull out an entire string by using a null
terminator at the end.

If you need to know the number of string objects, though, you'll have to
serialize that before the strings, then use a for loop to decode the rest of
them.
- Aaron

On Tue, Jun 2, 2009 at 6:01 PM, dealmaker vin...@gmail.com wrote:


 Thanks.  The number of elements in this array of String is unknown until
 run
 time.  If datainput treats it as a byte array, I still have to know the
 size
 of each String.  How do I do that?   Would you suggest some code samples or
 links that deal with similar situation like this?  The only examples I got
 are the ones about counting number of words which deal with integers.
 Thanks.


 Aaron Kimball-3 wrote:
 
  Hi,
 
  You can't just turn either of these two types into arrays of strings
  automatically, because they are interfaces to underlying streams of data.
  You are required to know what protocol you are implementing -- i.e., how
  many fields you are transmitting -- and manually read through that many
  fields yourself. For example, a DataInput object is effectively a pointer
  into a byte array. There may be many records in that byte array, but you
  only want to read the fields of the first record out.
 
  For DataInput / DataOutput, you can UTF8-decode the next field by calling
  Text.readString(dataInput) and Text.writeString(dataOutput).
  For ResultSet, you want resultSet.getString(fieldNum)
 
  As (yet another) shameless plug ( :smile: ), check out the tool we just
  released, which automates database import tasks. It auto-generates the
  classes necessary for your tables, too.
  http://www.cloudera.com/blog/2009/06/01/introducing-sqoop/
 
  At the very least, you might want to play with it a bit and read its
  source
  code so you have a better idea of how to implement your own class (since
  you're doing some more creative stuff like building up associative arrays
  for each field).
 
  Cheers,
  - Aaron
 
  On Mon, Jun 1, 2009 at 9:53 PM, dealmaker vin...@gmail.com wrote:
 
 
  bump.  Does anyone know?
 
  I am using the following class of arraywritable:
 
   private static class StringArrayWritable extends ArrayWritable {
 private StringArrayWritable (String [] aSString) {
   super (aSString);
  }
   }
 
 
  dealmaker wrote:
  
   Hi,
 How do I convert DataInput to array of String?
 How do I convert ResultSet to array of String?
   Thanks.  Following is the code:
  
  static class Record implements Writable, DBWritable {
    String [] aSAssoc;

    public void write(DataOutput arg0) throws IOException {
      throw new UnsupportedOperationException("Not supported yet.");
    }

    public void readFields(DataInput in) throws IOException {
      this.aSAssoc = // How to convert DataInput to String Array?
    }

    public void write(PreparedStatement arg0) throws SQLException {
      throw new UnsupportedOperationException("Not supported yet.");
    }

    public void readFields(ResultSet rs) throws SQLException {
      this.aSAssoc = // How to convert ResultSet to String Array?
    }
  }
  
  
 
  --
  View this message in context:
 
 http://www.nabble.com/How-do-I-convert-DataInput-and-ResultSet-to-array-of-String--tp23770747p23826464.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 
 
 

 --
 View this message in context:
 http://www.nabble.com/How-do-I-convert-DataInput-and-ResultSet-to-array-of-String--tp23770747p23843679.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: Do I need to implement Readfields and Write Functions If I have Only One Field?

2009-06-03 Thread Aaron Kimball
If you can use an existing serializable type to hold that field (e.g., if
it's an integer, then use IntWritable) then you can just get away with that.
If you are specifying your own class for a key or value class, then yes, the
class must implement readFields() and write().

There's no concept of introspection or other magic to determine how to
serialize types by decomposing them into more primitive types - you must
manually insert the logic for this in every class you use.

- Aaron

On Wed, Jun 3, 2009 at 2:41 PM, dealmaker vin...@gmail.com wrote:


 Hi,
  I have only one field for the record.  I wonder if I even need to define a
 member variable in class Record.  Do I even need to implement readfields
 and
 write functions if I have only one field?  Can I just use the value
 object
 directly instead of a member variable of value object?
 Thanks.
 --
 View this message in context:
 http://www.nabble.com/Do-I-need-to-implement-Readfields-and-Write-Functions-If-I-have-Only-One-Field--tp23860009p23860009.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: Hadoop ReInitialization.

2009-06-03 Thread Aaron Kimball
You can block for safemode exit by running 'hadoop dfsadmin -safemode wait'
rather than sleeping for an arbitrary amount of time.

More generally, I'm a bit confused what you mean by all this. Hadoop daemons
may individually crash, but you should never need to reformat HDFS and start
from scratch. If you're doing this, that means that you're probably sticking
some important hadoop files in a temp dir that's getting cleaned out or
something of the like. Are dfs.data.dir and dfs.name.dir suitably
well-protected from tmpwatch or other such housekeeping programs?
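
For example, pointing them somewhere outside /tmp in hadoop-site.xml (the
paths below are made up):

  <property>
    <name>dfs.name.dir</name>
    <value>/var/lib/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/var/lib/hadoop/dfs/data</value>
  </property>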

- Aaron

On Wed, Jun 3, 2009 at 4:50 AM, Steve Loughran ste...@apache.org wrote:

 b wrote:

  But after formatting and starting DFS I need to wait some time (sleep 60)
 before putting data into HDFS. Otherwise I will receive
 NotReplicatedYetException.


 that means the namenode is up but there aren't enough workers yet.



Re: Command-line jobConf options in 0.18.3

2009-06-03 Thread Aaron Kimball
Are you running your program via ToolRunner.run()? How do you instantiate
the JobConf object?
- Aaron

On Wed, Jun 3, 2009 at 10:19 AM, Ian Soboroff ian.sobor...@nist.gov wrote:

 I'm backporting some code I wrote for 0.19.1 to 0.18.3 (long story), and
 I'm finding that when I run a job and try to pass options with -D on the
 command line, that the option values aren't showing up in my JobConf.  I
 logged all the key/value pairs in the JobConf, and the option I passed
 through with -D isn't there.

 This worked in 0.19.1... did something change with command-line options
 from 18 to 19?

 Thanks,
 Ian




Re: How do I convert DataInput and ResultSet to array of String?

2009-06-02 Thread Aaron Kimball
Hi,

You can't just turn either of these two types into arrays of strings
automatically, because they are interfaces to underlying streams of data.
You are required to know what protocol you are implementing -- i.e., how
many fields you are transmitting -- and manually read through that many
fields yourself. For example, a DataInput object is effectively a pointer
into a byte array. There may be many records in that byte array, but you
only want to read the fields of the first record out.

For DataInput / DataOutput, you can UTF8-decode the next field by calling
Text.readString(dataInput) and Text.writeString(dataOutput).
For ResultSet, you want resultSet.getString(fieldNum)

As (yet another) shameless plug ( :smile: ), check out the tool we just
released, which automates database import tasks. It auto-generates the
classes necessary for your tables, too.
http://www.cloudera.com/blog/2009/06/01/introducing-sqoop/

At the very least, you might want to play with it a bit and read its source
code so you have a better idea of how to implement your own class (since
you're doing some more creative stuff like building up associative arrays
for each field).

Cheers,
- Aaron

On Mon, Jun 1, 2009 at 9:53 PM, dealmaker vin...@gmail.com wrote:


 bump.  Does anyone know?

 I am using the following class of arraywritable:

  private static class StringArrayWritable extends ArrayWritable {
private StringArrayWritable (String [] aSString) {
  super (aSString);
 }
  }


 dealmaker wrote:
 
  Hi,
How do I convert DataInput to array of String?
How do I convert ResultSet to array of String?
  Thanks.  Following is the code:
 
   static class Record implements Writable, DBWritable {
     String [] aSAssoc;

     public void write(DataOutput arg0) throws IOException {
       throw new UnsupportedOperationException("Not supported yet.");
     }

     public void readFields(DataInput in) throws IOException {
       this.aSAssoc = // How to convert DataInput to String Array?
     }

     public void write(PreparedStatement arg0) throws SQLException {
       throw new UnsupportedOperationException("Not supported yet.");
     }

     public void readFields(ResultSet rs) throws SQLException {
       this.aSAssoc = // How to convert ResultSet to String Array?
     }
   }
 
 

 --
 View this message in context:
 http://www.nabble.com/How-do-I-convert-DataInput-and-ResultSet-to-array-of-String--tp23770747p23826464.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: Pseudo-distributed mode problem

2009-06-02 Thread Aaron Kimball
Glad you got it sorted out
- Aaron

On Tue, Jun 2, 2009 at 12:49 AM, Vasyl Keretsman vasi...@gmail.com wrote:

 Sorry it was my fault.

 Instead of starting my job as bin/hadoop jar job.jar I ran it as
 bin/hadoop -cp job.jar.

 I thought it would be the same.

 Thanks anyway

 Vasyl

 2009/6/2 Aaron Kimball aa...@cloudera.com:
  Can you post the contents of your hadoop-site.xml file here?
  - Aaron
 
  On Sat, May 30, 2009 at 2:44 AM, Vasyl Keretsman vasi...@gmail.com
 wrote:
 
  Hi all,
 
  I am just getting started with hadoop 0.20 and trying to run a job in
  pseudo-distributed mode.
 
  I configured hadoop according to the tutorial, but it seems it does
  not work as expected.
 
  My map/reduce tasks are running sequencially and output result is
  stored on local filesystem instead of the dfs space.
  Job tracker does not see the running job at all.
  I have checked the logs but don't see any errors either. I have also
  copied some files manually to the dfs to make sure it works.
 
  The only difference between the manual and my configuration is that I
  had to change the ports for the job tracker and namenode as 9000 and
  9001 are already used by other apps on my workstation.
 
  Any hints?
 
  Thanks
 
  Regards,
 
  Vasyl
 
 



Re: Subdirectory question revisited

2009-06-02 Thread Aaron Kimball
There is no technical limit that prevents Hadoop from operating in this
fashion; it's simply the case that the included InputFormat implementations
do not do so. This behavior has been set in this fashion for a long time, so
it's unlikely that it will change soon, as that might break existing
applications.

But you can write your own subclass of TextInputFormat or
SequenceFileInputFormat that overrides the getSplits() method to recursively
descend through directories and search for files.

- Aaron

On Tue, Jun 2, 2009 at 1:22 PM, David Rosenstrauch dar...@darose.net wrote:

 As per a previous list question (
 http://mail-archives.apache.org/mod_mbox/hadoop-core-user/200804.mbox/%3ce75c02ef0804011433x144813e6x2450da7883de3...@mail.gmail.com%3e)
 it looks as though it's not possible for hadoop to traverse input
 directories recursively in order to discover input files.

 Just wondering a) if there's any particular reason why this functionality
 doesn't exist, and b) if not, if there's any workaround/hack to make it
 possible.

 Like the OP, I was thinking it would be helpful to partition my input data
 by year, month, and day.  I figured this would enable me to run jobs against
 specific date ranges of input data, and thereby speed up the execution of my
 jobs since they wouldn't have to process every single record.

 Any way to make this happen?  (Or am I totally going about this the wrong
 way for what I'm trying to achieve?)

 TIA,

 DR



Re: start-all.sh not work in hadoop-0.20.0

2009-06-01 Thread Aaron Kimball
This exception is logged with a severity level of INFO. I think this is a
relatively common exception. The JobTracker should just wait until the DFS
exits safemode and then clearing the system directory will proceed as usual.
So I don't think that this is killing your JobTracker.

Can you confirm that the JobTracker process, in fact, does die? If so,
there's probably an exception lower down in the log marked with a severity
level of ERROR or FATAL -- do you see any of those?

- Aaron


On Mon, Jun 1, 2009 at 7:22 AM, Nick Cen cenyo...@gmail.com wrote:

 Hi All,

 I can start Hadoop by typing start-dfs.sh and start-mapred.sh. But when
 I use start-all.sh, only HDFS is started. The JobTracker's log
 indicates there is an exception when starting the JobTracker.

 2009-06-01 22:06:59,675 INFO org.apache.hadoop.mapred.JobTracker: problem
 cleaning system directory:
 hdfs://localhost:54310/tmp/hadoop-nick/mapred/system
 org.apache.hadoop.ipc.RemoteException:
 org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot delete
 /tmp/hadoop-nick/mapred/system. Name node is in safe mode.
 The ratio of reported blocks 0. has not reached the threshold 0.9990.
 Safe mode will be turned off automatically.
at

 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.deleteInternal(FSNamesystem.java:1681)
at

 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.delete(FSNamesystem.java:1661)
at
 org.apache.hadoop.hdfs.server.namenode.NameNode.delete(NameNode.java:517)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

at org.apache.hadoop.ipc.Client.call(Client.java:739)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
at $Proxy4.delete(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at

 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at

 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy4.delete(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.delete(DFSClient.java:550)
at

 org.apache.hadoop.hdfs.DistributedFileSystem.delete(DistributedFileSystem.java:227)
at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:1637)
at
 org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:174)
at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:3528)


 Is this exception caused by there being no time delay between the start-dfs.sh
 and start-mapred.sh scripts? Thanks.


 --
 http://daily.appspot.com/food/



Re: what is the efficient way to implement InputFormat

2009-06-01 Thread Aaron Kimball
The major problem with this model is that an InputSplit is the unit of work
that comprises a map task. So no map tasks can be created before you know
how many InputSplits there are, and what their nature is. This means that
the creation of input splits is performed on the client. This is inherently
single-threaded (or at least, single-node); and if the client machine is
physically far away from the compute cluster, requires that the cluster pass
a lot of data to a single node. Given that MapReduce is designed to allow
for reading files in parallel via many data-local map tasks, this is
something of an anti-pattern. You want to delay reading the data files until
after the task starts.

To read through your data in a high-performance way, you're going to have to
rethink how you compute the input work units. Small tweaks to your data
format may be all you need.

If one of your two files has lines or other delimited records which
reference the other file, maybe you can use the regular TextInputFormat on
the first file, and then just do HDFS reads on the second file from within
your mappers.

Remember, the TextInputFormat doesn't magically know where newlines are. It
accepts that there's fuzziness. It creates splits based purely on block
boundaries, but the borders are a bit elastic. If a line continues past the
end of a block boundary, the previous task will keep reading to the end of
the line. And the next task seeks to the block boundary, but then doesn't
use any text until it finds a newline and has aligned on a new record.

If your data format can be modified so that you can put in some sort of
record delimiters in one of the files, then you can leverage similar
behavior in your InputFormat.

Or else if you know how many splits you are going to have ahead of time
(Even without knowing their exact contents), you can just write out a file
that contains the numbers 1 ... N where N is the number of splits you have.
Then use NLineInputFormat to read this file. Each mapper then gets its
split id as its official record, then you use the FileSystem interface to
access the i'th split of data from within map task i (0 < i <= N).
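
A hedged sketch of the mapper side of that trick (old API; the /data/split-N
layout is hypothetical and the real record processing is left out):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SplitIdMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private JobConf conf;

  public void configure(JobConf conf) {
    this.conf = conf;
  }

  // The NLineInputFormat "record" is just the split id we wrote to the file.
  public void map(LongWritable offset, Text splitId,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    Path dataPath = new Path("/data/split-" + splitId.toString().trim());
    FileSystem fs = dataPath.getFileSystem(conf);
    BufferedReader in =
        new BufferedReader(new InputStreamReader(fs.open(dataPath)));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        output.collect(splitId, new Text(line));  // real processing goes here
      }
    } finally {
      in.close();
    }
  }
}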


I hope these suggestions help.
Good luck,
- Aaron



On Mon, Jun 1, 2009 at 4:57 PM, Kunsheng Chen ke...@yahoo.com wrote:


 Hi Sun,

 Sounds like you are dealing with something I met before.

 The format of my file is something like this, there is only a space between
 each element

 source destination degree timestamp.

 for example:

 http://www.google.com http://something.net 3 1234555


 The source is my key for map and reduce,

 I just use 'String [] splits = value.toString().split(" ");' to split
 everything.


 Maybe you are looking for something more complicated, hope this helps.





 --- On Mon, 6/1/09, Zhengguo 'Mike' SUN zhengguo...@yahoo.com wrote:

  From: Zhengguo 'Mike' SUN zhengguo...@yahoo.com
  Subject: what is the efficient way to implement InputFormat
  To: core-user@hadoop.apache.org
  Date: Monday, June 1, 2009, 8:52 PM
  Hi, All,
 
  The input of my MapReduce job is two large txt files. And
  an InputSplit consists of a portion of the file from both
  files. And this Split is content dependent. So I have to
  read the input file to generate a split. Now the thing is
  that most of the time is spent in generating these splits.
  The Map and Reduce phases actually take less time than that.
  I was wondering if there is an efficient way to generate
  splits from files. My InputFormat class is based on
  FileInputFormat. The getSplits function of FileInputFormat
  doesn't read input file. But this is impossible for me
  because my split depends on the content of the file.
 
  Any ideas or comments are appreciated.
 
 
 






Re: Distributed Lucene Questions

2009-06-01 Thread Aaron Kimball
According to the JIRA issue mentioned, it doesn't seem to be production
ready. The author claims it is a prototype... not ready for inclusion. The
patch was posted over a year ago, and there's been no further work or
discussion. You might want to email the author Mark Butler (
https://issues.apache.org/jira/secure/ViewProfile.jspa?name=butlermh)
directly.

This probably renders the performance data question moot.

In general, if you're using Hadoop and HDFS to serve some content, updates
will need to be performed by rewriting a whole index. So frequent updates
are going to be troublesome. Likely what will happen is that updates will
need to be batched up, rewritten to new index files, and then those will be
installed in place of the outdated ones. I haven't read their design doc,
though, so they might do something different, but since HDFS doesn't allow
for modification of closed files, it'll be challenging to be more clever
than that.

You might want to go with option (1) and investigate using something like
memcached, etc, to manage interactive query load.

- Aaron

On Mon, Jun 1, 2009 at 9:54 AM, Tarandeep Singh tarand...@gmail.com wrote:

 Hi All,

 I am trying to build a distributed system to build and serve lucene
 indexes.
 I came across the Distributed Lucene project-
 http://wiki.apache.org/hadoop/DistributedLucene
 https://issues.apache.org/jira/browse/HADOOP-3394

 and have a couple of questions. It will be really helpful if someone can
 provide some insights.

 1) Is this code production ready?
 2) Does someone has performance data for this project?
 3) It allows searches and updates/deletes to be performed at the same time.
 How well the system will perform if there are frequent updates to the
 system. Will it handle the search and update load easily or will it be
 better to rebuild or update the indexes on different machines and then
 deploy the indexes back to the machines that are serving the indexes?

 Basically I am trying to choose between the 2 approaches-

 1) Use Hadoop to build and/or update Lucene indexes and then deploy them on
 separate cluster that will take care or load balancing, fault tolerance
 etc.
 There is a package in Hadoop contrib that does this, so I can use that
 code.

 2) Use and/or modify the Distributed Lucene code.

 I am expecting daily updates to our index so I am not sure if Distribtued
 Lucene code (which allows searches and updates on the same indexes) will be
 able to handle search and update load efficiently.

 Any suggestions ?

 Thanks,
 Tarandeep



Re: Pseudo-distributed mode problem

2009-06-01 Thread Aaron Kimball
Can you post the contents of your hadoop-site.xml file here?
- Aaron

On Sat, May 30, 2009 at 2:44 AM, Vasyl Keretsman vasi...@gmail.com wrote:

 Hi all,

 I am just getting started with hadoop 0.20 and trying to run a job in
 pseudo-distributed mode.

 I configured hadoop according to the tutorial, but it seems it does
 not work as expected.

 My map/reduce tasks are running sequencially and output result is
 stored on local filesystem instead of the dfs space.
 Job tracker does not see the running job at all.
 I have checked the logs but don't see any errors either. I have also
 copied some files manually to the dfs to make sure it works.

 The only difference between the manual and my configuration is that I
 had to change the ports for the job tracker and namenode as 9000 and
 9001 are already used by other apps on my workstation.

 Any hints?

 Thanks

 Regards,

 Vasyl



Re: Username in Hadoop cluster

2009-05-26 Thread Aaron Kimball
A slightly longer answer:

If you're starting daemons with bin/start-dfs.sh or start-all.sh, you'll
notice that these defer to hadoop-daemons.sh to do the heavy lifting. This
evaluates the string: cd $HADOOP_HOME \; $bin/hadoop-daemon.sh --config
$HADOOP_CONF_DIR $@ and passes it to an underlying loop to execute on all
the slaves via ssh.

$bin and $HADOOP_HOME are thus macro-replaced on the server-side. The more
problematic one here is the $bin one, which resolves to the absolute path of
the cwd on the server that is starting Hadoop.

You've got three basic options:
1) Install Hadoop in the exact same path on all nodes
2) Modify bin/hadoop-daemons.sh to do something more clever on your system
by deferring evaluation of HADOOP_HOME and the bin directory (probably
really hairy; you might have to escape the variable names more than once
since there's another script named slaves.sh that this goes through)
3) Start the slaves manually on each node by logging in yourself, and
doing a cd $HADOOP_HOME && bin/hadoop-daemon.sh datanode start

As a shameless plug, Cloudera's distribution for Hadoop (
www.cloudera.com/hadoop) will also provide init.d scripts so that you can
start Hadoop daemons via the 'service' command. By default, the RPM
installation will also standardize on the hadoop username. But you can't
install this without being root.

- Aaron

On Tue, May 26, 2009 at 12:30 PM, Alex Loddengaard a...@cloudera.com wrote:

 It looks to me like you didn't install Hadoop consistently across all
 nodes.

 xxx.xx.xx.251: bash:
  /home/utdhadoop1/Hadoop/

 hadoop-0.18.3/bin/hadoop-daemon.sh: No such file or
 directory

 The above makes me suspect that xxx.xx.xx.251 has Hadoop installed
 differently.  Can you try and locate hadoop-daemon.sh on xxx.xx.xx.251 and
 adjust its location properly?

 Alex

 On Mon, May 25, 2009 at 10:25 PM, Pankil Doshi forpan...@gmail.com
 wrote:

  Hello,
 
  I tried adding usern...@hostname for each entry in the slaves file.

  My slaves file has 2 data nodes. It looks like below:
 
  localhost
  utdhado...@xxx.xx.xx.229
  utdhad...@xxx.xx.xx.251
 
 
  error what I get when i start dfs is as below:
 
  starting namenode, logging to
 
 
 /home/utdhadoop1/Hadoop/hadoop-0.18.3/bin/../logs/hadoop-utdhadoop1-namenode-opencirrus-992.hpl.hp.com.out
  xxx.xx.xx.229: starting datanode, logging to
 
 
 /home/utdhadoop1/Hadoop/hadoop-0.18.3/bin/../logs/hadoop-utdhadoop1-datanode-opencirrus-992.hpl.hp.com.out
  *xxx.xx.xx.251: bash: line 0: cd:
  /home/utdhadoop1/Hadoop/hadoop-0.18.3/bin/..: No such file or directory
  xxx.xx.xx.251: bash:
  /home/utdhadoop1/Hadoop/hadoop-0.18.3/bin/hadoop-daemon.sh: No such file
 or
  directory
  *localhost: datanode running as process 25814. Stop it first.
  xxx.xx.xx.229: starting secondarynamenode, logging to
 
 
 /home/utdhadoop1/Hadoop/hadoop-0.18.3/bin/../logs/hadoop-utdhadoop1-secondarynamenode-opencirrus-992.hpl.hp.com.out
  localhost: secondarynamenode running as process 25959. Stop it first.
 
 
 
  Basically it looks for /home/utdhadoop1/Hadoop/hadoop-0.18.3/bin/hadoop-daemon.sh
  but instead it should look for /home/utdhadoop/Hadoop/... as
  xxx.xx.xx.251 has the username utdhadoop.
 
  Any inputs??
 
  Thanks
  Pankil
 
  On Wed, May 20, 2009 at 6:30 PM, Todd Lipcon t...@cloudera.com wrote:
 
   On Wed, May 20, 2009 at 4:14 PM, Alex Loddengaard a...@cloudera.com
   wrote:
  
First of all, if you can get all machines to have the same user, that
   would
greatly simplify things.
   
If, for whatever reason, you absolutely can't get the same user on
 all
machines, then you could do either of the following:
   
1) Change the *-all.sh scripts to read from a slaves file that has
 two
fields: a host and a user
  
  
   To add to what Alex said, you should actually already be able to do
 this
   with the existing scripts by simply using the format usern...@hostname
 
   for
   each entry in the slaves file.
  
   -Todd
  
 



Re: Question regarding dwontime model for DataNodes

2009-05-26 Thread Aaron Kimball
A DataNode is regarded as dead after 10 minutes of inactivity. If a
DataNode is down for less than 10 minutes and quickly returns, then nothing
happens. After 10 minutes, its blocks are assumed to be lost forever. So the
NameNode then begins scheduling those blocks for re-replication from the two
surviving copies. There is no real-time bound on how long this process
takes. It's a function of your available network bandwidth, server loads,
disk speeds, amount of storage used, etc. Blocks which are down to a single
surviving replica are prioritized over blocks which have two surviving
replicas (in the event of a simultaneous or near-simultaneous double fault),
since they are more vulnerable.

If a DataNode does reappear, then its re-replication is cancelled, and
over-provisioned blocks are scaled back to the target number of replicas. If
that machine comes back with all blocks intact, then the node needs no
rebuilding. (In fact, some of the over-provisioned replicas that get
removed might be from the original node, if they're available elsewhere
too!)

Don't forget that machines in Hadoop do not have strong notions of identity.
If a particular machine is taken offline and its disks are wiped, the blocks
which were there (which also existed in two other places) will be
re-replicated elsewhere from the live copies. When that same machine is then
brought back online, it has no incentive to copy back all the blocks that
it used to have, as there will be three replicas elsewhere in the cluster.
Blocks are never permanently bound to particular machines.

If you add recommissioned or new nodes to a cluster, you should run the
rebalancing script which will take a random sampling of blocks from
heavily-laden nodes and move them onto emptier nodes in an attempt to spread
the data as evenly as possible.

- Aaron


On Tue, May 26, 2009 at 3:08 PM, Joe Hammerman jhammer...@videoegg.comwrote:

 Hello Hadoop Users list:

We are running Hadoop version 0.18.2. My team lead has asked
 me to investigate the answer to a particular question regarding Hadoop's
 handling of offline DataNodes - specifically, we would like to know how long
 a node can be offline before it is totally rebuilt when it has been readded
 to the cluster.
From what I've been able to determine from the documentation
 it appears to me that the NameNode will simply begin scheduling block
 replication on its remaining cluster members. If the offline node comes back
 online, and it reports all its blocks as being uncorrupted, then the
 NameNode just cleans up the extra blocks.
In other words, there is no explicit handling based on the
 length of the outage - the behavior of the cluster will depend entirely on
 the outage duration.

Anyone care to shed some light on this?

Thanks!
 Regards,
Joseph Hammerman



Re: Runtime error for wordcount-like example.

2009-05-25 Thread Aaron Kimball
You said the folder may_24 might be there -- but you need to provide the
full name of a class, not just a package. A folder denotes a package. So is
there a may_24.class file? Or is the main() method in some may_24/Foo.class?
If so, then you need to run bin/hadoop jar hadoop_out_degree.jar may_24.Foo
(args...)

Creating the jar using Sun's jar program should automatically insert the
necessary manifest.

What does 'jar tf hadoop_out_degree.jar' return?
- Aaron

On Sun, May 24, 2009 at 4:54 PM, Edward J. Yoon edwardy...@apache.orgwrote:

 It looks like that you miss some packages (xxx.xxx.xx.may_24). Check
 the package names.

 If not, try to run the hadoop_out_degree.jar with main class.

 On Mon, May 25, 2009 at 8:36 AM, Kunsheng Chen ke...@yahoo.com wrote:
 
  Hello everyone,
 
  I am getting some issues when compiling a map-reduce program similar to
 wordcount example.
 
  I followed the instruction in the tutorial and succeeded to compile a jar
 file.
 
  The problem won't appear until I run it as below:
 
  had...@lobo:/osr/Projects/anansi/mrjobs/src$
 /osr/Projects/anansi/hadoop/bin/hadoop jar ./hadoop_out_degree.jar may_24
 outout1/
  java.lang.ClassNotFoundException: may_24
 at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
 at java.security.AccessController.doPrivileged(Native Method)
 at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
 at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:247)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
 at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
 at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)
 
 
  The folder 'may_24' is definitely there, I am not sure whether I missing
 something when compile, do I have to add manifest file in the class folders
 ?
 
  also I am confused about the error info.  Could anyone give me some idea
 ?
 
 
  Thanks in advance,
 
  -KUn
 
 
 
 
 



 --
 Best Regards, Edward J. Yoon @ NHN, corp.
 edwardy...@apache.org
 http://blog.udanax.org



Re: Specifying NameNode externally to hadoop-site.xml

2009-05-25 Thread Aaron Kimball
Same way.

Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://foo");
FileSystem fs = FileSystem.get(conf);

- Aaron

On Mon, May 25, 2009 at 1:02 PM, Stas Oskin stas.os...@gmail.com wrote:

 Hi.

 And if I don't use jobs but only DFS for now?

 Regards.

 2009/5/25 jason hadoop jason.had...@gmail.com

  conf.set("fs.default.name", "hdfs://host:port");
  where conf is the JobConf object of your job, before you submit it.
 
 
  On Mon, May 25, 2009 at 10:16 AM, Stas Oskin stas.os...@gmail.com
 wrote:
 
   Hi.
  
   Thanks for the tip, but is it possible to set this in dynamic way via
  code?
  
   Thanks.
  
   2009/5/25 jason hadoop jason.had...@gmail.com
  
if you launch your jobs via bin/hadoop jar jar_file [main class]
[options]
   
you can simply specify -fs hdfs://host:port before the jar_file
   
On Sun, May 24, 2009 at 3:02 PM, Stas Oskin stas.os...@gmail.com
   wrote:
   
 Hi.

 I'm looking to move the Hadoop NameNode URL outside the
  hadoop-site.xml
 file, so I could set it at the run-time.

 Any idea how to do it?

 Or perhaps there is another configuration that can be applied to
 the
 FileSystem object?

 Regards.

   
   
   
--
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals
   
  
 
 
 
  --
  Alpha Chapters of my book on Hadoop are available
  http://www.apress.com/book/view/9781430219422
  www.prohadoopbook.com a community for Hadoop Professionals
 



Re: how to do pairwise reduce in multiple steps?!

2009-05-25 Thread Aaron Kimball
Tim,

There is not a direct way to set up an arbitrary-depth chain of reduces as
part of the same job. You have two basic options:

1) Do a level of mapping followed by a single level of your tree reduce.
Then follow up with an IdentityMapper forwarding your data to a second level
of your tree reduce. Repeat this part until you're done (a rough driver
sketch is at the end of this message).

2) Leverage the sorted nature of keys being sent into your reducer to
topologically sort your results before they are reduced pairwise. Each
reduce task receives all keys that return the same partition number from the
Partitioner; the default partitioner returns a partition id of
key.hashCode() % numReduceTasks. So by playing with the key's hashCode()
method, you can force elements of the same subtree to go to the same
reducer. The reducer will then crunch that set of keys and their related
values, in sorted order by the compareTo() method of the keys.

e.g., for the example below, you could make sure that x, y, z, and w all had
the same key.hashCode() so they arrive at the same reducer; then have their
compareTo() methods sort them in order: x, y, z, w. You can then compute
f(x, y) and store that in memory somewhere (call it 'a'), then compute f(z,
w) and store that in 'b'. Then since you know that you computed a full level
of the tree, you'd go back through your second-level array in the same
reduce task, computing f(a, b), and emitting that to the output collector.
You could also emit sentinel-values from the mapper to your reducers to
denote the end of a level of the tree, so that you don't have to count in
the reducer itself.
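
To make that concrete, here is a rough, untested sketch of a key type that
does this (the fields subtreeId and position are my own placeholders for
however you identify subtree membership and left-to-right order):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Implements the raw WritableComparable interface so it compiles against
// both older and newer Hadoop releases.
public class TreeKey implements WritableComparable {
  private int subtreeId;  // which subtree this element belongs to
  private int position;   // left-to-right position inside that subtree

  public TreeKey() {}
  public TreeKey(int subtreeId, int position) {
    this.subtreeId = subtreeId;
    this.position = position;
  }

  public void write(DataOutput out) throws IOException {
    out.writeInt(subtreeId);
    out.writeInt(position);
  }

  public void readFields(DataInput in) throws IOException {
    subtreeId = in.readInt();
    position = in.readInt();
  }

  // All keys of one subtree hash identically, so the default HashPartitioner
  // routes them to the same reduce task.
  public int hashCode() {
    return subtreeId;
  }

  // Within that reduce task, keys arrive sorted by subtree and then by
  // position -- e.g. x, y, z, w -- so the reducer can fold them pairwise.
  public int compareTo(Object o) {
    TreeKey other = (TreeKey) o;
    if (subtreeId != other.subtreeId) {
      return subtreeId < other.subtreeId ? -1 : 1;
    }
    if (position != other.position) {
      return position < other.position ? -1 : 1;
    }
    return 0;
  }

  public boolean equals(Object o) {
    if (!(o instanceof TreeKey)) return false;
    TreeKey t = (TreeKey) o;
    return subtreeId == t.subtreeId && position == t.position;
  }
}

With the default HashPartitioner, every key of a given subtree then lands in
the same reduce task, already sorted by position.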

The amount of parallelism you can achieve is limited roughly by the amount
of memory you'll need on a task node to buffer the intermediate values.

After each reducer computes an aggregate over a subtree, you'll have only as
many results as you have reduce tasks -- maybe 64 or 128 or so -- at which
point the final aggregation can be done single-threaded in the main program.

It's a bit of a stretch, but you could probably flex MapReduce into doing
this for you.
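
Going back to option 1, here is a rough, untested sketch of the driver loop
(MyMapper, TreeReducer, numLevels, and the level-N paths are all
placeholders for your own classes and bookkeeping):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class TreeDriver {
  public static void main(String[] args) throws Exception {
    int numLevels = Integer.parseInt(args[0]);    // depth of the reduce tree

    // Level 0: a real map phase feeding the first round of pairwise reduces.
    JobConf first = new JobConf(TreeDriver.class);
    first.setMapperClass(MyMapper.class);       // placeholder: your mapper
    first.setReducerClass(TreeReducer.class);   // placeholder: pairwise combiner
    FileInputFormat.setInputPaths(first, new Path("input"));
    FileOutputFormat.setOutputPath(first, new Path("level-0"));
    JobClient.runJob(first);

    // Levels 1..N: forward the data unchanged and reduce pairwise again.
    for (int level = 1; level <= numLevels; level++) {
      JobConf pass = new JobConf(TreeDriver.class);
      pass.setMapperClass(IdentityMapper.class);
      pass.setReducerClass(TreeReducer.class);
      FileInputFormat.setInputPaths(pass, new Path("level-" + (level - 1)));
      FileOutputFormat.setOutputPath(pass, new Path("level-" + level));
      JobClient.runJob(pass);
    }
  }
}

Each pass reads the previous level's output directory and roughly halves the
number of partial aggregates.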

- Aaron


On Mon, May 25, 2009 at 12:19 AM, Tim Kiefer tim-kie...@gmx.de wrote:

 Hi everyone,

 I do have an application that requires to combine/reduce results in pairs -
 and the results of this step again in pairs. The binary way ;). In case it
 is not clear what is meant, have a look at the schema:

 Assume to have 4 inputs (in my case from an earlier MR job). Now the first
 two inputs need to be combined and the second two. The results need to be
 combined again.

 x ---
|- x*y ---
 y ---  |
  |- x*y*z*w
 z ---  |
|- z*w ---
 w ---

 Note that building x*y from x and y is an expensive operation which is why
 I want to use 2 nodes to combine the 4 inputs in the first step. It is not
 an option to collect all inputs and do the combinations on 1 node.

 I guess I basically want to know whether there is a way to have logN reduce
 steps in a row to combine all inputs to just one final output?!

 Any suggestions are appreciated :)

 - Tim





Re: input/output error while setting up superblock

2009-05-22 Thread Aaron Kimball
More specifically:

HDFS does not support operations such as opening a file for write/append
after it has already been closed, or seeking to a new location in a writer.
You can only write files linearly; all other operations will return a "not
supported" error.

You'll also find that random-access read performance, while implemented, is
not particularly high-throughput. For serving Xen images even in read-only
mode, you'll likely have much better luck with a different FS.

- Aaron


2009/5/22 Taeho Kang tka...@gmail.com

 I don't think HDFS is a good place to store your Xen image file as it will
 likely be updated/appended frequently in small blocks. With the way HDFS is
 designed for, you can't quite use it like a regular filesystem (e.g. ones
 that support frequent small block appends/updates in files). My suggestion
 is to use another filesystem like NAS or SAN.

 /Taeho

 2009/5/22 신승엽 mikas...@naver.com

  Hi, I have a problem to use hdfs.
 
  I mounted hdfs using fuse-dfs.
 
  I created a dummy file for 'Xen' in hdfs and then formated the dummy file
  using 'mke2fs'.
 
  But the operation was faced error. The error message is as follows.
 
  [r...@localhost hdfs]# mke2fs -j -F ./file_dumy
  mke2fs 1.40.2 (12-Jul-2007)
  ./file_dumy: Input/output error while setting up superblock
  Also, I copyed an image file of xen to hdfs. But Xen couldn't the image
  files in hdfs.
 
  r...@localhost hdfs]# fdisk -l fedora6_demo.img
  last_lba(): I don't know how to handle files with mode 81a4
  You must set cylinders.
  You can do this from the extra functions menu.
 
  Disk fedora6_demo.img: 0 MB, 0 bytes
  255 heads, 63 sectors/track, 0 cylinders
  Units = cylinders of 16065 * 512 = 8225280 bytes
 
Device Boot  Start End  Blocks   Id  System
  fedora6_demo.img1   *   1 156 1253038+  83  Linux
 
  Could you answer me anything about this problem.
 
  Thank you.
 



Re: ssh issues

2009-05-21 Thread Aaron Kimball
Pankil,

That means that either you're using the wrong ssh key and it's falling back
to password authentication, or else you created your ssh keys with
passphrases attached; try making new ssh keys with ssh-keygen and
distributing those to start again?

- Aaron

On Thu, May 21, 2009 at 3:49 PM, Pankil Doshi forpan...@gmail.com wrote:

 The problem is that it also prompts for the pass phrase.

 On Thu, May 21, 2009 at 2:14 PM, Brian Bockelman bbock...@cse.unl.edu
 wrote:

  Hey Pankil,
 
  Use ~/.ssh/config to set the default key location to the proper place for
  each host, if you're going down that route.
 
  I'd remind you that SSH is only used as a convenient method to launch
  daemons.  If you have a preferred way to start things up on your cluster,
  you can use that (I think most large clusters don't use ssh... could be
  wrong).
 
  Brian
 
 
  On May 21, 2009, at 2:07 PM, Pankil Doshi wrote:
 
   Hello everyone,
 
  I got hint how to solve the problem where clusters have different
  usernames.but now other problem I face is that i can ssh a machine by
  using
  -i path/to key/ ..I cant ssh them directly but I will have to always
 pass
  the key.
 
  Now i face problem in ssh-ing my machines.Does anyone have any ideas how
  to
  deal with that??
 
  Regards
  Pankil
 
 
 



Re: difference between bytes read and local bytes read?

2009-05-21 Thread Aaron Kimball
I believe that "bytes read" is the total volume of data read by the mappers,
whereas "local bytes read" refers to the bytes read by map tasks that could
be scheduled on machines already holding local copies of their inputs. So
the ratio gives a rough measure of the scheduler's locality efficiency.

- Aaron

On Tue, May 19, 2009 at 12:54 PM, Foss User foss...@gmail.com wrote:

 When we see the job details on the job tracker web interface, we see
 bytes read as well as local bytes read. What is the difference
 between the two?



Re: Setting up another machine as secondary node

2009-05-20 Thread Aaron Kimball
See this regarding instructions on configuring a 2NN on a separate machine
from the NN:
http://www.cloudera.com/blog/2009/02/10/multi-host-secondarynamenode-configuration/

- Aaron

On Thu, May 14, 2009 at 10:42 AM, Koji Noguchi knogu...@yahoo-inc.comwrote:

 Before 0.19, fsimage/edits were on the same directory.
 So whenever secondary finishes checkpointing, it copies back the fsimage
 while namenode still kept on writing to the edits file.

 Usually we observed some latency on the namenode side during that time.

 HADOOP-3948 would probably help after 0.19 or later.

 Koji

 -Original Message-
 From: Brian Bockelman [mailto:bbock...@cse.unl.edu]
 Sent: Thursday, May 14, 2009 10:32 AM
 To: core-user@hadoop.apache.org
 Subject: Re: Setting up another machine as secondary node

 Hey Koji,

 It's an expensive operation - for the secondary namenode, not the
 namenode itself, right?  I don't particularly care if I stress out a
 dedicated node that doesn't have to respond to queries ;)

 Locally we checkpoint+backup fairly frequently (not 5 minutes ...
 maybe less than the default hour) due to sheer paranoia of losing
 metadata.

 Brian

 On May 14, 2009, at 12:25 PM, Koji Noguchi wrote:

  The secondary namenode takes a snapshot
  at 5 minute (configurable) intervals,
 
  This is a bit too aggressive.
  Checkpointing is still an expensive operation.
  I'd say every hour or even every day.
 
  Isn't the default 3600 seconds?
 
  Koji
 
  -Original Message-
  From: jason hadoop [mailto:jason.had...@gmail.com]
  Sent: Thursday, May 14, 2009 7:46 AM
  To: core-user@hadoop.apache.org
  Subject: Re: Setting up another machine as secondary node
 
  any machine put in the conf/masters file becomes a secondary namenode.
 
  At some point there was confusion on the safety of more than one
  machine,
  which I believe was settled, as many are safe.
 
  The secondary namenode takes a snapshot at 5 minute (configurable)
  intervals, rebuilds the fsimage and sends that back to the namenode.
  There is some performance advantage of having it on the local machine,
  and
  some safety advantage of having it on an alternate machine.
  Could someone who remembers speak up on the single vrs multiple
  secondary
  namenodes?
 
 
  On Thu, May 14, 2009 at 6:07 AM, David Ritch david.ri...@gmail.com
  wrote:
 
  First of all, the secondary namenode is not a what you might think a
  secondary is - it's not failover device.  It does make a copy of the
  filesystem metadata periodically, and it integrates the edits into
  the
  image.  It does *not* provide failover.
 
  Second, you specify its IP address in hadoop-site.xml.  This is where
  you
  can override the defaults set in hadoop-default.xml.
 
  dbr
 
  On Thu, May 14, 2009 at 9:03 AM, Rakhi Khatwani
  rakhi.khatw...@gmail.com
  wrote:
 
  Hi,
 I wanna set up a cluster of 5 nodes in such a way that
  node1 - master
  node2 - secondary namenode
  node3 - slave
  node4 - slave
  node5 - slave
 
 
  How do we go about that?
  there is no property in hadoop-env where i can set the ip-address
  for
  secondary name node.
 
  if i set node-1 and node-2 in masters, and when we start dfs, in
  both the
  m/cs, the namenode n secondary namenode processes r present. but i
  think
  only node1 is active.
  n my namenode fail over operation fails.
 
  ny suggesstions?
 
  Regards,
  Rakhi
 
 
 
 
 
  --
  Alpha Chapters of my book on Hadoop are available
  http://www.apress.com/book/view/9781430219422
  www.prohadoopbook.com a community for Hadoop Professionals




Re: How to Rename Create Mysql DB Table in Hadoop?

2009-05-20 Thread Aaron Kimball
For your use case, you'll need to just do a ranged import (i.e., SELECT *
FROM foo WHERE id > X AND id <= Y), and then delete the same records after
the import succeeds (DELETE FROM foo WHERE id > X AND id <= Y). Before the
import, you can SELECT max(id) FROM foo to establish what Y should be; X is
informed by the previous import operation.
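
In plain JDBC, that cycle boils down to something like the untested sketch
below (the table and column names, JDBC URL, and credentials are
placeholders; error handling and the actual write to HDFS are omitted):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class RangedImport {
  public static void main(String[] args) throws Exception {
    // X = upper bound of the previous import, persisted somewhere durable.
    long lastImportedId = Long.parseLong(args[0]);

    Class.forName("com.mysql.jdbc.Driver");
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://dbhost/mydb", "user", "password");

    // Establish Y once, so rows written during the import are left alone.
    Statement st = conn.createStatement();
    ResultSet maxRs = st.executeQuery("SELECT MAX(id) FROM foo");
    maxRs.next();
    long maxId = maxRs.getLong(1);

    // Import the range (X, Y]; each row would be written to HDFS here.
    PreparedStatement sel = conn.prepareStatement(
        "SELECT * FROM foo WHERE id > ? AND id <= ?");
    sel.setLong(1, lastImportedId);
    sel.setLong(2, maxId);
    ResultSet rows = sel.executeQuery();
    while (rows.next()) {
      // write the record out (e.g., append to a SequenceFile in HDFS) ...
    }

    // Only after the import succeeds, delete exactly the same range.
    PreparedStatement del = conn.prepareStatement(
        "DELETE FROM foo WHERE id > ? AND id <= ?");
    del.setLong(1, lastImportedId);
    del.setLong(2, maxId);
    del.executeUpdate();

    conn.close();
  }
}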

You said that you're concerned with the performance of DELETE, but I don't
know a better way around this if all your input sources are forced to write
to the same table. Ideally you could have a "current" table and a "frozen"
table; writes always go to the current table and the import is done from the
frozen table. Then you can DROP TABLE frozen relatively quickly post-import.
At the time of the next import you swap which table is current and which is
frozen, and repeat. In MySQL you can create updatable views, so you might
want to use a view as an indirection pointer to synchronously switch all
your writers from one underlying table to the other.

I'll put a shameless plug here -- I'm developing a tool called sqoop
designed to import from databases into HDFS; patch is available at
http://issues.apache.org/jira/browse/hadoop-5815. It doesn't currently have
support for WHERE clauses, but it's on the roadmap. Please check it out and
let me know what you think.

Cheers,
- Aaron


On Wed, May 20, 2009 at 9:48 AM, dealmaker vin...@gmail.com wrote:


 No, my prime objective is not to backup db.  I am trying to move the
 records
 from mysql db to hadoop for processing.  Hadoop itself doesn't keep any
 records.  After that, I will remove the same mysql records processed in
 hadoop from the mysql db.  The main point isn't about getting the mysql
 records, the main point is removing the same mysql records that are
 processed in hadoop from mysql db.


 Edward J. Yoon-2 wrote:
 
  Oh.. According to my understanding, To maintain a steady DB size,
  delete and backup the old records. If so, I guess you can continuously
  do that using WHERE and LIMIT clauses. Then you can reduce the I/O
  costs..  It should be dumped at once?
 
  On Thu, May 21, 2009 at 12:48 AM, dealmaker vin...@gmail.com wrote:
 
  Other parts of the non-hadoop system will continue to add records to
  mysql db
  when I move  those records (and remove the very same records from mysql
  db
  at the same time) to hadoop for processing.  That's why I am doing those
  mysql commands.
 
  What are you suggesting?  If I do it like you suggest, dump all records
  from
  mysql db to a file in hdfs, how do I remove those very same records from
  the
  mysql db at the same time?  Just rename it first and then dump them and
  then
  read them from the hdfs file?
 
  or should I do it my way?  which way is faster?
  Thanks.
 
 
  Edward J. Yoon-2 wrote:
 
  Hadoop is a distributed filesystem. If you wanted to backup your table
  data to hdfs, you can use SELECT * INTO OUTFILE 'file_name' FROM
  tbl_name; Then, put it to hadoop dfs.
 
  Edward
 
  On Thu, May 21, 2009 at 12:08 AM, dealmaker vin...@gmail.com wrote:
 
  No, actually I am using mysql.  So it doesn't belong to Hive, I think.
 
 
  owen.omalley wrote:
 
 
  On May 19, 2009, at 11:48 PM, dealmaker wrote:
 
 
  Hi,
   I want to backup a table and then create a new empty one with
  following
  commands in Hadoop.  How do I do it in java?  Thanks.
 
  Since this is a question about Hive, you should be asking on
  hive-u...@hadoop.apache.org
  .
 
  -- Owen
 
 
 
  --
  View this message in context:
 
 http://www.nabble.com/How-to-Rename---Create-DB-Table-in-Hadoop--tp23629956p23637131.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 
 
 
 
  --
  Best Regards, Edward J. Yoon @ NHN, corp.
  edwardy...@apache.org
  http://blog.udanax.org
 
 
 
  --
  View this message in context:
 
 http://www.nabble.com/How-to-Rename---Create-Mysql-DB-Table-in-Hadoop--tp23629956p23638051.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 
 
 
 
  --
  Best Regards, Edward J. Yoon @ NHN, corp.
  edwardy...@apache.org
  http://blog.udanax.org
 
 

 --
 View this message in context:
 http://www.nabble.com/How-to-Rename---Create-Mysql-DB-Table-in-Hadoop--tp23629956p23639294.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: Linking against Hive in Hadoop development tree

2009-05-20 Thread Aaron Kimball
I've worked around needing any compile-time dependencies for now. :) No
longer an issue.

- Aaron


On Wed, May 20, 2009 at 10:29 AM, Ashish Thusoo athu...@facebook.comwrote:

 You could either do what Owen suggested and put the plugin in hive contrib,
 or you could just put the whole thing in hive contrib as then you would have
 access to all the lower level api (core, hdfs, hive etc.). Owen's approach
 makes a lot of sense if you think that the hive dependency is a loose one
 and you would have plugins for other systems to achieve your goal. However,
 if this is a hard dependency, then putting it in hive contrib make more
 sense. Either approach is fine, depending upon your goals.

 Ashish

 -Original Message-
 From: Owen O'Malley [mailto:omal...@apache.org]
 Sent: Wednesday, May 20, 2009 5:39 AM
 To: core-user@hadoop.apache.org
 Subject: Re: Linking against Hive in Hadoop development tree


 On May 15, 2009, at 3:25 PM, Aaron Kimball wrote:

  Yikes. So part of sqoop would wind up in one source repository, and
  part in another? This makes my head hurt a bit.

 I'd say rather that Sqoop is in Mapred and the adapter to Hive is in Hive.

  I'm also not convinced how that helps.

 Clearly, what you need to arrange is to not have a compile time dependence
 on Hive. Clearly we don't want cycles in the dependence tree, so you need to
 figure out how to make the adapter for Hive a plugin rather than a part of
 the Sqoop core.

 -- Owen



Linking against Hive in Hadoop development tree

2009-05-15 Thread Aaron Kimball
Hi all,

For the database import tool I'm writing (Sqoop; HADOOP-5815), in addition
to uploading data into HDFS and using MapReduce to load/transform the data,
I'd like to integrate more closely with Hive. Specifically, to run the
CREATE TABLE statements needed to automatically inject table definitions into
Hive's metastore for the data files that sqoop loads into HDFS. Doing this
requires linking against Hive in some way (either directly by using one of
their API libraries, or loosely by piping commands into a Hive instance).

In either case, there's a dependency there. I was hoping someone on this
list with more Ivy experience than I knows what's the best way to make this
happen. Hive isn't in the maven2 repository that Hadoop pulls most of its
dependencies from. It might be necessary for sqoop to have access to a full
build of Hive. It doesn't seem like a good idea to check that binary
distribution into Hadoop svn, but I'm not sure what's the most expedient
alternative. Is it acceptable to just require that developers who wish to
compile/test/run sqoop have a separate standalone Hive deployment and a
proper HIVE_HOME variable? This would keep our source repo clean. The
downside here is that it makes it difficult to test Hive-specific
integration functionality with Hudson and requires extra leg-work of
developers.

Thanks,
- Aaron Kimball


Re: Linking against Hive in Hadoop development tree

2009-05-15 Thread Aaron Kimball
Yikes. So part of sqoop would wind up in one source repository, and part in
another? This makes my head hurt a bit.

I'm also not convinced how that helps. So if I write (e.g.,)
o.a.h.sqoop.HiveImporter and check that into a contrib module in the Hive
project, then the main sqoop program (o.a.h.sqoop.Sqoop) still needs to
compile against/load at runtime o.a.h.s.HiveImporter. So the net result is
the same: building/running a cohesive program requires fetching resources
from the hive repo and compiling them in.

For the moment, though, I'm finding that the Hive JDBC interface is actually
misbehaving more than I care to wrangle with. My current solution is
actually to generate script files and run them with "hive -f tmpfilename",
which doesn't require any compile-time linkage. So maybe this is a nonissue
for the moment.
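
The approach boils down to something like this untested sketch (using
java.io.File/FileWriter, assumed to live in a method that can throw
IOException/InterruptedException; the DDL text and the assumption that hive
is on the PATH are made up):

// Write the generated DDL to a temp file and shell out to "hive -f".
File script = File.createTempFile("sqoop-hive-", ".q");
FileWriter w = new FileWriter(script);
w.write("CREATE TABLE foo (id INT, name STRING)\n"
      + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\001';\n");
w.close();

Process p = Runtime.getRuntime().exec(
    new String[] { "hive", "-f", script.getAbsolutePath() });
if (p.waitFor() != 0) {
  throw new IOException("hive -f exited with status " + p.exitValue());
}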

- Aaron

On Fri, May 15, 2009 at 3:06 PM, Owen O'Malley omal...@apache.org wrote:

 On May 15, 2009, at 2:05 PM, Aaron Kimball wrote:

  In either case, there's a dependency there.


 You need to split it so that there are no cycles in the dependency tree. In
 the short term it looks like:

 avro:
 core: avro
 hdfs: core
 mapred: hdfs, core
 hive: mapred, core
 pig: mapred, core

 Adding a dependence from core to hive would be bad. To integrate with Hive,
 you need to add a contrib module to Hive that adds it.

 -- Owen



Re: unable to see anything in stdout

2009-04-30 Thread Aaron Kimball
First thing I would do is to run the job in the local jobrunner (as a single
process on your local machine without involving the cluster):

JobConf conf = ...;
// set other params, mapper, etc. here
conf.set("mapred.job.tracker", "local");  // use the LocalJobRunner
conf.set("fs.default.name", "file:///");  // read from the local disk instead of HDFS

JobClient.runJob(conf);


This will actually print stdout, stderr, etc. to your local terminal. Try
this on a single input file. This will let you confirm that it does, in
fact, write to stdout.

- Aaron

On Thu, Apr 30, 2009 at 9:00 AM, Asim linka...@gmail.com wrote:

 Hi,

 I am not able to see any job output in userlogs/task_id/stdout. It
 remains empty even though I have many println statements. Are there
 any steps to debug this problem?

 Regards,
 Asim



Re: Specifying System Properties in the had

2009-04-30 Thread Aaron Kimball
So you want a different -Dfoo=test on each node? It's probably grabbing
the setting from the node where the job was submitted, and this overrides
the settings on each task node.

Try adding <final>true</final> to the property block on the tasktrackers,
then restart Hadoop and try again. This will prevent the job from overriding
the setting.
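
Once that sticks, the child task JVMs see the flag as an ordinary system
property, so (sketching with your property name foo) a mapper or reducer can
read it with:

// inside configure()/map()/reduce(); the child JVM was launched with -Dfoo=test
String cluster = System.getProperty("foo");   // "test" on this cluster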

- Aaron

On Thu, Apr 30, 2009 at 9:25 AM, Marc Limotte mlimo...@feeva.com wrote:

 I'm trying to set a System Property in the Hadoop config, so my jobs will
 know which cluster they are running on.  I think I should be able to do this
 with -Dname=value in mapred.child.java.opts (example below), but the
 setting is ignored.
 In hadoop-site.xml I have:
 <property>
   <name>mapred.child.java.opts</name>
   <value>-Xmx200m -Dfoo=test</value>
 </property>
 But the job conf through the web server indicates:
 mapred.child.java.opts -Xmx1024M -Duser.timezone=UTC

 I'm using Hadoop-0.17.2.1.
 Any tips on why my setting is not picked up?

 Marc

 



Re: Database access in 0.18.3

2009-04-28 Thread Aaron Kimball
Cloudera's Distribution for Hadoop is based on Hadoop 0.18.3 but includes a
backport of HADOOP-2536. You could switch to this distribution instead.
Otherwise, download the 18-branch patch from issue HADOOP-2536 (
http://issues.apache.org/jira/browse/hadoop-2536) and apply it to your local
copy and recompile.
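
Once the patch is in, wiring it up looks roughly like the sketch below. The
class names come from the org.apache.hadoop.mapred.lib.db package that
HADOOP-2536 adds -- double-check them against whichever backport you apply,
since this is untested -- and the JDBC URL, credentials, and table/column
names are placeholders:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBOutputFormat;
import org.apache.hadoop.mapred.lib.db.DBWritable;

public class DbOutputSketch {
  // Reduce output key type: DBOutputFormat calls write(PreparedStatement)
  // on each emitted key and ignores the value.
  public static class WordCountRecord implements Writable, DBWritable {
    String word;
    long count;

    public void write(PreparedStatement stmt) throws SQLException {
      stmt.setString(1, word);
      stmt.setLong(2, count);
    }
    public void readFields(ResultSet rs) throws SQLException {
      word = rs.getString(1);
      count = rs.getLong(2);
    }
    public void write(DataOutput out) throws IOException {
      out.writeUTF(word);
      out.writeLong(count);
    }
    public void readFields(DataInput in) throws IOException {
      word = in.readUTF();
      count = in.readLong();
    }
  }

  public static void configureOutput(JobConf conf) {
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/mydb", "user", "password");
    // Table word_counts with columns (word, count) is assumed to exist.
    DBOutputFormat.setOutput(conf, "word_counts", "word", "count");
    conf.setOutputKeyClass(WordCountRecord.class);
  }
}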

System.out.println() (and System.err) output is written to separate files
for each map and reduce task on each machine; there is no common console.
- Aaron


On Mon, Apr 27, 2009 at 9:13 PM, rajeev gupta graj1...@yahoo.com wrote:


 I need to write output of reduce in database. There is support for this in
 0.19 but I am using 0.18.3. Any suggestion?

 I tried to process output myself in the reduce() function by writing some
 System.out.println; but it is writing output in userlogs corresponding to
 the map function (intermediate output).

 -rajeev







Re: Database access in 0.18.3

2009-04-28 Thread Aaron Kimball
Should also say, the link to CDH is http://www.cloudera.com/hadoop
- Aaron

On Tue, Apr 28, 2009 at 5:06 PM, Aaron Kimball aa...@cloudera.com wrote:

 Cloudera's Distribution for Hadoop is based on Hadoop 0.18.3 but includes a
 backport of HADOOP-2536. You could switch to this distribution instead.
 Otherwise, download the 18-branch patch from issue HADOOP-2536 (
 http://issues.apache.org/jira/browse/hadoop-2536) and apply it to your
 local copy and recompile.

 System.out.println() (and System.err) are writen to separate files for each
 map and reduce task on each machine; there is not a common console.
 - Aaron



 On Mon, Apr 27, 2009 at 9:13 PM, rajeev gupta graj1...@yahoo.com wrote:


 I need to write output of reduce in database. There is support for this in
 0.19 but I am using 0.18.3. Any suggestion?

 I tried to process output myself in the reduce() function by writing some
 System.out.println; but it is writing output in userlogs corresponding to
 the map function (intermediate output).

 -rajeev








Re: Processing High CPU Memory intensive tasks on Hadoop - Architecture question

2009-04-25 Thread Aaron Kimball
Amit,

This can be made to work with Hadoop. Basically, your mapper would do the
heavy load-in during its configure() stage, then process your individual
work items as records during the actual map stage. A map task can comprise
many records, so you'll be fine here.

If you use Hadoop 0.19 or 0.20, you can also enable JVM reuse, where
multiple map tasks are performed serially in the same JVM instance. In this
case, the first task in the JVM would do the heavy load-in process into
static fields or other globally-accessible items; subsequent tasks could
recognize that the system state is already initialized and would not need to
repeat it.
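
Here is a rough, untested sketch of that pattern in the mapred API
(ExpensiveModel, its load()/process() methods, and the model.path property
are placeholders for however you wrap your libraries):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class HeavyInitMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Shared by every task that reuses this JVM; loaded at most once per JVM.
  private static ExpensiveModel model;   // placeholder for your heavy resource

  public void configure(JobConf conf) {
    synchronized (HeavyInitMapper.class) {
      if (model == null) {
        // the 1-1.5 minute load happens only for the first task in this JVM
        model = ExpensiveModel.load(conf.get("model.path"));
      }
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // each record is one work item; processing it takes only seconds
    output.collect(new Text(key.toString()),
        new Text(model.process(value.toString())));
  }
}

With mapred.job.reuse.jvm.num.tasks set to -1 (or some N greater than 1) in
0.19+, later tasks in the same JVM find the static field already populated
and skip the load entirely.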

The number of mapper/reducer tasks that run in parallel on a given node can
be configured with a simple setting (mapred.tasktracker.map.tasks.maximum
and its reduce counterpart); setting it to 6 will work just fine.
The capacity / fairshare schedulers are not what you need here -- their main
function is to ensure that multiple jobs (separate sets of tasks) can all
make progress simultaneously by sharing cluster resources across jobs rather
than running jobs in a FIFO fashion.

- Aaron

On Sat, Apr 25, 2009 at 2:36 PM, amit handa amha...@gmail.com wrote:

 Hi,

 We are planning to use hadoop for some very expensive and long running
 processing tasks.
 The computing nodes that we plan to use are very heavy in terms of CPU and
 memory requirement e.g one process instance takes almost 100% CPU (1 core)
 and around 300 -400 MB of RAM.
 The first time the process loads it can take around 1-1:30 minutes but
 after
 that we can provide the data to process and it takes few seconds to
 process.
 Can I model it on hadoop ?
 Can I have my processes pre-loaded on the task processing machines and the
 data be provided by hadoop? This will save the 1-1:30 minutes of intial
 load
 time that it would otherwise take for each task.
 I want to run a number of these processes in parallel  based on the
 machines
 capacity (e.g 6 instances on a 8 cpu box) or using capacity scheduler.

 Please let me know if this is possible or any pointers to how it can be
 done
 ?

 Thanks,
 Amit



Re: Advice on restarting HDFS in a cron

2009-04-25 Thread Aaron Kimball
If your logs were being written to the root partition (/dev/sda1), that's
going to fill up fast. This partition is always <= 10 GB on EC2 and much of
that space is consumed by the OS install. You should redirect your logs to
some place under /mnt (/dev/sdb1); that's 160 GB.

- Aaron

On Sun, Apr 26, 2009 at 3:21 AM, Rakhi Khatwani rakhi.khatw...@gmail.comwrote:

 Hi,
   I have faced somewhat a similar issue...
   i have a couple of map reduce jobs running on EC2... after a week or so,
 i get a no space on device exception while performing any linux command...
 so end up shuttin down hadoop and hbase, clear the logs and then restart
 them.

 is there a cleaner way to do it???

 thanks
 Raakhi

 On Fri, Apr 24, 2009 at 11:59 PM, Todd Lipcon t...@cloudera.com wrote:

  On Fri, Apr 24, 2009 at 11:18 AM, Marc Limotte mlimo...@feeva.com
 wrote:
 
   Actually, I'm concerned about performance of map/reduce jobs for a
   long-running cluster.  I.e. it seems to get slower the longer it's
  running.
After a restart of HDFS, the jobs seems to run faster.  Not concerned
  about
   the start-up time of HDFS.
  
 
  Hi Marc,
 
  Does it sound like this JIRA describes your problem?
 
  https://issues.apache.org/jira/browse/HADOOP-4766
 
  If so, restarting just the JT should help with the symptoms. (I say
  symptoms
  because this is clearly a problem! Hadoop should be stable and performant
  for months without a cluster restart!)
 
  -Todd
 
 
  
   Of course, as you suggest, this could be poor configuration of the
  cluster
   on my part; but I'd still like to hear best practices around doing a
   scheduled restart.
  
   Marc
  
   -Original Message-
   From: Allen Wittenauer [mailto:a...@yahoo-inc.com]
   Sent: Friday, April 24, 2009 10:17 AM
   To: core-user@hadoop.apache.org
   Subject: Re: Advice on restarting HDFS in a cron
  
  
  
  
   On 4/24/09 9:31 AM, Marc Limotte mlimo...@feeva.com wrote:
I've heard that HDFS starts to slow down after it's been running for
 a
   long
time.  And I believe I've experienced this.
  
   We did an upgrade (== complete restart) of a 2000 node instance in ~20
   minutes on Wednesday. I wouldn't really consider that 'slow', but YMMV.
  
   I suspect people aren't running the secondary name node and therefore
  have
   massively large edits file.  The name node appears slow on restart
  because
   it has to apply the edits to the fsimage rather than having the
 secondary
   keep it up to date.
  
  
   -Original Message-
   From: Marc Limotte
  
   Hi.
  
   I've heard that HDFS starts to slow down after it's been running for a
  long
   time.  And I believe I've experienced this.   So, I was thinking to set
  up a
   cron job to execute every week to shutdown HDFS and start it up again.
  
   In concept, it would be something like:
  
   0 0 0 0 0 $HADOOP_HOME/bin/stop-dfs.sh; $HADOOP_HOME/bin/start-dfs.sh
  
   But I'm wondering if there is a safer way to do this.  In particular:
  
   * What if a map/reduce job is running when this cron hits.  Is
   there a way to suspend jobs while the HDFS restart happens?
  
   * Should I also restart the mapred daemons?
  
   * Should I wait some time after stop-dfs.sh for things to
  settle
   down, before executing start-dfs.sh?  Or maybe I should run a command
  to
   verify that it is stopped before I run the start?
  
   Thanks for any help.
   Marc
  
  
  
 



Re: Processing High CPU Memory intensive tasks on Hadoop - Architecture question

2009-04-25 Thread Aaron Kimball
I'm not aware of any documentation about this particular use case for
Hadoop. I think your best bet is to look into the JNI documentation about
loading native libraries, and go from there.
- Aaron


On Sat, Apr 25, 2009 at 10:44 PM, amit handa amha...@gmail.com wrote:

 Thanks Aaron,

 The processing libs that we use, which take time to load are all c++ based
 .so libs.
 Can i invoke it from JVM during the configure stage of the mapper and keep
 it running as you suggested ?
 Can you point me to some documentation regarding the same ?

 Regards,
 Amit

 On Sat, Apr 25, 2009 at 1:42 PM, Aaron Kimball aa...@cloudera.com wrote:

  Amit,
 
  This can be made to work with Hadoop. Basically, in your mapper's
  configure stage it would do the heavy load-in process, then it would
  process your individual work items as records during the actual map
  stage.
  A map task can be comprised of many records, so you'll be fine here.
 
  If you use Hadoop 0.19 or 0.20, you can also enable JVM reuse, where
  multiple map tasks are performed serially in the same JVM instance. In
 this
  case, the first task in the JVM would do the heavy load-in process into
  static fields or other globally-accessible items; subsequent tasks could
  recognize that the system state is already initialized and would not need
  to
  repeat it.
 
  The number of mapper/reducer tasks that run in parallel on a given node
 can
  be configured with a simple setting; setting this to 6 will work just
 fine.
  The capacity / fairshare schedulers are not what you need here -- their
  main
  function is to ensure that multiple jobs (separate sets of tasks) can all
  make progress simultaneously by sharing cluster resources across jobs
  rather
  than running jobs in a FIFO fashion.
 
  - Aaron
 
  On Sat, Apr 25, 2009 at 2:36 PM, amit handa amha...@gmail.com wrote:
 
   Hi,
  
   We are planning to use hadoop for some very expensive and long running
   processing tasks.
   The computing nodes that we plan to use are very heavy in terms of CPU
  and
   memory requirement e.g one process instance takes almost 100% CPU (1
  core)
   and around 300 -400 MB of RAM.
   The first time the process loads it can take around 1-1:30 minutes but
   after
   that we can provide the data to process and it takes few seconds to
   process.
   Can I model it on hadoop ?
   Can I have my processes pre-loaded on the task processing machines and
  the
   data be provided by hadoop? This will save the 1-1:30 minutes of intial
   load
   time that it would otherwise take for each task.
   I want to run a number of these processes in parallel  based on the
   machines
   capacity (e.g 6 instances on a 8 cpu box) or using capacity scheduler.
  
   Please let me know if this is possible or any pointers to how it can be
   done
   ?
  
   Thanks,
   Amit
  
 



Re: When will hadoop 0.19.2 be released?

2009-04-24 Thread Aaron Kimball
In general, there is no way to do an automated downgrade of HDFS metadata :\
If you're up an HDFS version, I'm afraid you're stuck there. The only real
way to downgrade requires that you have enough free space to distcp from one
cluster to the other. If you have 100 TB of free space (!!) then that's
easy, if time-consuming. If not, then you'll have to be a bit more clever.
e.g., by downreplicating all the files first and storing them in
downreplicated fashion on the destination hdfs instance until you're done,
then upping the replication on the destination cluster after the source
cluster has been drained.

Waiting for 0.19.2 might be the better call here.

- Aaron

On Fri, Apr 24, 2009 at 3:12 PM, Zhou, Yunqing azure...@gmail.com wrote:

 But there are already 100TB data stored on DFS.
 Is there a safe solution to do such a downgrade?

 On Fri, Apr 24, 2009 at 2:08 PM, jason hadoop jason.had...@gmail.com
 wrote:
  You could try the cloudera release based on 18.3, with many backported
  features.
  http://www.cloudera.com/distribution
 
  On Thu, Apr 23, 2009 at 11:06 PM, Zhou, Yunqing azure...@gmail.com
 wrote:
 
  currently I'm managing a 64-nodes hadoop 0.19.1 cluster with 100TB data.
  and I found 0.19.1 is buggy and I have already applied some patches on
  hadoop jira to solve problems.
  But I'm looking forward to a more stable release of hadoop.
  Do you know when will 0.19.2 be released?
 
  Thanks.
 
 
 
 
  --
  Alpha Chapters of my book on Hadoop are available
  http://www.apress.com/book/view/9781430219422
 



Re: which is better Text or Custom Class

2009-04-23 Thread Aaron Kimball
In general, serializing to text and then parsing back into a different
format will always be slower than using a purpose-built class that can
serialize itself. The tradeoff, of course, is that going to text is often
more convenient from a developer-time perspective.
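
For instance, here is a minimal, untested sketch of a Writable that carries
a list of document ids directly (it assumes numeric ids; if yours are
strings, store them with writeUTF/readUTF or Text instead):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Holds a list of document ids without round-tripping through a
// space-delimited string.
public class DocIdList implements Writable {
  private long[] ids = new long[0];

  public DocIdList() {}
  public DocIdList(long[] ids) { this.ids = ids; }

  public long[] get() { return ids; }

  public void write(DataOutput out) throws IOException {
    out.writeInt(ids.length);
    for (long id : ids) {
      out.writeLong(id);
    }
  }

  public void readFields(DataInput in) throws IOException {
    int n = in.readInt();
    ids = new long[n];
    for (int i = 0; i < n; i++) {
      ids[i] = in.readLong();
    }
  }
}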

- Aaron

On Mon, Apr 20, 2009 at 2:23 PM, chintan bhatt chin1...@hotmail.com wrote:


 Hi all,
 I want to ask you about the performance difference between using the Text
 class and using a custom Class which implements  Writable interface.

 Lets say in InvertedIndex problem when I emit token and a list of document
 Ids which contains it  , using Text we usually Concat the list of document
 ids with space as a separator  d1 d2 d3 d4 etc..If I need the same values
 in a later step of map reduce, I need to split the value string to get the
 list of all document Ids. Is it not better to use Writable List instead??

 I need to ask it because I am using too many Concats and Splits in my
 project to use documents total tokens count, token frequency in a particular
 document etc..


 Thanks in advance,
 Chintan




Re: Are SequenceFiles split? If so, how?

2009-04-23 Thread Aaron Kimball
Explicitly controlling your splits will be very challenging. Taking the case
where you have expensive (X) and cheap (C) objects to process, you may have
a file where the records are lined up X C X C X C X X X X X C C C. In this
case, you'll need to scan through the whole file and build splits such that
the lengthy run of expensive objects is broken up into separate splits, but
the run of cheap objects is consolidated. I'm skeptical that you can do
this without scanning through the data (which is often what constitutes the
bulk of the time in a MapReduce program).

But how much data are you using? I would imagine that if you're operating at
the scale where Hadoop makes sense, then the high- and low-cost objects will
-- on average -- balance out and tasks will be roughly evenly proportioned.

In general, I would just dodge the problem by making your splits relatively
small compared to the size of your input data. If you have 5 million objects
to process, then make each split roughly equal to, say, 20,000 of them. Then
even if some splits take a long time to process and others take a short
time, one CPU may dispatch a dozen cheap splits in the same time that one
unlucky JVM takes to process a single very expensive split. Now you haven't
had to manually balance anything, and you still get to keep all your CPUs
full.
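
As a sketch of one low-effort way to aim for that (MyJob is a placeholder,
and the record counts are whatever you measure for your data):

// SequenceFileInputFormat treats the requested number of map tasks as a
// hint: the goal split size is roughly totalInputBytes / numMapTasks
// (bounded below by mapred.min.split.size and the file's sync points), so
// asking for more map tasks yields smaller splits.
JobConf conf = new JobConf(MyJob.class);
long totalRecords = 5000000L;       // however you count your objects
long recordsPerSplit = 20000L;
conf.setNumMapTasks((int) (totalRecords / recordsPerSplit));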

- Aaron


On Mon, Apr 20, 2009 at 11:25 PM, Barnet Wagman b.wag...@comcast.netwrote:

 Thanks Aaron, that really helps.  I probably do need to control the number
 of splits.  My input 'data' consists of  Java objects and their size (in
 bytes) doesn't necessarily reflect the amount of time needed for each map
 operation.   I need to ensure that I have enough map tasks so that all cpus
 are utilized and the job gets done in a reasonable amount of time.
  (Currently I'm creating multiple input files and making them unsplitable,
 but subclassing SequenceFileInputFormat to explicitly control then number of
 splits sounds like a better approach).

 Barnet


 Aaron Kimball wrote:

 Yes, there can be more than one InputSplit per SequenceFile. The file will
 be split more-or-less along 64 MB boundaries. (the actual edges of the
 splits will be adjusted to hit the next block of key-value pairs, so it
 might be a few kilobytes off.)

 The SequenceFileInputFormat regards mapred.map.tasks
 (conf.setNumMapTasks())
 as a hint, not a set-in-stone metric. (The number of reduce tasks, though,
 is always 100% user-controlled.) If you need exact control over the number
 of map tasks, you'll need to subclass it and modify this behavior. That
 having been said -- are you sure you actually need to precisely control
 this
 value? Or is it enough to know how many splits were created?

 - Aaron

 On Sun, Apr 19, 2009 at 7:23 PM, Barnet Wagman b.wag...@comcast.net
 wrote:



 Suppose a SequenceFile (containing keys and values that are
 BytesWritable)
 is used as input. Will it be divided into InputSplits?  If so, what's the
 criteria use for splitting?

 I'm interested in this because I need to control the number of map tasks
 used, which (if I understand it correctly), is equal to the number of
 InputSplits.

 thanks,

 bw










Re: Error reading task output

2009-04-16 Thread Aaron Kimball
Cam,

This isn't Hadoop-specific, it's how Linux treats its network configuration.
If you look at /etc/host.conf, you'll probably see a line that says "order
hosts, bind" -- this is telling Linux's DNS resolution library to first read
your /etc/hosts file, then check an external DNS server.

You could probably disable local hostfile checking, but that means that
every time a program on your system queries the authoritative hostname for
localhost, it'll go out to the network. You'll probably see a big
performance hit. The better solution, I think, is to get your nodes'
/etc/hosts files squared away. You only need to do so once :)


-- Aaron


On Thu, Apr 16, 2009 at 11:31 AM, Cam Macdonell c...@cs.ualberta.ca wrote:

 Cam Macdonell wrote:


 Hi,

 I'm getting the following warning when running the simple wordcount and
 grep examples.

 09/04/15 16:54:16 INFO mapred.JobClient: Task Id :
 attempt_200904151649_0001_m_19_0, Status : FAILED
 Too many fetch-failures
 09/04/15 16:54:16 WARN mapred.JobClient: Error reading task
 outputhttp://localhost.localdomain:50060/tasklog?plaintext=truetaskid=attempt_200904151649_0001_m_19_0filter=stdout

 09/04/15 16:54:16 WARN mapred.JobClient: Error reading task
 outputhttp://localhost.localdomain:50060/tasklog?plaintext=truetaskid=attempt_200904151649_0001_m_19_0filter=stderr


 The only advice I could find from other posts with similar errors is to
 setup /etc/hosts with all slaves and the host IPs.  I did this, but I still
 get the warning above.  The output seems to come out alright however (I
 guess that's why it is a warning).

 I tried running a wget on the http:// address in the warning message and
 I get the following back

 2009-04-15 16:53:46 ERROR 400: Argument taskid is required.

 So perhaps the wrong task ID is being passed to the http request.  Any
 ideas on what can get rid of these warnings?

 Thanks,
 Cam


 Well, for future googlers, I'll answer my own post.  Watch out for the
 hostname at the end of localhost lines on slaves.  One of my slaves was
 registering itself as localhost.localdomain with the jobtracker.

 Is there a way that Hadoop could be made to not be so dependent on
 /etc/hosts, but on more dynamic hostname resolution?

 Cam



Re: More Replication on dfs

2009-04-16 Thread Aaron Kimball
That setting will instruct future file writes to replicate two-fold. This
has no bearing on existing files; replication can be set on a per-file
basis, so they already have their replication factors set in the DFS
individually.

Use the command: bin/hadoop fs -setrep [-R] repl_factor filename...

to change the replication factor for files already in HDFS
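
If you'd rather do it from Java, FileSystem.setReplication is the equivalent
call; a small untested sketch, using one of the paths from your fsck output
as the example file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetRep {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Lower the replication target of one existing file to 2; the namenode
    // then schedules the surplus block copies for deletion.
    fs.setReplication(new Path("/user/HadoopAdmin/input/file01.txt"),
        (short) 2);
  }
}
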
- Aaron

On Wed, Apr 15, 2009 at 10:04 PM, Puri, Aseem aseem.p...@honeywell.comwrote:

 Hi
My problem is not that my data is under replicated. I have 3
 data nodes. In my hadoop-site.xml I also set the configuration as:

  property
  namedfs.replication/name
  value2/value
  /property

 But after this also data is replicated on 3 nodes instead of two nodes.

 Now, please tell what can be the problem?

 Thanks  Regards
 Aseem Puri

 -Original Message-
 From: Raghu Angadi [mailto:rang...@yahoo-inc.com]
 Sent: Wednesday, April 15, 2009 2:58 AM
 To: core-user@hadoop.apache.org
 Subject: Re: More Replication on dfs

 Aseem,

 Regd over-replication, it is mostly app related issue as Alex mentioned.

 But if you are concerned about under-replicated blocks in fsck output :

 These blocks should not stay under-replicated if you have enough nodes
 and enough space on them (check NameNode webui).

 Try grep-ing for one of the blocks in NameNode log (and datnode logs as
 well, since you have just 3 nodes).

 Raghu.

 Puri, Aseem wrote:
  Alex,
 
  Ouput of $ bin/hadoop fsck / command after running HBase data insert
  command in a table is:
 
  .
  .
  .
  .
  .
  /hbase/test/903188508/tags/info/4897652949308499876:  Under replicated
  blk_-5193
  695109439554521_3133. Target Replicas is 3 but found 1 replica(s).
  .
  /hbase/test/903188508/tags/mapfiles/4897652949308499876/data:  Under
  replicated
  blk_-1213602857020415242_3132. Target Replicas is 3 but found 1
  replica(s).
  .
  /hbase/test/903188508/tags/mapfiles/4897652949308499876/index:  Under
  replicated
   blk_3934493034551838567_3132. Target Replicas is 3 but found 1
  replica(s).
  .
  /user/HadoopAdmin/hbase table.doc:  Under replicated
  blk_4339521803948458144_103
  1. Target Replicas is 3 but found 2 replica(s).
  .
  /user/HadoopAdmin/input/bin.doc:  Under replicated
  blk_-3661765932004150973_1030
  . Target Replicas is 3 but found 2 replica(s).
  .
  /user/HadoopAdmin/input/file01.txt:  Under replicated
  blk_2744169131466786624_10
  01. Target Replicas is 3 but found 2 replica(s).
  .
  /user/HadoopAdmin/input/file02.txt:  Under replicated
  blk_2021956984317789924_10
  02. Target Replicas is 3 but found 2 replica(s).
  .
  /user/HadoopAdmin/input/test.txt:  Under replicated
  blk_-3062256167060082648_100
  4. Target Replicas is 3 but found 2 replica(s).
  ...
  /user/HadoopAdmin/output/part-0:  Under replicated
  blk_8908973033976428484_1
  010. Target Replicas is 3 but found 2 replica(s).
  Status: HEALTHY
   Total size:48510226 B
   Total dirs:492
   Total files:   439 (Files currently being written: 2)
   Total blocks (validated):  401 (avg. block size 120973 B) (Total
  open file
  blocks (not validated): 2)
   Minimally replicated blocks:   401 (100.0 %)
   Over-replicated blocks:0 (0.0 %)
   Under-replicated blocks:   399 (99.50124 %)
   Mis-replicated blocks: 0 (0.0 %)
   Default replication factor:2
   Average block replication: 1.3117207
   Corrupt blocks:0
   Missing replicas:  675 (128.327 %)
   Number of data-nodes:  2
   Number of racks:   1
 
 
  The filesystem under path '/' is HEALTHY
  Please tell what is wrong.
 
  Aseem
 
  -Original Message-
  From: Alex Loddengaard [mailto:a...@cloudera.com]
  Sent: Friday, April 10, 2009 11:04 PM
  To: core-user@hadoop.apache.org
  Subject: Re: More Replication on dfs
 
  Aseem,
 
  How are you verifying that blocks are not being replicated?  Have you
  ran
  fsck?  *bin/hadoop fsck /*
 
  I'd be surprised if replication really wasn't happening.  Can you run
  fsck
  and pay attention to Under-replicated blocks and Mis-replicated
  blocks?
  In fact, can you just copy-paste the output of fsck?
 
  Alex
 
  On Thu, Apr 9, 2009 at 11:23 PM, Puri, Aseem
  aseem.p...@honeywell.comwrote:
 
  Hi
 I also tried the command $ bin/hadoop balancer. But still the
  same problem.
 
  Aseem
 
  -Original Message-
  From: Puri, Aseem [mailto:aseem.p...@honeywell.com]
  Sent: Friday, April 10, 2009 11:18 AM
  To: core-user@hadoop.apache.org
  Subject: RE: More Replication on dfs
 
  Hi Alex,
 
 Thanks for sharing your knowledge. Till now I have three
  machines and I have to check the behavior of Hadoop so I want
  replication factor should be 2. I started my Hadoop server with
  replication factor 3. After that I upload 3 files to implement word
  count program. But as my all files are stored on one machine and
  replicated to other datanodes also, so my map reduce program takes
  input
  from one Datanode 

Re: Using 3rd party Api in Map class

2009-04-15 Thread Aaron Kimball
That certainly works, though if you plan to upgrade the underlying library,
you'll find that copying files with the correct versions into
$HADOOP_HOME/lib rapidly gets tedious, and subtle mistakes (e.g., forgetting
one machine) can lead to frustration.

When you consider the fact that you're using a Hadoop cluster to process and
transfer around GBs of data on the low end, the difference between a 10 MB
and a 20 MB job jar starts to look meaningless. Putting other jars in a lib/
directory inside your job jar keeps the version consistent and doesn't
clutter up a shared directory on your cluster (assuming there are other
users).

- Aaron

On Tue, Apr 14, 2009 at 11:15 AM, Farhan Husain russ...@gmail.com wrote:

 Hello,

 I got another solution for this. I just pasted all the required jar files
 in
 lib folder of each hadoop node. In this way the job jar is not too big and
 will require less time to distribute in the cluster.

 Thanks,
 Farhan

 On Mon, Apr 13, 2009 at 7:22 PM, Nick Cen cenyo...@gmail.com wrote:

  create a directroy call 'lib' in your project's root dir, then put all
 the
  3rd party jar in it.
 
  2009/4/14 Farhan Husain russ...@gmail.com
 
   Hello,
  
   I am trying to use Pellet library for some OWL inferencing in my mapper
   class. But I can't find a way to bundle the library jar files in my job
  jar
   file. I am exporting my project as a jar file from Eclipse IDE. Will it
   work
   if I create the jar manually and include all the jar files Pellet
 library
   has? Is there any simpler way to include 3rd party library jar files in
 a
   hadoop job jar? Without being able to include the library jars I am
  getting
   ClassNotFoundException.
  
   Thanks,
   Farhan
  
 
 
 
  --
  http://daily.appspot.com/food/
 



Re: Map-Reduce Slow Down

2009-04-15 Thread Aaron Kimball
/192.168.0.18:54310. Already tried 1 time(s).
  2009-04-14 10:08:37,155 INFO org.apache.hadoop.ipc.Client: Retrying
 connect
  to server: node18/192.168.0.18:54310. Already tried 2 time(s).
 
 
  Hmmm I still cant figure it out..
 
  Mithila
 
 
  On Tue, Apr 14, 2009 at 10:22 PM, Mithila Nagendra mnage...@asu.edu
 wrote:
 
  Also, Would the way the port is accessed change if all these node are
  connected through a gateway? I mean in the hadoop-site.xml file? The
 Ubuntu
  systems we worked with earlier didnt have a gateway.
  Mithila
 
  On Tue, Apr 14, 2009 at 9:48 PM, Mithila Nagendra mnage...@asu.edu
 wrote:
 
  Aaron: Which log file do I look into - there are alot of them. Here s
  what the error looks like:
  [mith...@node19:~]$ cd hadoop
  [mith...@node19:~/hadoop]$ bin/hadoop dfs -ls
  09/04/14 10:09:29 INFO ipc.Client: Retrying connect to server: node18/
  192.168.0.18:54310. Already tried 0 time(s).
  09/04/14 10:09:30 INFO ipc.Client: Retrying connect to server: node18/
  192.168.0.18:54310. Already tried 1 time(s).
  09/04/14 10:09:31 INFO ipc.Client: Retrying connect to server: node18/
  192.168.0.18:54310. Already tried 2 time(s).
  09/04/14 10:09:32 INFO ipc.Client: Retrying connect to server: node18/
  192.168.0.18:54310. Already tried 3 time(s).
  09/04/14 10:09:33 INFO ipc.Client: Retrying connect to server: node18/
  192.168.0.18:54310. Already tried 4 time(s).
  09/04/14 10:09:34 INFO ipc.Client: Retrying connect to server: node18/
  192.168.0.18:54310. Already tried 5 time(s).
  09/04/14 10:09:35 INFO ipc.Client: Retrying connect to server: node18/
  192.168.0.18:54310. Already tried 6 time(s).
  09/04/14 10:09:36 INFO ipc.Client: Retrying connect to server: node18/
  192.168.0.18:54310. Already tried 7 time(s).
  09/04/14 10:09:37 INFO ipc.Client: Retrying connect to server: node18/
  192.168.0.18:54310. Already tried 8 time(s).
  09/04/14 10:09:38 INFO ipc.Client: Retrying connect to server: node18/
  192.168.0.18:54310. Already tried 9 time(s).
  Bad connection to FS. command aborted.
 
  Node19 is a slave and Node18 is the master.
 
  Mithila
 
 
 
  On Tue, Apr 14, 2009 at 8:53 PM, Aaron Kimball aa...@cloudera.com
 wrote:
 
  Are there any error messages in the log files on those nodes?
  - Aaron
 
  On Tue, Apr 14, 2009 at 9:03 AM, Mithila Nagendra mnage...@asu.edu
  wrote:
 
   I ve drawn a blank here! Can't figure out what s wrong with the
 ports.
  I
   can
   ssh between the nodes but cant access the DFS from the slaves - says
  Bad
   connection to DFS. Master seems to be fine.
   Mithila
  
   On Tue, Apr 14, 2009 at 4:28 AM, Mithila Nagendra mnage...@asu.edu
 
   wrote:
  
Yes I can..
   
   
On Mon, Apr 13, 2009 at 5:12 PM, Jim Twensky 
 jim.twen...@gmail.com
   wrote:
   
Can you ssh between the nodes?
   
-jim
   
On Mon, Apr 13, 2009 at 6:49 PM, Mithila Nagendra 
  mnage...@asu.edu
wrote:
   
 Thanks Aaron.
 Jim: The three clusters I setup had ubuntu running on them and
  the dfs
was
 accessed at port 54310. The new cluster which I ve setup has
 Red
  Hat
Linux
 release 7.2 (Enigma)running on it. Now when I try to access the
  dfs
   from
 one
 of the slaves i get the following response: dfs cannot be
  accessed.
   When
I
 access the DFS throught the master there s no problem. So I
 feel
  there
   a
 problem with the port. Any ideas? I did check the list of
 slaves,
  it
looks
 fine to me.

 Mithila




 On Mon, Apr 13, 2009 at 2:58 PM, Jim Twensky 
  jim.twen...@gmail.com
 wrote:

  Mithila,
 
  You said all the slaves were being utilized in the 3 node
  cluster.
Which
  application did you run to test that and what was your input
  size?
   If
you
  tried the word count application on a 516 MB input file on
 both
cluster
  setups, than some of your nodes in the 15 node cluster may
 not
  be
running
  at
  all. Generally, one map job is assigned to each input split
 and
  if
   you
 are
  running your cluster with the defaults, the splits are 64 MB
  each. I
got
  confused when you said the Namenode seemed to do all the
 work.
  Can
   you
  check
  conf/slaves and make sure you put the names of all task
  trackers
there? I
  also suggest comparing both clusters with a larger input
 size,
  say
   at
 least
  5 GB, to really see a difference.
 
  Jim
 
  On Mon, Apr 13, 2009 at 4:17 PM, Aaron Kimball 
  aa...@cloudera.com
 wrote:
 
   in hadoop-*-examples.jar, use randomwriter to generate
 the
  data
and
   sort
   to sort it.
   - Aaron
  
   On Sun, Apr 12, 2009 at 9:33 PM, Pankil Doshi 
   forpan...@gmail.com
  wrote:
  
Your data is too small I guess for 15 clusters ..So it
  might be
  overhead
time of these clusters making your total MR jobs more
 time
consuming.
I guess you

Re: How to submit a project to Hadoop/Apache

2009-04-15 Thread Aaron Kimball
Tarandeep,

You might want to start by releasing your project as a contrib module for
Hadoop. The overhead there is much easier -- just get it compiliing in the
contrib/ directory, file a JIRA ticket on Hadoop Core, and attach your patch
:)

- Aaron

On Wed, Apr 15, 2009 at 10:29 AM, Otis Gospodnetic 
otis_gospodne...@yahoo.com wrote:


 This is how things get into Apache Incubator: http://incubator.apache.org/
 But the rules are, I believe, that you can skip the incubator and go
 straight under a project's wing (e.g. Hadoop) if the project PMC approves.

  Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
  From: Tarandeep Singh tarand...@gmail.com
  To: core-user@hadoop.apache.org
  Sent: Wednesday, April 15, 2009 1:08:38 PM
  Subject: How to submit a project to Hadoop/Apache
 
  Hi,
 
  Can anyone point me to a documentation which explains how to submit a
  project to Hadoop as a subproject? Also, I will appreciate if someone
 points
  me to the documentation on how to submit a project as Apache project.
 
  We have a project that is built on Hadoop. It is released to the open
 source
  community under GPL license but we are thinking of submitting it as a
 Hadoop
  or Apache project. Any help on how to do this is appreciated.
 
  Thanks,
  Tarandeep




Re: Is combiner and map in same JVM?

2009-04-14 Thread Aaron Kimball
They're in the same JVM, and I believe in the same thread.
- Aaron

On Tue, Apr 14, 2009 at 10:25 AM, Saptarshi Guha
saptarshi.g...@gmail.comwrote:

 Hello,
 Suppose I have a Hadoop job and have set my combiner to the Reducer class.
 Does the map function and the combiner function run in the same JVM in
 different threads? or in different JVMs?
 I ask because I have to load a native library and if they are in the same
 JVM then the native library is loaded once and I have to take  precautions.

 Thank you
 Saptarshi Guha



Re: Map-Reduce Slow Down

2009-04-14 Thread Aaron Kimball
Are there any error messages in the log files on those nodes?
- Aaron

On Tue, Apr 14, 2009 at 9:03 AM, Mithila Nagendra mnage...@asu.edu wrote:

 I ve drawn a blank here! Can't figure out what s wrong with the ports. I
 can
 ssh between the nodes but cant access the DFS from the slaves - says Bad
 connection to DFS. Master seems to be fine.
 Mithila

 On Tue, Apr 14, 2009 at 4:28 AM, Mithila Nagendra mnage...@asu.edu
 wrote:

  Yes I can..
 
 
  On Mon, Apr 13, 2009 at 5:12 PM, Jim Twensky jim.twen...@gmail.com
 wrote:
 
  Can you ssh between the nodes?
 
  -jim
 
  On Mon, Apr 13, 2009 at 6:49 PM, Mithila Nagendra mnage...@asu.edu
  wrote:
 
   Thanks Aaron.
   Jim: The three clusters I setup had ubuntu running on them and the dfs
  was
   accessed at port 54310. The new cluster which I ve setup has Red Hat
  Linux
   release 7.2 (Enigma)running on it. Now when I try to access the dfs
 from
   one
   of the slaves i get the following response: dfs cannot be accessed.
 When
  I
   access the DFS throught the master there s no problem. So I feel there
 a
   problem with the port. Any ideas? I did check the list of slaves, it
  looks
   fine to me.
  
   Mithila
  
  
  
  
   On Mon, Apr 13, 2009 at 2:58 PM, Jim Twensky jim.twen...@gmail.com
   wrote:
  
Mithila,
   
You said all the slaves were being utilized in the 3 node cluster.
  Which
application did you run to test that and what was your input size?
 If
  you
tried the word count application on a 516 MB input file on both
  cluster
setups, than some of your nodes in the 15 node cluster may not be
  running
at
all. Generally, one map job is assigned to each input split and if
 you
   are
running your cluster with the defaults, the splits are 64 MB each. I
  got
confused when you said the Namenode seemed to do all the work. Can
 you
check
conf/slaves and make sure you put the names of all task trackers
  there? I
also suggest comparing both clusters with a larger input size, say
 at
   least
5 GB, to really see a difference.
   
Jim
   
On Mon, Apr 13, 2009 at 4:17 PM, Aaron Kimball aa...@cloudera.com
   wrote:
   
 in hadoop-*-examples.jar, use randomwriter to generate the data
  and
 sort
 to sort it.
 - Aaron

 On Sun, Apr 12, 2009 at 9:33 PM, Pankil Doshi 
 forpan...@gmail.com
wrote:

  Your data is too small I guess for 15 clusters ..So it might be
overhead
  time of these clusters making your total MR jobs more time
  consuming.
  I guess you will have to try with larger set of data..
 
  Pankil
  On Sun, Apr 12, 2009 at 6:54 PM, Mithila Nagendra 
  mnage...@asu.edu
  wrote:
 
   Aaron
  
   That could be the issue, my data is just 516MB - wouldn't this
  see
   a
 bit
  of
   speed up?
   Could you guide me to the example? I ll run my cluster on it
 and
   see
 what
  I
   get. Also for my program I had a java timer running to record
  the
time
   taken
   to complete execution. Does Hadoop have an inbuilt timer?
  
   Mithila
  
   On Mon, Apr 13, 2009 at 1:13 AM, Aaron Kimball 
  aa...@cloudera.com
   
  wrote:
  
Virtually none of the examples that ship with Hadoop are
  designed
to
showcase its speed. Hadoop's speedup comes from its ability
 to
 process
   very
large volumes of data (starting around, say, tens of GB per
  job,
and
   going
up in orders of magnitude from there). So if you are timing
  the
   pi
calculator (or something like that), its results won't
   necessarily
be
   very
consistent. If a job doesn't have enough fragments of data
 to
 allocate
   one
per each node, some of the nodes will also just go unused.
   
The best example for you to run is to use randomwriter to
 fill
  up
 your
cluster with several GB of random data and then run the sort
program.
  If
that doesn't scale up performance from 3 nodes to 15, then
  you've
definitely
got something strange going on.
   
- Aaron
   
   
On Sun, Apr 12, 2009 at 8:39 AM, Mithila Nagendra 
mnage...@asu.edu
wrote:
   
 Hey all
 I recently setup a three node hadoop cluster and ran an
   examples
on
  it.
It
 was pretty fast, and all the three nodes were being used
 (I
checked
  the
log
 files to make sure that the slaves are utilized).

 Now I ve setup another cluster consisting of 15 nodes. I
 ran
   the
 same
 example, but instead of speeding up, the map-reduce task
  seems
   to
  take
 forever! The slaves are not being used for some reason.
 This
second
cluster
 has a lower, per node processing power, but should that
 make
   any
 difference

Re: Map-Reduce Slow Down

2009-04-13 Thread Aaron Kimball
in hadoop-*-examples.jar, use "randomwriter" to generate the data and "sort"
to sort it.
- Aaron

On Sun, Apr 12, 2009 at 9:33 PM, Pankil Doshi forpan...@gmail.com wrote:

 Your data is too small I guess for 15 clusters ..So it might be overhead
 time of these clusters making your total MR jobs more time consuming.
 I guess you will have to try with larger set of data..

 Pankil
 On Sun, Apr 12, 2009 at 6:54 PM, Mithila Nagendra mnage...@asu.edu
 wrote:

  Aaron
 
  That could be the issue, my data is just 516MB - wouldn't this see a bit
 of
  speed up?
  Could you guide me to the example? I ll run my cluster on it and see what
 I
  get. Also for my program I had a java timer running to record the time
  taken
  to complete execution. Does Hadoop have an inbuilt timer?
 
  Mithila
 
  On Mon, Apr 13, 2009 at 1:13 AM, Aaron Kimball aa...@cloudera.com
 wrote:
 
   Virtually none of the examples that ship with Hadoop are designed to
   showcase its speed. Hadoop's speedup comes from its ability to process
  very
   large volumes of data (starting around, say, tens of GB per job, and
  going
   up in orders of magnitude from there). So if you are timing the pi
   calculator (or something like that), its results won't necessarily be
  very
   consistent. If a job doesn't have enough fragments of data to allocate
  one
   per each node, some of the nodes will also just go unused.
  
   The best example for you to run is to use randomwriter to fill up your
   cluster with several GB of random data and then run the sort program.
 If
   that doesn't scale up performance from 3 nodes to 15, then you've
   definitely
   got something strange going on.
  
   - Aaron
  
  
   On Sun, Apr 12, 2009 at 8:39 AM, Mithila Nagendra mnage...@asu.edu
   wrote:
  
Hey all
I recently setup a three node hadoop cluster and ran an examples on
 it.
   It
was pretty fast, and all the three nodes were being used (I checked
 the
   log
files to make sure that the slaves are utilized).
   
Now I ve setup another cluster consisting of 15 nodes. I ran the same
example, but instead of speeding up, the map-reduce task seems to
 take
forever! The slaves are not being used for some reason. This second
   cluster
has a lower, per node processing power, but should that make any
difference?
How can I ensure that the data is being mapped to all the nodes?
   Presently,
the only node that seems to be doing all the work is the Master node.
   
Does 15 nodes in a cluster increase the network cost? What can I do
 to
setup
the cluster to function more efficiently?
   
Thanks!
Mithila Nagendra
Arizona State University
   
  
 



Re: Changing block size of hadoop

2009-04-12 Thread Aaron Kimball
Blocks already written to HDFS will remain their current size. Blocks are
immutable objects. That procedure would set the size used for all
subsequently-written blocks. I don't think you can change the block size
while the cluster is running, because that would require the NameNode and
DataNodes to re-read their configurations, which they only do at startup.
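
If the goal is only to write new files at a different block size, it can also be
overridden per file at create time from the client, without touching the running
cluster; a rough sketch (the path, sizes, and replication below are only
illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    long blockSize = 128L * 1024 * 1024;  // 128 MB, for illustration only
    FSDataOutputStream out = fs.create(
        new Path("/tmp/example.dat"),               // hypothetical path
        true,                                       // overwrite if present
        conf.getInt("io.file.buffer.size", 4096),   // io buffer size
        (short) 3,                                  // replication factor
        blockSize);
    out.writeBytes("hello\n");
    out.close();
  }
}

Files written through that stream get 128 MB blocks; everything already in HDFS
keeps whatever block size it was written with.
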
- Aaron

On Sun, Apr 12, 2009 at 6:08 AM, Rakhi Khatwani rakhi.khatw...@gmail.comwrote:

 Hi,
  I would like to know if it is feasbile to change the blocksize of Hadoop
 while map reduce jobs are executing?  and if not would the following work?
  1.
 stop map-reduce  2. stop-hbase  3. stop hadoop  4. change hadoop-sites.xml
 to reduce the blocksize  5. restart all
  whether the data in the hbase tables will be safe and automatically split
 after changing the block size of hadoop??

 Thanks,
 Raakhi



Re: Map-Reduce Slow Down

2009-04-12 Thread Aaron Kimball
Virtually none of the examples that ship with Hadoop are designed to
showcase its speed. Hadoop's speedup comes from its ability to process very
large volumes of data (starting around, say, tens of GB per job, and going
up in orders of magnitude from there). So if you are timing the pi
calculator (or something like that), its results won't necessarily be very
consistent. If a job doesn't have enough fragments of data to allocate one
per each node, some of the nodes will also just go unused.

The best example for you to run is to use randomwriter to fill up your
cluster with several GB of random data and then run the sort program. If
that doesn't scale up performance from 3 nodes to 15, then you've definitely
got something strange going on.

- Aaron


On Sun, Apr 12, 2009 at 8:39 AM, Mithila Nagendra mnage...@asu.edu wrote:

 Hey all
 I recently setup a three node hadoop cluster and ran an examples on it. It
 was pretty fast, and all the three nodes were being used (I checked the log
 files to make sure that the slaves are utilized).

 Now I ve setup another cluster consisting of 15 nodes. I ran the same
 example, but instead of speeding up, the map-reduce task seems to take
 forever! The slaves are not being used for some reason. This second cluster
 has a lower, per node processing power, but should that make any
 difference?
 How can I ensure that the data is being mapped to all the nodes? Presently,
 the only node that seems to be doing all the work is the Master node.

 Does 15 nodes in a cluster increase the network cost? What can I do to
 setup
 the cluster to function more efficiently?

 Thanks!
 Mithila Nagendra
 Arizona State University



Re: Thin version of Hadoop jar for client

2009-04-10 Thread Aaron Kimball
Not currently, sorry.
- Aaron

On Fri, Apr 10, 2009 at 8:35 AM, Stas Oskin stas.os...@gmail.com wrote:

 Hi.

 Is there any thin version of Hadoop jar, specifically for Java client?

 Regards.



Re: More Replication on dfs

2009-04-10 Thread Aaron Kimball
Changing the default replication in hadoop-site.xml does not affect files
already loaded into HDFS. File replication factor is controlled on a
per-file basis.

You need to use the command `hadoop fs -setrep n path...` to set the
replication factor to n for a particular path already present in HDFS. It
can also take a -R for recursive.
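
If you'd rather do it from code than from the shell, the same operation is
exposed on FileSystem; a small sketch (the path below is just a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Drop the replication factor of one existing file to 2.
    boolean ok = fs.setReplication(new Path("/user/aseem/words.txt"), (short) 2);
    System.out.println("setReplication returned " + ok);
  }
}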

- Aaron

On Fri, Apr 10, 2009 at 10:34 AM, Alex Loddengaard a...@cloudera.comwrote:

 Aseem,

 How are you verifying that blocks are not being replicated?  Have you ran
 fsck?  *bin/hadoop fsck /*

 I'd be surprised if replication really wasn't happening.  Can you run fsck
 and pay attention to Under-replicated blocks and Mis-replicated blocks?
 In fact, can you just copy-paste the output of fsck?

 Alex

 On Thu, Apr 9, 2009 at 11:23 PM, Puri, Aseem aseem.p...@honeywell.com
 wrote:

 
  Hi
 I also tried the command $ bin/hadoop balancer. But still the
  same problem.
 
  Aseem
 
  -Original Message-
  From: Puri, Aseem [mailto:aseem.p...@honeywell.com]
  Sent: Friday, April 10, 2009 11:18 AM
  To: core-user@hadoop.apache.org
  Subject: RE: More Replication on dfs
 
  Hi Alex,
 
 Thanks for sharing your knowledge. Till now I have three
  machines and I have to check the behavior of Hadoop so I want
  replication factor should be 2. I started my Hadoop server with
  replication factor 3. After that I upload 3 files to implement word
  count program. But as my all files are stored on one machine and
  replicated to other datanodes also, so my map reduce program takes input
  from one Datanode only. I want my files to be on different data node so
  to check functionality of map reduce properly.
 
 Also before starting my Hadoop server again with replication
  factor 2 I formatted all Datanodes and deleted all old data manually.
 
  Please suggest what I should do now.
 
  Regards,
  Aseem Puri
 
 
  -Original Message-
  From: Mithila Nagendra [mailto:mnage...@asu.edu]
  Sent: Friday, April 10, 2009 10:56 AM
  To: core-user@hadoop.apache.org
  Subject: Re: More Replication on dfs
 
  To add to the question, how does one decide what is the optimal
  replication
  factor for a cluster. For instance what would be the appropriate
  replication
  factor for a cluster consisting of 5 nodes.
  Mithila
 
  On Fri, Apr 10, 2009 at 8:20 AM, Alex Loddengaard a...@cloudera.com
  wrote:
 
   Did you load any files when replication was set to 3?  If so, you'll
  have
   to
   rebalance:
  
  
  http://hadoop.apache.org/core/docs/r0.19.1/commands_manual.html#balance
  r
   
  
  http://hadoop.apache.org/core/docs/r0.19.1/hdfs_user_guide.html#Rebalanc
  er
   
  
   Note that most people run HDFS with a replication factor of 3.  There
  have
   been cases when clusters running with a replication of 2 discovered
  new
   bugs, because replication is so often set to 3.  That said, if you can
  do
   it, it's probably advisable to run with a replication factor of 3
  instead
   of
   2.
  
   Alex
  
   On Thu, Apr 9, 2009 at 9:56 PM, Puri, Aseem aseem.p...@honeywell.com
   wrote:
  
Hi
   
   I am a new Hadoop user. I have a small cluster with 3
Datanodes. In hadoop-site.xml values of dfs.replication property is
  2
but then also it is replicating data on 3 machines.
   
   
   
Please tell why is it happening?
   
   
   
Regards,
   
Aseem Puri
   
   
   
   
   
   
  
 



Re: Multithreaded Reducer

2009-04-10 Thread Aaron Kimball
Rather than implementing a multi-threaded reducer, why not simply increase
the number of reducer tasks per machine via
mapred.tasktracker.reduce.tasks.maximum, and increase the total number of
reduce tasks per job via mapred.reduce.tasks to ensure that they're all
filled. This will effectively utilize a higher number of cores.
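
As a trivial sketch of the per-job half of that (the count of 40 is only an
example; the tasktracker maximum itself lives in each node's hadoop-site.xml):

import org.apache.hadoop.mapred.JobConf;

public class ReduceParallelism {
  public static void configure(JobConf conf) {
    // Request enough reduce tasks to fill every reduce slot in the cluster,
    // e.g. (number of nodes) x (mapred.tasktracker.reduce.tasks.maximum).
    conf.setNumReduceTasks(40);
  }
}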

- Aaron

On Fri, Apr 10, 2009 at 11:12 AM, Sagar Naik sn...@attributor.com wrote:

 Hi,
 I would like to implement a Multi-threaded reducer.
 As per my understanding , the system does not have one coz we expect the
 output to be sorted.

 However, in my case I dont need the output sorted.

 Can u pl point to me any other issues or it would be safe to do so

 -Sagar



Re: Multithreaded Reducer

2009-04-10 Thread Aaron Kimball
At that level of parallelism, you're right that the process overhead would
be too high.
- Aaron


On Fri, Apr 10, 2009 at 11:36 AM, Sagar Naik sn...@attributor.com wrote:


 Two things
 - multi-threaded is preferred over multi-processes. The process I m
 planning is IO bound so I can really take advantage of  multi-threads (100
 threads)
 - Correct me if I m wrong. The next MR_JOB in the pipeline will have
  increased number of splits to process as the number of reducer-outputs
 (from prev job) have increased . This leads to increase
  in the map-task completion time.



 -Sagar


 Aaron Kimball wrote:

 Rather than implementing a multi-threaded reducer, why not simply increase
 the number of reducer tasks per machine via
 mapred.tasktracker.reduce.tasks.maximum, and increase the total number of
 reduce tasks per job via mapred.reduce.tasks to ensure that they're all
 filled. This will effectively utilize a higher number of cores.

 - Aaron

 On Fri, Apr 10, 2009 at 11:12 AM, Sagar Naik sn...@attributor.com
 wrote:



 Hi,
 I would like to implement a Multi-threaded reducer.
 As per my understanding , the system does not have one coz we expect the
 output to be sorted.

 However, in my case I dont need the output sorted.

 Can u pl point to me any other issues or it would be safe to do so

 -Sagar









Re: Getting free and used space

2009-04-08 Thread Aaron Kimball
You can insert this property into the JobConf, or specify it on the command
line, e.g.: -D hadoop.job.ugi=username,group,group,group.
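
Programmatically that would look roughly like the following; the user and group
names are placeholders, and this property only applies to the pre-security
(0.18/0.19-era) releases:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class UgiExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Act as a specific user and group for this client.
    conf.set("hadoop.job.ugi", "someuser,somegroup");
    FileSystem fs = FileSystem.get(conf);
    System.out.println("connected as the configured user to " + fs.getUri());
  }
}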

- Aaron

On Wed, Apr 8, 2009 at 7:04 AM, Brian Bockelman bbock...@cse.unl.eduwrote:

 Hey Stas,

 What we do locally is apply the latest patch for this issue:
 https://issues.apache.org/jira/browse/HADOOP-4368

 This makes getUsed (actually, it switches to FileSystem.getStatus) not a
 privileged action.

 As far as specifying the user ... gee, I can't think of it off the top of
 my head.  It's a variable you can insert into the JobConf, but I'd have to
 poke around google or the code to remember which one (I try to not override
 it if possible).

 Brian


 On Apr 8, 2009, at 8:51 AM, Stas Oskin wrote:

  Hi.

 Thanks for the explanation.

 Now for the easier part - how do I specify the user when connecting? :)

 Is it a config file level, or run-time level setting?

 Regards.

 2009/4/8 Brian Bockelman bbock...@cse.unl.edu

  Hey Stas,

 Did you try this as a privileged user?  There might be some permission
 errors... in most of the released versions, getUsed() is only available
 to
 the Hadoop superuser.  It may be that the exception isn't propagating
 correctly.

 Brian


 On Apr 8, 2009, at 3:13 AM, Stas Oskin wrote:

 Hi.


 I'm trying to use the API to get the overall used and free spaces.

 I tried this function getUsed(), but it always returns 0.

 Any idea?

 Thanks.







Re: How many people is using Hadoop Streaming ?

2009-04-08 Thread Aaron Kimball
Excellent. Thanks
- A

On Tue, Apr 7, 2009 at 2:16 PM, Owen O'Malley omal...@apache.org wrote:


 On Apr 7, 2009, at 11:41 AM, Aaron Kimball wrote:

  Owen,

 Is binary streaming actually readily available?


 https://issues.apache.org/jira/browse/HADOOP-1722




Re: Example of deploying jars through DistributedCache?

2009-04-08 Thread Aaron Kimball
Ooh. The other DCache-based operations assume that you're dcaching files
already resident in HDFS. I guess this assumes that the filenames are on the
local filesystem.

- Aaron

On Wed, Apr 8, 2009 at 8:32 AM, Brian MacKay brian.mac...@medecision.comwrote:


 I use addArchiveToClassPath, and it works for me.

 DistributedCache.addArchiveToClassPath(new Path(path), conf);

 I was curious about this block of code.  Why are you coping to tmp?

 FileSystem fs = FileSystem.get(conf);
 fs.copyFromLocalFile(new Path(aaronTest2.jar), new
  Path(tmp/aaronTest2.jar));

 -Original Message-
 From: Tom White [mailto:t...@cloudera.com]
 Sent: Wednesday, April 08, 2009 9:36 AM
 To: core-user@hadoop.apache.org
 Subject: Re: Example of deploying jars through DistributedCache?

 Does it work if you use addArchiveToClassPath()?

 Also, it may be more convenient to use GenericOptionsParser's -libjars
 option.

 Tom

 On Mon, Mar 2, 2009 at 7:42 AM, Aaron Kimball aa...@cloudera.com wrote:
  Hi all,
 
  I'm stumped as to how to use the distributed cache's classpath feature. I
  have a library of Java classes I'd like to distribute to jobs and use in
 my
  mapper; I figured the DCache's addFileToClassPath() method was the
 correct
  means, given the example at
 
 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html
 .
 
 
  I've boiled it down to the following non-working example:
 
  in TestDriver.java:
 
 
   private void runJob() throws IOException {
 JobConf conf = new JobConf(getConf(), TestDriver.class);
 
 // do standard job configuration.
 FileInputFormat.addInputPath(conf, new Path(input));
 FileOutputFormat.setOutputPath(conf, new Path(output));
 
 conf.setMapperClass(TestMapper.class);
 conf.setNumReduceTasks(0);
 
 // load aaronTest2.jar into the dcache; this contains the class
  ValueProvider
 FileSystem fs = FileSystem.get(conf);
 fs.copyFromLocalFile(new Path(aaronTest2.jar), new
  Path(tmp/aaronTest2.jar));
 DistributedCache.addFileToClassPath(new Path(tmp/aaronTest2.jar),
  conf);
 
 // run the job.
 JobClient.runJob(conf);
   }
 
 
   and then in TestMapper:
 
   public void map(LongWritable key, Text value,
  OutputCollectorLongWritable, Text output,
   Reporter reporter) throws IOException {
 
 try {
   ValueProvider vp = (ValueProvider)
  Class.forName(ValueProvider).newInstance();
   Text val = vp.getValue();
   output.collect(new LongWritable(1), val);
 } catch (ClassNotFoundException e) {
   throw new IOException(not found:  + e.toString()); //
 newInstance()
  throws to here.
 } catch (Exception e) {
   throw new IOException(Exception: + e.toString());
 }
   }
 
 
  The class ValueProvider is to be loaded from aaronTest2.jar. I can
 verify
  that this code works if I put ValueProvider into the main jar I deploy. I
  can verify that aaronTest2.jar makes it into the
  ${mapred.local.dir}/taskTracker/archive/
 
  But when run with ValueProvider in aaronTest2.jar, the job fails with:
 
  $ bin/hadoop jar aaronTest1.jar TestDriver
  09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to
 process
  : 10
  09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to
 process
  : 10
  09/03/01 22:36:04 INFO mapred.JobClient: Running job:
 job_200903012210_0005
  09/03/01 22:36:05 INFO mapred.JobClient:  map 0% reduce 0%
  09/03/01 22:36:14 INFO mapred.JobClient: Task Id :
  attempt_200903012210_0005_m_00_0, Status : FAILED
  java.io.IOException: not found: java.lang.ClassNotFoundException:
  ValueProvider
 at TestMapper.map(Unknown Source)
 at TestMapper.map(Unknown Source)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
 at
  org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)
 
 
  Do I need to do something else (maybe in Mapper.configure()?) to actually
  classload the jar? The documentation makes me believe it should already
 be
  in the classpath by doing only what I've done above. I'm on Hadoop
 0.18.3.
 
  Thanks,
  - Aaron
 






Re: connecting two clusters

2009-04-07 Thread Aaron Kimball
Clusters don't really have identities beyond the addresses of the NameNodes
and JobTrackers. In the example below, nn1 and nn2 are the hostnames of the
namenodes of the source and destination clusters. The 8020 in each address
assumes that they're on the default port.

Hadoop provides no inter-task or inter-job synchronization primitives, on
purpose (even within a cluster, the most you get in terms of synchronization
is the ability to join on the status of a running job to determine that
it's completed). The model is designed to be as identity-independent as
possible to make it more resilient to failure. If individual jobs/tasks
could lock common resources, then the intermittent failure of tasks could
easily cause deadlock.

Using a file as a scoreboard or other communication mechanism between
multiple jobs is not something explicitly designed for, and likely to end in
frustration. Can you describe the goal you're trying to accomplish? It's
likely that there's another, more MapReduce-y way of looking at the job and
refactoring the code to make it work more cleanly with the intended
programming model.

- Aaron

On Mon, Apr 6, 2009 at 10:08 PM, Mithila Nagendra mnage...@asu.edu wrote:

 Thanks! I was looking at the link sent by Philip. The copy is done with the
 following command:
 hadoop distcp hdfs://nn1:8020/foo/bar \
hdfs://nn2:8020/bar/foo

 I was wondering if nn1 and nn2 are the names of the clusters or the name of
 the masters on each cluster.

 I wanted map/reduce tasks running on each of the two clusters to
 communicate
 with each other. I dont know if hadoop provides for synchronization between
 two map/reduce tasks. The tasks run simultaneouly, and they need to access
 a
 common file - something like a map/reduce task at a higher level utilizing
 the data produced by the map/reduce at the lower level.

 Mithila

 On Tue, Apr 7, 2009 at 7:57 AM, Owen O'Malley omal...@apache.org wrote:

 
  On Apr 6, 2009, at 9:49 PM, Mithila Nagendra wrote:
 
   Hey all
  I'm trying to connect two separate Hadoop clusters. Is it possible to do
  so?
  I need data to be shuttled back and forth between the two clusters. Any
  suggestions?
 
 
  You should use hadoop distcp. It is a map/reduce program that copies
 data,
  typically from one cluster to another. If you have the hftp interface
  enabled, you can use that to copy between hdfs clusters that are
 different
  versions.
 
  hadoop distcp hftp://namenode1:1234/foo/bar hdfs://foo/bar
 
  -- Owen
 



Re: connecting two clusters

2009-04-07 Thread Aaron Kimball
Hi Mithila,

Unfortunately, Hadoop MapReduce jobs determine their inputs as soon as they
begin; the inputs for the job are then fixed. So additional files that
arrive in the input directory after processing has begun, etc, do not
participate in the job.

And HDFS does not currently support appends to files, so existing files
cannot be updated.

A typical way in which this sort of problem is handled is to do processing
in incremental wavefronts; process A generates some data which goes in an
incoming directory for process B; process B starts on a timer every so
often and collects the new input files and works on them. After it's done,
it moves those inputs which it processed into a done directory. In the
mean time, new files may have arrived. After another time interval, another
round of process B starts.  The major limitation of this model is that it
requires that your process work incrementally, or that you are emitting a
small enough volume of data each time in process B that subsequent
iterations can load into memory a summary table of results from previous
iterations. Look into using the DistributedCache to disseminate such files.
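
A rough sketch of the "process B" driver in that pattern; the directory names
are placeholders and the JobConf is assumed to be otherwise configured
(mapper, input/output formats, and so on):

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class IncrementalRunner {
  public static void runOneWave(JobConf conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    Path incoming = new Path("/pipeline/incoming");
    Path done = new Path("/pipeline/done");
    Path output = new Path("/pipeline/output/" + System.currentTimeMillis());

    // Fix the set of inputs for this wave before the job starts.
    FileStatus[] ready = fs.listStatus(incoming);
    if (ready == null || ready.length == 0) {
      return;  // nothing new arrived this interval
    }
    for (FileStatus f : ready) {
      FileInputFormat.addInputPath(conf, f.getPath());
    }
    FileOutputFormat.setOutputPath(conf, output);

    JobClient.runJob(conf);  // blocks until this wave completes

    // Only after success, move the processed inputs out of the way.
    for (FileStatus f : ready) {
      fs.rename(f.getPath(), new Path(done, f.getPath().getName()));
    }
  }
}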

Also, why are you using two MapReduce clusters for this, as opposed to one?
Is there a common HDFS cluster behind them?  You'll probably get much better
performance for the overall process if the output data from one job does not
need to be transferred to another cluster before it is further processed.

Does this model make sense?
- Aaron

On Tue, Apr 7, 2009 at 1:06 AM, Mithila Nagendra mnage...@asu.edu wrote:

 Aaron,
 We hope to achieve a level of pipelining between two clusters - similar to
 how pipelining is done in executing RDB queries. You can look at it as the
 producer-consumer problem, one cluster produces some data and the other
 cluster consumes it. The issue that has to be dealt with here is the data
 exchange between the clusters - synchronized interaction between the
 map-reduce jobs on the two clusters is what I m hoping to achieve.

 Mithila

 On Tue, Apr 7, 2009 at 10:10 AM, Aaron Kimball aa...@cloudera.com wrote:

  Clusters don't really have identities beyond the addresses of the
 NameNodes
  and JobTrackers. In the example below, nn1 and nn2 are the hostnames of
 the
  namenodes of the source and destination clusters. The 8020 in each
 address
  assumes that they're on the default port.
 
  Hadoop provides no inter-task or inter-job synchronization primitives, on
  purpose (even within a cluster, the most you get in terms of
  synchronization
  is the ability to join on the status of a running job to determine that
  it's completed). The model is designed to be as identity-independent as
  possible to make it more resiliant to failure. If individual jobs/tasks
  could lock common resources, then the intermittent failure of tasks could
  easily cause deadlock.
 
  Using a file as a scoreboard or other communication mechanism between
  multiple jobs is not something explicitly designed for, and likely to end
  in
  frustration. Can you describe the goal you're trying to accomplish? It's
  likely that there's another, more MapReduce-y way of looking at the job
 and
  refactoring the code to make it work more cleanly with the intended
  programming model.
 
  - Aaron
 
  On Mon, Apr 6, 2009 at 10:08 PM, Mithila Nagendra mnage...@asu.edu
  wrote:
 
   Thanks! I was looking at the link sent by Philip. The copy is done with
  the
   following command:
   hadoop distcp hdfs://nn1:8020/foo/bar \
  hdfs://nn2:8020/bar/foo
  
   I was wondering if nn1 and nn2 are the names of the clusters or the
 name
  of
   the masters on each cluster.
  
   I wanted map/reduce tasks running on each of the two clusters to
   communicate
   with each other. I dont know if hadoop provides for synchronization
  between
   two map/reduce tasks. The tasks run simultaneouly, and they need to
  access
   a
   common file - something like a map/reduce task at a higher level
  utilizing
   the data produced by the map/reduce at the lower level.
  
   Mithila
  
   On Tue, Apr 7, 2009 at 7:57 AM, Owen O'Malley omal...@apache.org
  wrote:
  
   
On Apr 6, 2009, at 9:49 PM, Mithila Nagendra wrote:
   
 Hey all
I'm trying to connect two separate Hadoop clusters. Is it possible
 to
  do
so?
I need data to be shuttled back and forth between the two clusters.
  Any
suggestions?
   
   
You should use hadoop distcp. It is a map/reduce program that copies
   data,
typically from one cluster to another. If you have the hftp interface
enabled, you can use that to copy between hdfs clusters that are
   different
versions.
   
hadoop distcp hftp://namenode1:1234/foo/bar hdfs://foo/bar
   
-- Owen
   
  
 



Re: Why namenode logs into itself as well as job tracker?

2009-04-07 Thread Aaron Kimball
If you're using the start-dfs.sh script, it's ssh'ing into every machine in
conf/masters to start a SecondaryNameNode instance. The NameNode itself is
started locally. It then ssh's into every machine in conf/slaves to start a
DataNode instance.

Similarly, the JT is started locally by start-mapred.sh. But it ssh's into
every machine in conf/slaves to start a TaskTracker.

- Aaron

On Sat, Apr 4, 2009 at 6:36 AM, Foss User foss...@gmail.com wrote:

 I have a namenode and job tracker on two different machines.

 I see that a namenode tries to do an ssh log into itself (name node),
 job tracker as well as all slave machines.

 However, the job tracker tries to do an ssh log into the slave
 machines only. Why this difference in behavior? Could someone please
 explain what is going on behind the scenes?



Re: Question on distribution of classes and jobs

2009-04-07 Thread Aaron Kimball
On Fri, Apr 3, 2009 at 11:39 PM, Foss User foss...@gmail.com wrote:

 If I have written a WordCount.java job in this manner:

conf.setMapperClass(Map.class);
conf.setCombinerClass(Combine.class);
conf.setReducerClass(Reduce.class);

 So, you can see that three classes are being used here.  I have
 packaged these classes into a jar file called wc.jar and I run it like
 this:

 $ bin/hadoop jar wc.jar WordCountJob

 1) I want to know when the job runs in a 5 machine cluster, is the
 whole JAR file distributed across the 5 machines or the individual
 class files are distributed individually?


The whole jar.



 2) Also, let us say the number of reducers are 2 while the number of
 mappers are 5. What happens in this case? How are the class files or
 jar files distributed?


It's uploaded into HDFS; specifically into a subdirectory of wherever you
configured mapred.system.dir.



 3) Are they distributed via RPC or HTTP?


The client uses the HDFS protocol to inject its jar file into HDFS. Then all
the TaskTrackers retrieve it with the same protocol.


Re: Running MapReduce without setJar

2009-04-07 Thread Aaron Kimball
All the nodes in your Hadoop cluster need access to the class files for your
MapReduce job. The current mechanism that Hadoop has to distribute classes
and attach them to the classpath assumes they're in a JAR together. Thus,
merely specifying the names of mapper/reducer classes with setMapperClass(),
etc, isn't enough -- you need to actually deliver a jar containing those
classes to all your nodes. Since the mapper and reducer classes are separate
classes, you'd need to bundle those .class files together somehow. JAR is
the standard way to do this, so that's what Hadoop supports.

If you're running in fully-local mode (e.g., with
jobConf.set("mapred.job.tracker", "local")), then no jar is needed since
it's all running inside the original process space.
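
For completeness, a small sketch of both options; the jar path and the
commented-out settings are placeholders:

import org.apache.hadoop.mapred.JobConf;

public class JarSetupExample {
  public static JobConf makeConf() {
    // Let Hadoop locate the jar that contains this class on the classpath...
    JobConf conf = new JobConf(JarSetupExample.class);
    // ...or name a jar file explicitly:
    // conf.setJar("/path/to/myjob.jar");

    // Alternatively, run everything inside this process; no jar is needed:
    // conf.set("mapred.job.tracker", "local");
    return conf;
  }
}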

- Aaron


On Thu, Apr 2, 2009 at 1:13 PM, Farhan Husain russ...@gmail.com wrote:

 I did all of them i.e. I used setMapClass, setReduceClass and new
 JobConf(MapReduceWork.class) but still it cannot run the job without a jar
 file. I understand the reason that it looks for those classes inside a jar
 but I think there should be some better way to find those classes without
 using a jar. But I am not sure whether it is possible at all.

 On Thu, Apr 2, 2009 at 2:56 PM, Rasit OZDAS rasitoz...@gmail.com wrote:

  You can point to them by using
  conf.setMapClass(..) and conf.setReduceClass(..)  - or something
  similar, I don't have the source nearby.
 
  But something weird has happened to my code. It runs locally when I
  start it as java process (tries to find input path locally). I'm now
  using trunk, maybe something has changed with new version. With
  version 0.19 it was fine.
  Can somebody point out a clue?
 
  Rasit
 



 --
 Mohammad Farhan Husain
 Research Assistant
 Department of Computer Science
 Erik Jonsson School of Engineering and Computer Science
 University of Texas at Dallas



Re: Handling Non Homogenous tasks via Hadoop

2009-04-07 Thread Aaron Kimball
Amit,

The mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum properties can be controlled on a
per-host basis in their hadoop-site.xml files. With this you can configure
nodes with more/fewer cores/RAM/etc to take on varying amounts of work.

There's no current mechanism to provide feedback to the task scheduler,
though, based on actual machine utilization in real time.

- Aaron


On Tue, Apr 7, 2009 at 7:54 AM, amit handa amha...@gmail.com wrote:

 Hi,

 Is there a way I can control number of tasks that can be spawned on a
 machine based on the machine capacity and how loaded the machine already is
 ?

 My use case is as following:

 I have to perform task 1,task2,task3 ...task n .
 These tasks have varied CPU and memory usage patterns.
 All tasks of type task 1,task3 can take 80-90%CPU and 800 MB of RAM.
 All type of tasks task2 take only 1-2% of CPU and 5-10 MB of RAM

 How do i model this using Hadoop ? Can i use only one cluster for running
 all these type of tasks ?
 Shall I use different hadoop clusters for each tasktype , if yes, then how
 do i share data between these tasks (the data can be few MB to few GB)

 Please suggest or point to any docs which i can dig up.

 Thanks,
 Amit



Re: Too many fetch errors

2009-04-07 Thread Aaron Kimball
Xiaolin,

Are you certain that the two nodes can fetch mapper outputs from one
another? If it's taking that long to complete, it might be the case that
what makes it complete is just that eventually it abandons one of your two
nodes and runs everything on a single node where it succeeds -- defeating
the point, of course.

Might there be a firewall between the two nodes that blocks the port used by
the reducer to fetch the mapper outputs? (I think this is on 50060 by
default.)

- Aaron

On Tue, Apr 7, 2009 at 8:08 AM, xiaolin guo xiao...@hulu.com wrote:

 This simple map-recude application will take nearly 1 hour to finish
 running
 on the two-node cluster ,due to lots of Failed/Killed task attempts, while
 in the single node cluster this application only takes 1 minite ... I am
 quite confusing why there are so many Failed/Killed attempts ..

 On Tue, Apr 7, 2009 at 10:40 PM, xiaolin guo xiao...@hulu.com wrote:

  I am trying to setup a small hadoop cluster , everything was ok before I
  moved from single node cluster to two-node cluster. I followed the
 article
 
  http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_(Multi-Node_Cluster)
  to config master and slaves. However, when I tried to run the example
  wordcount map-reduce application , the reduce task got stuck in 19% for a
  log time . Then I got a notice:INFO mapred.JobClient: TaskId :
  attempt_200904072219_0001_m_02_0, Status : FAILED too many fetch
  errors  and an error message : Error reading task outputslave.
 
  All map tasks in both task nodes had been finished which could be
 verified
  in task tracker pages.
 
  Both nodes work well in single node mode . And the Hadoop file system
 seems
  to be healthy in multi-node mode.
 
  Can anyone help me with this issue?  Have already got entangled in this
  issue for a long time ...
 
  Thanks very much!
 



Re: How many people is using Hadoop Streaming ?

2009-04-07 Thread Aaron Kimball
Owen,

Is binary streaming actually readily available? Looking at
http://issues.apache.org/jira/browse/HADOOP-3227, it appears uncommitted.

- Aaron


On Fri, Apr 3, 2009 at 8:37 PM, Tim Wintle tim.win...@teamrubber.comwrote:

 On Fri, 2009-04-03 at 09:42 -0700, Ricky Ho wrote:
1) I can pick the language that offers a different programming
  paradigm (e.g. I may choose functional language, or logic programming
  if they suit the problem better).  In fact, I can even chosen Erlang
  at the map() and Prolog at the reduce().  Mix and match can optimize
  me more.

 Agreed (as someone who has written mappers/reducers in Python, perl,
 shell script and Scheme before).




Re: skip setting output path for a sequential MR job..

2009-03-31 Thread Aaron Kimball
You must remove the existing output directory before running the job. This
check is put in to prevent you from inadvertently destroying or muddling
your existing output data.

You can remove the output directory in advance programmatically with code
similar to:

FileSystem fs = FileSystem.get(conf); // use your JobConf here
fs.delete(new Path("/path/to/output/dir"), true);

See
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/FileSystem.html
for more details.

- Aaron


On Mon, Mar 30, 2009 at 9:25 PM, some speed speed.s...@gmail.com wrote:

 Hello everyone,

 Is it necessary to redirect the ouput of reduce to a file? When I am trying
 to run the same M-R job more than once, it throws an error that the output
 file already exists. I dont want to use command line args so I hard coded
 the file name into the program.

 So, Is there a way , I could delete a file on HDFS programatically?
 or can i skip setting a output file path n just have my output print to
 console?
 or can I just append to an existing file?


 Any help is appreciated. Thanks.

 -Sharath



Re: a doubt regarding an appropriate file system

2009-03-30 Thread Aaron Kimball
The short version is that the in-memory structures used by the NameNode are
heavy on a per-file basis, and light on a per-block basis. So petabytes of
files that are only a few hundred KB will require the NameNode to have a
huge amount of memory to hold the filesystem data structures. More than you
want to pay to fit in a single server.

There's more details about this issue in an older post on our blog:
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/

- Aaron


On Mon, Mar 30, 2009 at 12:46 AM, deepya m_dee...@yahoo.co.in wrote:


 Hi,
  Thanks.

   Can you please specify in detail what kind of problems I will face if I
 use Hadoop for this project.

 SreeDeepya


 TimRobertson100 wrote:
 
  I believe Hadoop is not best suited to many small files like yours but
  is really geared to handling very large files that get split into many
  smaller files (like 128M chunks) and HDFS is designed with this in
  mind.  Therefore I could *imagine* that there are other distributed
  file systems that would far outperform HDFS if they were designed to
  replicate and track small files without any *split* and *merging*
  which Hadoop provides.
 
  Having not used MogileFS I cant really advise well but a quick read
  through does look like it might be a candidate for you to consider -
  it looks like it distributes across machines and tracks replicas like
  HDFS without the splitting, and offers access through http to the
  individual files which I could imagine would be ideal for pulling back
  small images.
 
  Please don't just follow my advise though - I am still a relative
  newbie to DFS's in general.
 
  Cheers
 
  Tim
 
 
 
  On Sun, Mar 29, 2009 at 12:51 PM, deepya m_dee...@yahoo.co.in wrote:
 
  Hi,
   I am doing a project scalable storage server to store images.Can Hadoop
  efficiently support this purpose???Our image size will be around 250 to
  300
  KB each.But we have many such images.Like the total storage may run upto
  petabytes( in future) .At present it is in gigabytes.
We want to access these images via apache server.I mean,is there any
  mechanism that we can directly talk to hdfs via apache server???
 
  I went through one of the posts here and got to know that rather than
  using
  FUSE it is better to use HDFS API.That is fine.But they also mentioned
  that
  mozilefs will be more appropriate.
 
  Can some one please clarify why mozilefs is more appropriate.Cant hadoop
  be
  used???How is mozile more advantageous.Can you suggest which filesystem
  would be more appropriate for the project I am doing at present.
 
  Thanks in advance
 
 
  SreeDeepya
  --
  View this message in context:
 
 http://www.nabble.com/a-doubt-regarding-an-appropriate-file-system-tp22766331p22766331.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 
 
 

 --
 View this message in context:
 http://www.nabble.com/a-doubt-regarding-an-appropriate-file-system-tp22766331p22777879.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: corrupt unreplicated block in dfs (0.18.3)

2009-03-26 Thread Aaron Kimball
Just because a block is corrupt doesn't mean the entire file is corrupt.
Furthermore, the presence or absence of a file in the namespace is a completely
separate issue from the data in the file. I think it would be a surprising
interface change if files suddenly disappeared just because one out of
potentially many blocks was corrupt.

- Aaron

On Thu, Mar 26, 2009 at 1:21 PM, Mike Andrews m...@xoba.com wrote:

 i noticed that when a file with no replication (i.e., replication=1)
 develops a corrupt block, hadoop takes no action aside from the
 datanode throwing an exception to the client trying to read the file.
 i manually corrupted a block in order to observe this.

 obviously, with replication=1 its impossible to fix the block, but i
 thought perhaps hadoop would take some other action, such as deleting
 the file outright, or moving it to a corrupt directory, or marking
 it or keeping track of it somehow to note that there's un-fixable
 corruption in the filesystem? thus, the current behaviour seems to
 sweep the corruption under the rug and allows its continued existence,
 aside from notifying the specific client doing the read with an
 exception.

 if anyone has any information about this issue or how to work around
 it, please let me know.

 on the other hand, i tested that corrupting a block in a replication=3
 file causes hadoop to re-replicate the block from another existing
 copy, which is good and is i what i expected.

 best,
 mike


 --
 permanent contact information at http://mikerandrews.com



Re: comma delimited files

2009-03-26 Thread Aaron Kimball
Sure. Put all your comma-delimited data into the output key as a Text
object, and set the output value to the empty string. It'll dump the output
key, as text, to the reducer output files.
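
A minimal sketch of that idea as an old-API reducer; the record layout is
invented, and it uses NullWritable for the value (an empty Text also works, as
above, at the cost of a trailing tab from TextOutputFormat):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class CsvReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, NullWritable> {

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, NullWritable> output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      // Pack the whole record into the key as one comma-delimited line.
      Text line = new Text(key.toString() + "," + values.next().toString());
      output.collect(line, NullWritable.get());
    }
  }
}

The job would also need conf.setOutputValueClass(NullWritable.class) to match.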

- Aaron

On Thu, Mar 26, 2009 at 4:14 PM, nga pham nga.p...@gmail.com wrote:

 Hi all,


 Can Hadoop export into comma delimited files?


 Thank you,

 Nga P.



Re: Using HDFS to serve www requests

2009-03-26 Thread Aaron Kimball
In general, Hadoop is unsuitable for the application you're suggesting.
Systems like Fuse HDFS do exist, though they're not widely used. I don't
know of anyone trying to connect Hadoop with Apache httpd.

When you say that you have huge images, how big is huge? It might be
useful if these images are 1 GB or larger. But in general, huge on Hadoop
means 10s of GBs up to TBs.  If you have a large number of moderately-sized
files, you'll find that HDFS responds very poorly for your needs.

It sounds like glusterfs is designed more for your needs.

- Aaron

On Thu, Mar 26, 2009 at 4:06 PM, phil cryer p...@cryer.us wrote:

 This is somewhat of a noob question I know, but after learning about
 Hadoop, testing it in a small cluster and running Map Reduce jobs on
 it, I'm still not sure if Hadoop is the right distributed file system
 to serve web requests.  In other words, can, or is it right to, serve
 Images and data from HDFS using something like FUSE to mount a
 filesystem where Apache could serve images from it?  We have huge
 images, thus the need for a distributed file system, and they go in,
 get stored with lots of metadata, and are redundant with Hadoop/HDFS -
 but is it the right way to serve web content?

 I looked at glusterfs before, they had an Apache and Lighttpd module
 which made it simple, does HDFS have something like this, do people
 just use a FUSE option as I described, or is this not a good use of
 Hadoop?

 Thanks

 P



Re: Sorting data numerically

2009-03-23 Thread Aaron Kimball
Simplest possible solution: zero-pad your keys to ten places?
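
In whatever language your streaming mapper is written, the idea is just
fixed-width formatting of the key; a Java illustration (assumes non-negative
keys that fit in ten digits):

public class ZeroPad {
  public static void main(String[] args) {
    long key = 12345;
    // Pad to ten digits so that ASCII sort order matches numeric order.
    String padded = String.format("%010d", key);
    System.out.println(padded);  // prints 0000012345
  }
}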

- Aaron

On Sat, Mar 21, 2009 at 11:40 PM, Akira Kitada akit...@gmail.com wrote:

 Hi,

 By default Hadoop does ASCII sort the mapper's output, not numeric sort.
 However, I often want the framework to sort
 records in numeric order.
 Can I make the framework to do numeric sort?
 (I use Hadoop Streaming)

 Thanks,

 Akira



Re: Reduce task going away for 10 seconds at a time

2009-03-16 Thread Aaron Kimball
If you jstack the process in the middle of one of these pauses, can you see
where it's sticking?
- Aaron

On Fri, Mar 13, 2009 at 6:51 AM, Doug Cook nab...@candiru.com wrote:


 Hi folks,

 I've been debugging a severe performance problems with a Hadoop-based
 application (a highly modified version of Nutch). I've recently upgraded to
 Hadoop 0.19.1 from a much, much older version, and a reduce that used to
 work just fine is now running orders of magnitude more slowly.

 From the logs I can see that progress of my reduce stops for periods that
 average almost exactly 10 seconds (with a very narrow distribution around
 10
 seconds), and it does so in various places in my code, but more or less in
 proportion to how much time I'd expect the task would normally spend in
 that
 particular place in the code, i.e. the behavior seems like my code is
 randomly being interrupted for 10 seconds at a time.

 I'm planning to keep digging, but thought that these symptoms might sound
 familiar to someone on this list. Ring any bells? Your help much
 appreciated.

 Thanks!

 Doug Cook
 --
 View this message in context:
 http://www.nabble.com/Reduce-task-going-away-for-10-seconds-at-a-time-tp22496810p22496810.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: Running 0.19.2 branch in production before release

2009-03-05 Thread Aaron Kimball
Right, there's no sense in freezing your Hadoop version forever :)

But if you're an ops team tasked with keeping a production cluster running
24/7, running on 0.19 (or even more daringly, TRUNK) is not something that I
would consider a Best Practice. Ideally you'll be able to carve out some
spare capacity (maybe 3--5 nodes) to use as a staging cluster that runs on
0.19 or TRUNK that you can use to evaluate the next version. Then when you
are convinced that it's stable, and your staging cluster passes your
internal tests (e.g., running test versions of your critical nightly jobs
successfully), you can move that to production.

- Aaron


On Thu, Mar 5, 2009 at 2:33 AM, Steve Loughran ste...@apache.org wrote:

 Aaron Kimball wrote:

 I recommend 0.18.3 for production use and avoid the 19 branch entirely. If
 your priority is stability, then stay a full minor version behind, not
 just
 a revision.


 Of course, if everyone stays that far behind, they don't get to find the
 bugs for other people.

 * If you play with the latest releases early, while they are in the beta
 phase -you will encounter the problems specific to your
 applications/datacentres, and get them fixed fast.

 * If you work with stuff further back you get stability, but not only are
 you behind on features, you can't be sure that all fixes that matter to
 you get pushed back.

 * If you plan on making changes, of adding features, get onto SVN_HEAD

 * If you want to catch changes being made that break your site, SVN_HEAD.
 Better yet, have a private Hudson server checking out SVN_HEAD hadoop *then*
 building and testing your app against it.

 Normally I work with stable releases of things I dont depend on, and
 SVN_HEAD of OSS stuff whose code I have any intent to change; there is a
 price -merge time, the odd change breaking your code- but you get to make
 changes that help you long term.

 Where Hadoop is different is that it is a filesystem, and you don't want to
 hit bugs that delete files that matter. I'm only bringing up transient
 clusters on VMs, pulling in data from elsewhere, so this isn't an issue. All
 that remains is changing APIs.

 -Steve



Re: Repartitioned Joins

2009-03-05 Thread Aaron Kimball
Richa,

Since the mappers run independently, you'd have a hard time
determining whether a record in mapper A would be joined with a record
in mapper B. The solution, as it were, would be to do this in two
separate MapReduce passes:

* Take an educated guess at which table is the smaller data set.
* Run a MapReduce over this dataset, building up a bloom filter for
the record ids. Set entries in the filter to 1 for each record id you
see; leave the rest as 0.
* The bloom filter now has 1 meaning maybe joinable and 0 meaning
definitely not joinable.
* Run a second MapReduce job over both datasets. Use the distributed
cache to send the filter to all mappers. Mappers emit all records
where filter[hash(record_id)] == 1 (see the sketch below).
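
For what it's worth, a stripped-down, hand-rolled sketch of such a filter (two
fixed hash seeds and arbitrary sizing; a real job would tune the size and hash
count and ship the serialized filter via the DistributedCache rather than use
this toy):

import java.util.BitSet;

public class SimpleBloomFilter {
  private final BitSet bits;
  private final int size;

  public SimpleBloomFilter(int size) {
    this.size = size;
    this.bits = new BitSet(size);
  }

  private int hash(String key, int seed) {
    int h = seed;
    for (int i = 0; i < key.length(); i++) {
      h = 31 * h + key.charAt(i);
    }
    return Math.abs(h % size);
  }

  public void add(String key) {
    bits.set(hash(key, 17));
    bits.set(hash(key, 31));
  }

  public boolean mightContain(String key) {
    return bits.get(hash(key, 17)) && bits.get(hash(key, 31));
  }
}

In the first pass you would call add() for every record id in the smaller
dataset; in the second pass each mapper loads the filter and only emits records
where mightContain() returns true.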

- Aaron

On Wed, Mar 4, 2009 at 11:18 AM, Richa Khandelwal richa...@gmail.com wrote:
 Hi All,
 Does anyone know of tweaking in map-reduce joins that will optimize it
 further in terms of the moving only those tuples to reduce phase that join
 in the two tables? There are replicated joins and semi-join strategies but
 they are more of databases than map-reduce.

 Thanks,
 Richa Khandelwal
 University Of California,
 Santa Cruz.
 Ph:425-241-7763



Re: Throw an exception if the configure method fails

2009-03-05 Thread Aaron Kimball
Try throwing RuntimeException, or any other unchecked exception (e.g., any
descendant class of RuntimeException).
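
A minimal sketch of that pattern; the property name checked here is made up for
illustration:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CheckedMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void configure(JobConf job) {
    if (job.get("my.required.option") == null) {
      // Unchecked, so it compiles even though configure() declares no
      // exceptions; the framework will fail this task attempt.
      throw new RuntimeException("missing required option: my.required.option");
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    output.collect(new Text("line"), value);
  }
}

Note that a failed attempt is normally retried (up to mapred.map.max.attempts
times) before the whole job is failed.
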
- Aaron

On Thu, Mar 5, 2009 at 4:24 PM, Saptarshi Guha saptarshi.g...@gmail.comwrote:

 hello,
 I'm not that comfortable with java, so here is my question. In the
 MapReduceBase class, i have implemented the configure method, which
 does not throw an exception. Suppose I detect an error in some
 options, i wish to raise an exception(in the configure method) - is
 there a way to do that? Is there a way to stop the job in case the
 configure method fails?,
 Saptarshi Guha

 [1] My map extends MapReduceBase and my reduce extends MapReduceBase -
 two separate classes.

 Thank you



Re: The cpu preemption between MPI and Hadoop programs on Same Cluster

2009-03-05 Thread Aaron Kimball
Song, you should be able to use 'nice' to reprioritize the MPI task
below that of your Hadoop jobs.
- Aaron

On Thu, Mar 5, 2009 at 8:26 PM, 柳松 lamfeel...@126.com wrote:

 Dear all:
 I run my Hadoop program with another MPI program on the same cluster.
 Here is the result of top.
  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 11750 qianglv   25   0  233m  99m 6100 R 99.7  2.5 116:05.59 rosetta.mpich
 18094 cip   17   0 3136m  68m  15m S  0.5  1.7   0:12.69 java
 18244 cip   17   0 3142m  80m  15m S  0.2  2.0   0:17.61 java
 18367 cip   18   0 2169m  88m  15m S  0.1  2.3   0:17.46 java
 18012 cip   18   0 3141m  77m  15m S  0.1  2.0   0:14.49 java
 18584 cip   21   0 m  46m  15m S  0.1  1.2   0:05.12 java

 My Hadoop program gets no more than 1 percent of the CPU time in
 total, compared with the rosetta.mpich program's 99.7%.

 I'm sure my program is making progress, since the log files tell me they are
 running normally.

 Someone told me it's the nature of Java programs to get low CPU priority,
 especially compared with C programs.

 Is that true?

 Regards
 Song Liu in Suzhou University.


Example of deploying jars through DistributedCache?

2009-03-01 Thread Aaron Kimball
Hi all,

I'm stumped as to how to use the distributed cache's classpath feature. I
have a library of Java classes I'd like to distribute to jobs and use in my
mapper; I figured the DCache's addFileToClassPath() method was the correct
means, given the example at
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html.


I've boiled it down to the following non-working example:

in TestDriver.java:


  private void runJob() throws IOException {
    JobConf conf = new JobConf(getConf(), TestDriver.class);

    // do standard job configuration.
    FileInputFormat.addInputPath(conf, new Path("input"));
    FileOutputFormat.setOutputPath(conf, new Path("output"));

    conf.setMapperClass(TestMapper.class);
    conf.setNumReduceTasks(0);

    // load aaronTest2.jar into the dcache; this contains the class ValueProvider
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("aaronTest2.jar"), new Path("tmp/aaronTest2.jar"));
    DistributedCache.addFileToClassPath(new Path("tmp/aaronTest2.jar"), conf);

    // run the job.
    JobClient.runJob(conf);
  }


 and then in TestMapper:

  public void map(LongWritable key, Text value,
      OutputCollector<LongWritable, Text> output,
      Reporter reporter) throws IOException {

    try {
      ValueProvider vp = (ValueProvider) Class.forName("ValueProvider").newInstance();
      Text val = vp.getValue();
      output.collect(new LongWritable(1), val);
    } catch (ClassNotFoundException e) {
      throw new IOException("not found: " + e.toString()); // newInstance() throws to here.
    } catch (Exception e) {
      throw new IOException("Exception: " + e.toString());
    }
  }


The class ValueProvider is to be loaded from aaronTest2.jar. I can verify
that this code works if I put ValueProvider into the main jar I deploy. I
can verify that aaronTest2.jar makes it into the
${mapred.local.dir}/taskTracker/archive/

But when run with ValueProvider in aaronTest2.jar, the job fails with:

$ bin/hadoop jar aaronTest1.jar TestDriver
09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to process
: 10
09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to process
: 10
09/03/01 22:36:04 INFO mapred.JobClient: Running job: job_200903012210_0005
09/03/01 22:36:05 INFO mapred.JobClient:  map 0% reduce 0%
09/03/01 22:36:14 INFO mapred.JobClient: Task Id :
attempt_200903012210_0005_m_00_0, Status : FAILED
java.io.IOException: not found: java.lang.ClassNotFoundException:
ValueProvider
at TestMapper.map(Unknown Source)
at TestMapper.map(Unknown Source)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
at
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)


Do I need to do something else (maybe in Mapper.configure()?) to actually
classload the jar? The documentation makes me believe it should already be
in the classpath by doing only what I've done above. I'm on Hadoop 0.18.3.

Thanks,
- Aaron


Re: RecordReader and non thread safe JNI libraries

2009-03-01 Thread Aaron Kimball
It's situation (2). Each map task gets its own JVM instance; this has its
own RecordReader and its own Mapper implementation. There's basically a loop
in each task jvm that says:

while (recordReader.next(k, v)) {
  myMapper.map(k, v, output, reporter);
}

If your mapper and the RR use the same library and tread on one another's
state, you're going to have undefined results.

- Aaron


On Sun, Mar 1, 2009 at 8:33 PM, Saptarshi Guha saptarshi.g...@gmail.comwrote:

 Hello,
 I am quite confused and my email seems to prove it. My question is
 essentially, I need to use this non thread safe library in the Mapper,
 Reducer and RecordReader. assume, i do not create threads.
 Will I run into any thread safety issues?

 In a given JVM, the maps will run sequentially, so will the reduces,
 but will maps run alongside recorder reader?

 Hope this is clearer.
 Regards


 Saptarshi Guha



 On Sun, Mar 1, 2009 at 11:07 PM, Saptarshi Guha
 saptarshi.g...@gmail.com wrote:
  Hello,
  My RecordReader subclass reads from object X. To parse this object and
  emit records, i need the use of a C library and a JNI wrapper.
 
 public boolean next(LongWritable key, BytesWritable value) throws
 IOException {
 if (leftover == 0) return false;
 long wi = pos + split.getStart();
 key.set(wi);
 value.readFields(X.at(wi));
 pos ++; leftover --;
 return true;
 }
 
  X.at uses the JNI lib to read a record number wi
 
  My question is: who is running this?
  1) For a given job, is one instance of this running on each
  tasktracker? reading records and feeding to the mappers on its
  machine?
  Or,
  2) as I have mapred.tasktracker.map.tasks.maximum == 7, does each jvm
  launched have one RecordReader running feeding records to the maps its
  jvm is running.
 
  If it's either (1) or (2), I guess I'm safe from threading issues.
 
  Please correct me if i'm totally wrong.
  Regards
 
  Saptarshi Guha
 



Re: GenericOptionsParser warning

2009-02-18 Thread Aaron Kimball
You should put this stub code in your program as the means to start your
MapReduce job:

public class Foo extends Configured implements Tool {

  public int run(String [] args) throws IOException {
JobConf conf = new JobConf(getConf(), Foo.class);
// run the job here.
return 0;
  }

  public static void main(String [] args) throws Exception {
int ret = ToolRunner.run(new Foo(), args); // calls your run() method.
System.exit(ret);
  }
}


On Wed, Feb 18, 2009 at 7:09 AM, Rasit OZDAS rasitoz...@gmail.com wrote:

 Hi,
 There is a JIRA issue about this problem, if I understand it correctly:
 https://issues.apache.org/jira/browse/HADOOP-3743

 Strange, that I searched all source code, but there exists only this
 control
 in 2 places:

 if (!(job.getBoolean("mapred.used.genericoptionsparser", false))) {
   LOG.warn("Use GenericOptionsParser for parsing the arguments. " +
            "Applications should implement Tool for the same.");
 }

 Just an if block for logging, no extra controls.
 Am I missing something?

 If your class implements Tool, than there shouldn't be a warning.

 Cheers,
 Rasit

 2009/2/18 Steve Loughran ste...@apache.org

  Sandhya E wrote:
 
  Hi All
 
  I prepare my JobConf object in a java class, by calling various set
  apis in JobConf object. When I submit the jobconf object using
  JobClient.runJob(conf), I'm seeing the warning:
  Use GenericOptionsParser for parsing the arguments. Applications
  should implement Tool for the same. From hadoop sources it looks like
  setting mapred.used.genericoptionsparser will prevent this warning.
  But if I set this flag to true, will it have some other side effects.
 
  Thanks
  Sandhya
 
 
  Seen this message too -and it annoys me; not tracked it down
 



 --
 M. Raşit ÖZDAŞ



Re: Finding small subset in very large dataset

2009-02-11 Thread Aaron Kimball
I don't see why a HAR archive needs to be involved. You can use a MapFile to
create a scannable index over a SequenceFile and do lookups that way.

But if A is small enough to fit in RAM, then there is a much simpler way:
Write it out to a file and disseminate it to all mappers via the
DistributedCache. They then each read the entire A set into a HashSet or
other data structure during configure(), before they scan through their
slices of B. They then emit only the B values which hit in A. This is called
a map-side join. If you don't care about sorted ordering of your results,
you can then disable the reducers entirely.

Hive already supports this behavior; but you have to explicitly tell it to
enable map-side joins for each query because only you know that one data set
is small enough ahead of time.

If your A set doesn't fit in RAM, you'll need to get more creative. One
possibility is to do the same thing as above, but instead of reading all of
A into memory, use a hash function to squash the keys from A into some
bounded amount of RAM. For example, allocate yourself a 256 MB bitvector;
for each key in A, set bitvector[hash(A_key) % len(bitvector)] = 1. Then for
each B key in the mapper, if bitvector[hash(B_key) % len(bitvector)] == 1,
then it may match an A key; if it's 0 then it definitely does not match an A
key. For each potential match, send it to the reducer. Send all the A keys
to the reducer as well, where the  precise joining will occur. (Note: this
is effectively the same thing as a Bloom Filter.)

This will send much less data to each reducer and should see better
throughput.

- Aaron


On Wed, Feb 11, 2009 at 4:07 PM, Amit Chandel amitchan...@gmail.com wrote:

 Are the keys in collection B unique?

 If so, I would like to try this approach:
 For each key, value of collection B, make a file out of it with file name
 given by MD5 hash of the key, and value being its content, and then
 store all these files into a HAR archive.
 The HAR archive will create an index for you over the keys.
 Now you can iterate over the collection A, get the MD5 hash of the key, and
 look up in the archive for the file (to get the value).

 On Wed, Feb 11, 2009 at 4:39 PM, Thibaut_ tbr...@blue.lu wrote:

 
  Hi,
 
  Let's say the smaller subset has name A. It is a relatively small
  collection, < 100 000 entries (could also be only 100), with nearly no
  payload as value.
  Collection B is a big collection with 10 000 000 entries (each key of A
  also exists in the collection B), where the value for each key is
  relatively big (> 100 KB).
 
  For all the keys in A, I need to get the corresponding value from B and
  collect it in the output.
 
 
  - I can do this by reading in both files, and on the reduce step, do my
  computations and collect only those which are both in A and B. The map
  phase
  however will take very long as all the key/value pairs of collection B
 need
  to be sorted (and each key's value is 100 KB) at the end of the map
 phase,
  which is overkill if A is very small.
 
  What I would need is an option to somehow make the intersection first
  (Mapper only on keys, then a reduce function based only on keys and not
 the
  corresponding values which collects the keys I want to take), and then
  running the map input and filtering the output collector or the input
 based
  on the results from the reduce phase.
 
  Or is there another faster way? Collection A could be so big that it
  doesn't
  fit into the memory. I could split collection A up into multiple smaller
  collections, but that would make it more complicated, so I want to evade
  that route. (This is similar to the approach I described above, just a
  manual approach)
 
  Thanks,
  Thibaut
  --
  View this message in context:
 
 http://www.nabble.com/Finding-small-subset-in-very-large-dataset-tp21964853p21964853.html
  Sent from the Hadoop core-user mailing list archive at Nabble.com.
 
 



Re: HDD benchmark/checking tool

2009-02-03 Thread Aaron Kimball
Dmitry,

Look into cluster/system monitoring tools: nagios and ganglia are two to
start with.
- Aaron

On Tue, Feb 3, 2009 at 9:53 AM, Dmitry Pushkarev u...@stanford.edu wrote:

 Dear hadoop users,



 Recently I have had a number of drive failures that slowed down processes a
 lot until they were discovered. It is there any easy way or tool, to check
 HDD performance and see if there any IO errors?

 Currently I wrote a simple script that looks at /var/log/messages and greps
 everything abnormal for /dev/sdaX. But if you have better solution I'd
 appreciate if you share it.



 ---

 Dmitry Pushkarev

 +1-650-644-8988






Re: Hadoop's reduce tasks are freezes at 0%.

2009-02-03 Thread Aaron Kimball
Alternatively, maybe all the TaskTracker nodes can contact the NameNode and
JobTracker, but cannot communicate with one another due to firewall issues?
- Aaron

On Mon, Feb 2, 2009 at 7:54 PM, jason hadoop jason.had...@gmail.com wrote:

 A reduce stall at 0% implies that the map tasks are not outputting any
 records via the output collector.
 You need to go look at the task tracker and the task logs on all of your
 slave machines, to see if anything that seems odd appears in the logs.
 On the tasktracker web interface detail screen for your job:
 Are all of the map tasks finished?
 Are any of the map tasks started?
 Are there any TaskTracker nodes to service your job?

 On Sun, Feb 1, 2009 at 11:41 PM, Kwang-Min Choi kmbest.c...@samsung.com
 wrote:

  I'm newbie in Hadoop.
  and i'm trying to follow Hadoop Quick Guide at hadoop homepage.
  but, there are some problems...
  Downloading, unzipping hadoop is done.
  and ssh successfully operate without password phrase.
 
  once... I execute grep example attached to Hadoop...
  map task is ok. it reaches 100%.
  but reduce task freezes at 0% without any error message.
  I've waited it for more than 1 hour, but it still freezes...
  same job in standalone mode is well done...
 
  i tried it with version 0.18.3 and 0.17.2.1.
  all of them had same problem.
 
  could help me to solve this problem?
 
  Additionally...
  I'm working on cloud-infra of GoGrid(Redhat).
  So, the disk's space & health are OK.
  and, i've installed JDK 1.6.11 for linux successfully.
 
 
  - KKwams
 



Re: extra documentation on how to write your own partitioner class

2009-02-03 Thread Aaron Kimball
er?

It seems to be using value.get(). That having been said, you should really
partition based on key, not on value. (I am not sure why, exactly, the value
is provided to the getPartition() method.)


Moreover, I think the problem is that you are using division ( / ) not
modulus ( % ).  Your code simplifies to:   (value.get() / T) / (T /
numPartitions) = value.get() * numPartitions / T^2.

The contract of getPartition() is that it returns a value in [0,
numPartitions). The division operators are not guaranteed to return anything
in this range, but (foo % numPartitions) will always do the right thing. So
it's probably just assigning everything to reduce partition 0.
(Alternatively, it could be that value * numPartitions < T^2 for any values
of T you're testing with, which means that integer division will return 0.)

- Aaron


On Fri, Jan 30, 2009 at 3:43 PM, Sandy snickerdoodl...@gmail.com wrote:

 Hi James,

 Thank you very much! :-)

 -SM

 On Fri, Jan 30, 2009 at 4:17 PM, james warren ja...@rockyou.com wrote:

  Hello Sandy -
  Your partitioner isn't using any information from the key/value pair -
 it's
  only using the value T which is read once from the job configuration.
   getPartition() will always return the same value, so all of your data is
  being sent to one reducer. :P
 
  cheers,
  -James
 
  On Fri, Jan 30, 2009 at 1:32 PM, Sandy snickerdoodl...@gmail.com
 wrote:
 
   Hello,
  
   Could someone point me toward some more documentation on how to write
  one's
   own partition class? I have having quite a bit of trouble getting mine
 to
   work. So far, it looks something like this:
  
    public class myPartitioner extends MapReduceBase
        implements Partitioner<IntWritable, IntWritable> {

      private int T;

      public void configure(JobConf job) {
        super.configure(job);
        String myT = job.get("tval"); // this is user defined
        T = Integer.parseInt(myT);
      }

      public int getPartition(IntWritable key, IntWritable value,
          int numReduceTasks) {
        int newT = (T / numReduceTasks);
        int id = (value.get() / T);
        return (int) (id / newT);
      }
    }
  
   In the run() function of my M/R program I just set it using:
  
   conf.setPartitionerClass(myPartitioner.class);
  
   Is there anything else I need to set in the run() function?
  
  
   The code compiles fine. When I run it, I know it is using the
   partitioner,
   since I get different output than if I just let it use HashPartitioner.
   However, it is not splitting between the reducers at all! If I set the
   number of reducers to 2, all the output shows up in part-0, while
   part-1 has nothing.
  
   I am having trouble debugging this since I don't know how I can observe
  the
   values of numReduceTasks (which I assume is being set by the system).
 Is
   this a proper assumption?
  
   If I try to insert any println() statements in the function, it isn't
   outputted to either my terminal or my log files. Could someone give me
  some
   general advice on how best to debug pieces of code like this?
  
 



Re: HADOOP-2536 supports Oracle too?

2009-02-03 Thread Aaron Kimball
HADOOP-2536 supports the JDBC connector interface. Any database that exposes
a JDBC library will work.
- Aaron

On Tue, Feb 3, 2009 at 6:17 PM, Amandeep Khurana ama...@gmail.com wrote:

 Does the patch HADOOP-2536 support connecting to Oracle databases as well?
 Or is it just limited to MySQL?

 Amandeep


 Amandeep Khurana
 Computer Science Graduate Student
 University of California, Santa Cruz



Re: sudden instability in 0.18.2

2009-01-28 Thread Aaron Kimball
Hi David,

If your tasks are failing on only the new nodes, it's likely that you're
missing a library or something on those machines. See this Hadoop tutorial
http://public.yahoo.com/gogate/hadoop-tutorial/html/module5.html about
distributing debug scripts. These will allow you to capture stdout/err and
the syslog from tasks that fail.

- Aaron

On Wed, Jan 28, 2009 at 9:40 AM, Sagar Naik sn...@attributor.com wrote:

 Pl check which nodes have these failures.

 I guess the new tasktrackers/machines  are not configured correctly.
 As a result, the map-task will die and the remaining map-tasks will be
 sucked onto these machines


 -Sagar


 David J. O'Dell wrote:

 We've been running 0.18.2 for over a month on an 8 node cluster.
 Last week we added 4 more nodes to the cluster and have experienced 2
 failures to the tasktrackers since then.
 The namenodes are running fine but all jobs submitted will die when
 submitted with this error on the tasktrackers.

 2009-01-28 08:07:55,556 INFO org.apache.hadoop.mapred.TaskTracker:
 LaunchTaskAction: attempt_200901280756_0012_m_74_2
 2009-01-28 08:07:55,682 WARN org.apache.hadoop.mapred.TaskRunner:
 attempt_200901280756_0012_m_74_2 Child Error
 java.io.IOException: Task process exit with nonzero status of 1.
at
 org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)

 I tried running the tasktrackers in debug mode but the entries above are
 all that show up in the logs.
 As of now my cluster is down.






Re: sudden instability in 0.18.2

2009-01-28 Thread Aaron Kimball
Wow. How many subdirectories were there? how many jobs do you run a day?

- Aaron

On Wed, Jan 28, 2009 at 12:13 PM, David J. O'Dell dod...@videoegg.comwrote:

 It was failing on all the nodes both new and old.
 The problem was there were too many subdirectories under
 $HADOOP_HOME/logs/userlogs
 The fix was just to delete the subdirs and change this setting from 24
 hours (the default) to 2 hours.
 mapred.userlog.retain.hours

 Would have been nice if there was an error message that pointed to this.


 Aaron Kimball wrote:
  Hi David,
 
  If your tasks are failing on only the new nodes, it's likely that you're
  missing a library or something on those machines. See this Hadoop
 tutorial
  http://public.yahoo.com/gogate/hadoop-tutorial/html/module5.html about
  distributing debug scripts. These will allow you to capture stdout/err
 and
  the syslog from tasks that fail.
 
  - Aaron
 
  On Wed, Jan 28, 2009 at 9:40 AM, Sagar Naik sn...@attributor.com
 wrote:
 
 
  Pl check which nodes have these failures.
 
  I guess the new tasktrackers/machines  are not configured correctly.
  As a result, the map-task will die and the remaining map-tasks will be
  sucked onto these machines
 
 
  -Sagar
 
 
  David J. O'Dell wrote:
 
 
  We've been running 0.18.2 for over a month on an 8 node cluster.
  Last week we added 4 more nodes to the cluster and have experienced 2
  failures to the tasktrackers since then.
  The namenodes are running fine but all jobs submitted will die when
  submitted with this error on the tasktrackers.
 
  2009-01-28 08:07:55,556 INFO org.apache.hadoop.mapred.TaskTracker:
  LaunchTaskAction: attempt_200901280756_0012_m_74_2
  2009-01-28 08:07:55,682 WARN org.apache.hadoop.mapred.TaskRunner:
  attempt_200901280756_0012_m_74_2 Child Error
  java.io.IOException: Task process exit with nonzero status of 1.
 at
  org.apache.hadoop.mapred.TaskRunner.runChild(TaskRunner.java:462)
 at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:403)
 
  I tried running the tasktrackers in debug mode but the entries above
 are
  all that show up in the logs.
  As of now my cluster is down.
 
 
 
 

 --
 David O'Dell
 Director, Operations
 e: dod...@videoegg.com
 t:  (415) 738-5152
 180 Townsend St., Third Floor
 San Francisco, CA 94107




Re: tools for scrubbing HDFS data nodes?

2009-01-28 Thread Aaron Kimball
By scrub do you mean delete the blocks from the node?

Read your conf/hadoop-site.xml file to determine where dfs.data.dir points,
then for each directory in that list, just rm the directory. If you want to
ensure that your data is preserved with appropriate replication levels on
the rest of your cluster, you should use Hadoop's DataNode Decommission
feature to up-replicate the data before you blow a copy away.

- Aaron

On Wed, Jan 28, 2009 at 2:10 PM, Sriram Rao srirams...@gmail.com wrote:

 Hi,

 Is there a tool that one could run on a datanode to scrub all the
 blocks on that node?

 Sriram



Re: What happens in HDFS DataNode recovery?

2009-01-26 Thread Aaron Kimball
Also, see the balancer tool that comes with Hadoop. This background process
should be run periodically (Every week or so?) to make sure that data's
evenly distributed.

http://hadoop.apache.org/core/docs/r0.19.0/hdfs_user_guide.html#Rebalancer

- Aaron

On Sat, Jan 24, 2009 at 7:40 PM, jason hadoop jason.had...@gmail.comwrote:

 The blocks will be invalidated on the returned to service datanode.
 If you want to save your namenode and network a lot of work, wipe the hdfs
 block storage directory before returning the Datanode to service.
 dfs.data.dir will be the directory; most likely the value is
 ${hadoop.tmp.dir}/dfs/data

 Jason - Ex Attributor

 On Sat, Jan 24, 2009 at 6:19 PM, C G parallel...@yahoo.com wrote:

  Hi All:
 
  I elected to take a node out of one of our grids for service.  Naturally
  HDFS recognized the loss of the DataNode and did the right stuff, fixing
  replication issues and ultimately delivering a clean file system.
 
  So now the node I removed is ready to go back in service.  When I return
 it
  to service a bunch of files will suddenly have a replication of 4 instead
 of
  3.  My questions:
 
  1.  Will HDFS delete a copy of the data to bring replication back to 3?
  2.  If (1) above is  yes, will it remove the copy by deleting from other
  nodes, or will it remove files from the returned node, or both?
 
  The motivation for asking the questions are that I have a file system
 which
  is extremely unbalanced - we recently doubled the size of the grid with a
  few dozen terabytes already stored on the existing nodes. I am wondering
 if
  an easy way to restore some sense of balance is to cycle through the old
  nodes, removing each one from service for several hours and then return
 it
  to service.
 
  Thoughts?
 
  Thanks in Advance,
  C G
 
 
 
 
 
 
 



Re: Mapred job parallelism

2009-01-26 Thread Aaron Kimball
Indeed, you will need to enable the Fair Scheduler or Capacity Scheduler
(which are both in 0.19) to do this. mapred.map.tasks is more a hint than
anything else -- if you have more files to map than you set this value to,
it will use more tasks than you configured the job to use. The newer schedulers
will ensure that each job's many map tasks are only using a portion of the
available slots.
- Aaron

On Mon, Jan 26, 2009 at 1:43 PM, jason hadoop jason.had...@gmail.comwrote:

 I believe that the scheduler code in 0.19.0 has a framework for this, but I
 haven't dug into it in detail yet.

 http://hadoop.apache.org/core/docs/r0.19.0/capacity_scheduler.html

 From what I gather you would set up 2 queues, each with guaranteed access to
 1/2 of the cluster. Then you submit your jobs to alternate queues.

 This is not ideal as you have to balance what queue you submit jobs to, to
 ensure that there is some depth.


 On Mon, Jan 26, 2009 at 1:30 PM, Sagar Naik sn...@attributor.com wrote:

  Hi Guys,
 
  I was trying to setup a cluster so that two jobs can run simultaneously.
 
  The conf :
  number of nodes : 4(say)
  mapred.tasktracker.map.tasks.maximum=2
 
 
  and in the joblClient
  mapred.map.tasks=4 (# of nodes)
 
 
  I also have a condition, that each job should have only one map-task per
  node
 
  In short, created 8 map slots and set the number of mappers to 4.
  So now, we have two jobs running simultaneously
 
  However, I realized that, if a tasktracker happens to die, potentially, I
  will have 2 map-tasks running on a node
 
 
  Setting mapred.tasktracker.map.tasks.maximum=1 in Jobclient has no
 effect.
  It is tasktracker property and cant be changed per job
 
  Any ideas on how to have 2 jobs running simultaneously ?
 
 
  -Sagar
 
 
 
 
 
 
 



Re: Distributed cache testing in local mode

2009-01-22 Thread Aaron Kimball
Hi Bhupesh,

I've noticed the same problem -- LocalJobRunner makes the DistributedCache
effectively not work; so my code often winds up with two codepaths to
retrieve the local data :\

You could try running in pseudo-distributed mode to test, though then you
lose the ability to run a single-stepping debugger on the whole end-to-end
process.

- Aaron

On Thu, Jan 22, 2009 at 11:29 AM, Bhupesh Bansal bban...@linkedin.comwrote:

 Hey folks,

 I am trying to use Distributed cache in hadoop jobs to pass around
 configuration files , external-jars (job sepecific) and some archive data.

 I want to test Job end-to-end in local mode, but I think the distributed
 caches are localized in TaskTracker code which is not called in local mode
 Through LocalJobRunner.

 I can do some fairly simple workarounds for this but was just wondering if
 folks have more ideas about it.

 Thanks
 Bhupesh




Re: Hadoop 0.17.1 = EOFException reading FSEdits file, what causes this? how to prevent?

2009-01-15 Thread Aaron Kimball
Does the file exist or maybe was it deleted? Also, are the permissions on
that directory set correctly, or could they have been changed out from under
you by accident?

- Aaron

On Tue, Jan 13, 2009 at 9:53 AM, Joe Montanez jmonta...@veoh.com wrote:

 Hi:



 I'm using Hadoop 0.17.1 and I'm encountering EOFException reading the
 FSEdits file.  I don't have a clear understanding what is causing this
 and how to prevent this.  Has anyone seen this and can advise?



 Thanks in advance,

 Joe



 2009-01-12 22:51:45,573 ERROR org.apache.hadoop.dfs.NameNode:
 java.io.EOFException

at java.io.DataInputStream.readFully(DataInputStream.java:180)

at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)

at
 org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)

at
 org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:599)

at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:766)

at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:640)

at
 org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:223)

at
 org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:80)

at
 org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:274)

at
 org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:255)

at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:133)

 at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:178)

 at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:164)

at
 org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)

at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)



 2009-01-12 22:51:45,574 INFO org.apache.hadoop.dfs.NameNode:
 SHUTDOWN_MSG:






Re: Map input records(on JobTracker website) increasing and decreasing

2009-01-05 Thread Aaron Kimball
The actual number of input records is most likely steadily increasing. The
counters on the web site are inaccurate until the job is complete; their
values will fluctuate wildly. I'm not sure why this is.

- Aaron

On Mon, Jan 5, 2009 at 8:34 AM, Saptarshi Guha saptarshi.g...@gmail.comwrote:

 Hello,
 When I check the job tracker web page, and look at the Map Input
 records read, the map input record count goes up to, say, 1.4MN, then drops
 to 410K, and then goes up again.
 The same happens with input/output bytes and output records.

 Why is this? Is there something wrong with the mapper code? In my map
 function, I assume I have received one line of input.
 The oscillatory behavior does not occur for tiny datasets, but for 1GB
 of data (tiny for others) I see this happening.

 Thank s
 Saptarshi
 --
 Saptarshi Guha - saptarshi.g...@gmail.com


