Re: mini node in a cluster

2012-06-04 Thread Tom Melendez
Hi Pat,

Sounds like you would just turn off the datanode and the tasktracker.
Your config will still point to the Namenode and JT, so you can still
launch jobs and read/write from HDFS.

You'll probably want to replicate the data off first of course.

Thanks,

Tom

On Mon, Jun 4, 2012 at 2:06 PM, Pat Ferrel p...@occamsmachete.com wrote:
 I have a machine that is part of the cluster, but I'd like to dedicate it to
 being the web server and running the db, but still have access to starting jobs
 and getting data out of hdfs. In other words I'd like to have the cores,
 memory, and disk only minimally affected by jobs running on the cluster, yet
 still have easy access when I need to get data out.

 I assume I can do something like set the max number of jobs for the node to
 0 and something similar for hdfs? Is there a recommended way to go about
 this?


Re: mini node in a cluster

2012-06-04 Thread Tom Melendez
Hi Pat,

 Sounds like the trick. This node is a slave, so its datanode and tasktracker
 are started from the master.
   - how do I start the cluster without starting the datanode and the
 tasktracker on the mini-node slave? Remove it from slaves?

There's no main cluster software; just don't start those services.
If you're on Linux and have init.d scripts, look for the ones that are
appended with datanode and tasktracker.

   - what do I minimally need to start on the mini-node?


Nothing except the hadoop jars.  The presence of the config files in
your CLASSPATH is all you need to talk to your cluster.  So, if you
can run hadoop dfs -ls /some/path/in/hdfs and it succeeds, you're
probably OK.
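
For reference, a minimal client-side sketch of what that looks like in Java, assuming only
the Hadoop jars plus the cluster's core-site.xml/hdfs-site.xml are on the classpath; the
path below is a placeholder:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientCheck {
  public static void main(String[] args) throws Exception {
    // Configuration picks up core-site.xml / hdfs-site.xml from the classpath,
    // so fs.default.name already points at the cluster's NameNode.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Roughly the equivalent of "hadoop dfs -ls /some/path/in/hdfs".
    for (FileStatus status : fs.listStatus(new Path("/some/path/in/hdfs"))) {
      System.out.println(status.getPath());
    }
  }
}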

 Also I have replication set to 2 so the data will just get re-replicated
 once the mini-node is reconfigured, right? There should be another copy
 somewhere on the cluster.


Probably.

It's not really a mini-node at that point; it's just a client, and it
isn't known by your cluster.  You could configure your laptop or any
other machine to do the same thing, for example.

Thanks,

Tom


Re: how to unit test my RawComparator

2012-03-31 Thread Tom Melendez
Hi Chris and all, hope you don't mind if I inject a question in here.
It's highly related IMO (famous last words).

On Sat, Mar 31, 2012 at 2:18 PM, Chris White chriswhite...@gmail.com wrote:
 You can serialize your Writables to a ByteArrayOutputStream and then
 get its underlying byte array:

 ByteArrayOutputStream baos = new ByteArrayOutputStream();
 DataOutputStream dos = new DataOutputStream(baos);
 Writable myWritable = new Text("text");
 myWritable.write(dos);
 byte[] bytes = baos.toByteArray();


I popped this into a quick test and it failed.  What I want are the
exact bytes back from the Writable (in my case, BytesWritable).  So,
this fails for me:

@Test
public void byteswritabletest() {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    DataOutputStream dos = new DataOutputStream(baos);
    BytesWritable myBW = new BytesWritable("test".getBytes());
    try {
        myBW.write(dos);
    } catch (IOException e) {
        e.printStackTrace();
    }
    byte[] bytes = baos.toByteArray();
    // I get expected: 4, actual 8 with this assertion
    assertEquals("test".getBytes().length, bytes.length);
}


I see that in new versions of Text and BytesWritable, there is a
.copyBytes() method available that gives us exactly that.
https://reviews.apache.org/r/182/diff/

Is there another way (without the upgrade) to achieve that?
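
For what it's worth, the 4-vs-8 discrepancy above is because BytesWritable.write()
serializes a 4-byte length prefix ahead of the payload. Without upgrading to a version
that has copyBytes(), the exact bytes can be recovered from the Writable itself; a minimal
sketch, assuming the 0.20-era getBytes()/getLength() API:

import java.util.Arrays;
import org.apache.hadoop.io.BytesWritable;

BytesWritable bw = new BytesWritable("test".getBytes());
// getBytes() returns the (possibly padded) backing array; getLength() is the valid size.
byte[] exact = Arrays.copyOf(bw.getBytes(), bw.getLength());
// exact.length == 4 here, matching "test".getBytes().length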

Thanks,

Tom


Re: I am trying to run a large job and it is consistently failing with timeout - nothing happens for 600 sec

2012-01-18 Thread Tom Melendez
Sounds like mapred.task.timeout?  The default is 10 minutes.

http://hadoop.apache.org/common/docs/current/mapred-default.html
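
A minimal sketch of raising that timeout, assuming the old mapred API (the value is in
milliseconds, and it can equally be passed on the command line as -Dmapred.task.timeout=...):

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf();
// 30 minutes instead of the default 600 seconds; a value of 0 disables the timeout.
conf.setLong("mapred.task.timeout", 30 * 60 * 1000L);

Having the task report progress periodically (for example via Reporter.progress() or by
incrementing a counter) also resets this timer, which is often the better fix for
long-running writes.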

Thanks,

Tom

On Wed, Jan 18, 2012 at 2:05 PM, Steve Lewis lordjoe2...@gmail.com wrote:
 The map tasks fail, timing out after 600 sec.
 I am processing one 9 GB file with 16,000,000 records. Each record (think
 of it as a line) generates hundreds of key-value pairs.
 The job is unusual in that the output of the mapper, in terms of records or
 bytes, is orders of magnitude larger than the input.
 I have no idea what is slowing down the job except that the problem is in
 the writes.

 If I change the job to merely bypass a fraction of the context.write
 statements, the job succeeds.
 Here are the counters for one map task that failed and one that succeeded - I
 cannot understand how a write can take so long
 or what else the mapper might be doing.

 JOB FAILED WITH TIMEOUT

 Parser: TotalProteins 90,103; NumberFragments 10,933,089
 FileSystemCounters: HDFS_BYTES_READ 67,245,605; FILE_BYTES_WRITTEN 444,054,807
 Map-Reduce Framework: Combine output records 10,033,499; Map input records 90,103;
   Spilled Records 10,032,836; Map output bytes 3,520,182,794;
   Combine input records 10,844,881; Map output records 10,933,089
 Same code but fewer writes
 JOB SUCCEEDED

 Parser: TotalProteins 90,103; NumberFragments 206,658,758
 FileSystemCounters: FILE_BYTES_READ 111,578,253; HDFS_BYTES_READ 67,245,607;
   FILE_BYTES_WRITTEN 220,169,922
 Map-Reduce Framework: Combine output records 4,046,128; Map input records 90,103;
   Spilled Records 4,046,128; Map output bytes 662,354,413;
   Combine input records 4,098,609; Map output records 2,066,588
 Any bright ideas?
 --
 Steven M. Lewis PhD
 4221 105th Ave NE
 Kirkland, WA 98033
 206-384-1340 (cell)
 Skype lordjoe_com


Re: Question about accessing another HDFS

2011-12-08 Thread Tom Melendez
I'm hoping there is a better answer, but I'm thinking you could load
another configuration file (with B.mycompany.com in it) using Configuration,
grab a FileSystem obj with that and then go forward.  Seems like some
unnecessary overhead though.
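
A minimal sketch of that idea; FileSystem.get(URI, Configuration) also works and avoids
a second config file (the hostnames and paths are the ones from the question):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
// The default FileSystem, e.g. hdfs://A.mycompany.com from fs.default.name.
FileSystem fsA = FileSystem.get(conf);
// Ask explicitly for the other cluster; this avoids the "Wrong FS" check.
FileSystem fsB = FileSystem.get(URI.create("hdfs://B.mycompany.com"), conf);
FileStatus[] listing = fsB.listStatus(new Path("/some-other-dir"));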

Thanks,

Tom

On Thu, Dec 8, 2011 at 2:42 PM, Frank Astier fast...@yahoo-inc.com wrote:
 Hi -

 We have two namenodes set up at our company, say:

 hdfs://A.mycompany.com
 hdfs://B.mycompany.com

 From the command line, I can do:

 hadoop fs -ls hdfs://A.mycompany.com//some-dir

 And

 hadoop fs -ls hdfs://B.mycompany.com//some-other-dir

 I’m now trying to do the same from a Java program that uses the HDFS API. No 
 luck there. I get an exception: “Wrong FS”.

 Any idea what I’m missing in my Java program??

 Thanks,

 Frank


Re: Hadoop Streaming

2011-12-03 Thread Tom Melendez
So that code 126 should be kicked out by your program - do you know
what that means?

Your code can read from stdin?

Thanks,

Tom

On Sat, Dec 3, 2011 at 7:09 PM, Daniel Yehdego
dtyehd...@miners.utep.edu wrote:

 I have the following error in running hadoop streaming:

 PipeMapRed.waitOutputThreads(): subprocess failed with code 126
   at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
   at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
   at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
   at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
   at org.apache.hadoop.mapred.Child.main(Child.java:170)

 I couldn't find out any other error information.
 Any help?



Re: Hadoop Streaming

2011-12-03 Thread Tom Melendez
Hi Daniel,

I see from your other thread that your HADOOP script has a line like:

#!/bin/shrm -f temp.txt

I'm not sure what that is, exactly.  I suspect the -f is reading from
some file, and the while loop you had listed reads from stdin, it seems.

What does your input look like?  I think what's happening is that you
might be expecting lines of input and you're getting splits.  What
does your input look like?

You might want to try this:
-inputformat org.apache.hadoop.mapred.lib.NLineInputFormat

Thanks,

Tom




On Sat, Dec 3, 2011 at 7:22 PM, Daniel Yehdego
dtyehd...@miners.utep.edu wrote:

 Thanks Tom for your reply,
 I think my code is reading from stdin, because I tried it locally using the
 following command and it's running:
  $ bin/hadoop fs -cat 
 /user/yehdego/Hadoop-Data-New/RF00171_A.bpseqL3G1_seg_Optimized_Method.txt | 
 head -2 | ./HADOOP

 But when I tried streaming, it failed and gave me the error code 126.

 Date: Sat, 3 Dec 2011 19:14:20 -0800
 Subject: Re: Hadoop Streaming
 From: t...@supertom.com
 To: common-user@hadoop.apache.org

 So that code 126 should be kicked out by your program - do you know
 what that means?

 Your code can read from stdin?

 Thanks,

 Tom

 On Sat, Dec 3, 2011 at 7:09 PM, Daniel Yehdego
 dtyehd...@miners.utep.edu wrote:
 
  I have the following error in running hadoop streaming:

  PipeMapRed.waitOutputThreads(): subprocess failed with code 126
    at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
    at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
    at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
    at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

  I couldn't find out any other error information.
  Any help?
 



Re: Hadoop Streaming

2011-12-03 Thread Tom Melendez
Oh, I see the line wrapped.  My bad.

Either way, I think the NLineInputFormat is what you need.  I'm
assuming you want one line of input to execute on one mapper.

Thanks,

Tom

On Sat, Dec 3, 2011 at 7:57 PM, Daniel Yehdego
dtyehd...@miners.utep.edu wrote:

 TOM,
 What the HADOOP script does is read each line from STDIN and execute the
 program pknotsRG; temp.txt is a temporary file.
 The script is like this:

    #!/bin/sh
    rm -f temp.txt
    while read line
    do
      echo $line >> temp.txt
    done
    exec /data/yehdego/hadoop-0.20.2/PKNOTSRG/src/pknotsRG -k 0 -F temp.txt

 Date: Sat, 3 Dec 2011 19:49:46 -0800
 Subject: Re: Hadoop Streaming
 From: t...@supertom.com
 To: common-user@hadoop.apache.org

 Hi Daniel,

 I see from your other thread that your HADOOP script has a line like:

 #!/bin/shrm -f temp.txt

 I'm not sure what that is, exactly.  I suspect the -f is reading from
 some file, and the while loop you had listed reads from stdin, it seems.

 What does your input look like?  I think what's happening is that you
 might be expecting lines of input and you're getting splits.  What
 does your input look like?

 You might want to try this:
 -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat

 Thanks,

 Tom




 On Sat, Dec 3, 2011 at 7:22 PM, Daniel Yehdego
 dtyehd...@miners.utep.edu wrote:
 
  Thanks Tom for your reply,
  I think my code is reading from stdin, because I tried it locally using
  the following command and it's running:
   $ bin/hadoop fs -cat 
  /user/yehdego/Hadoop-Data-New/RF00171_A.bpseqL3G1_seg_Optimized_Method.txt 
  | head -2 | ./HADOOP
 
  But when I tried streaming, it failed and gave me the error code 126.
 
  Date: Sat, 3 Dec 2011 19:14:20 -0800
  Subject: Re: Hadoop Streaming
  From: t...@supertom.com
  To: common-user@hadoop.apache.org
 
  So that code 126 should be kicked out by your program - do you know
  what that means?
 
  Your code can read from stdin?
 
  Thanks,
 
  Tom
 
  On Sat, Dec 3, 2011 at 7:09 PM, Daniel Yehdego
  dtyehd...@miners.utep.edu wrote:
  
   I have the following error in running hadoop streaming:

   PipeMapRed.waitOutputThreads(): subprocess failed with code 126
     at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
     at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
     at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
     at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
     at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
     at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
     at org.apache.hadoop.mapred.Child.main(Child.java:170)

   I couldn't find out any other error information.
   Any help?
  
 



Re: How do I programmatically get total job execution time?

2011-12-02 Thread Tom Melendez
On Fri, Dec 2, 2011 at 9:57 AM, W.P. McNeill bill...@gmail.com wrote:
 After my Hadoop job has successfully completed, I'd like to log the total
 amount of time it took. This is the "Finished in" statistic in the web UI.
 How do I get this number programmatically? Is there some way I can query
 the Job object? I didn't see anything in the API documentation.

This probably *doesn't* help you, but if you're using (or planning on
using) oozie, it has a restful API that can give you this information.
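
For a plain MapReduce job, a minimal workaround sketch is to time the submission from
the driver, assuming the new mapreduce API and that the driver calls waitForCompletion()
itself (some releases also expose start/finish times on the Job object, but that depends
on the version):

// 'job' is assumed to be an already configured org.apache.hadoop.mapreduce.Job
long start = System.currentTimeMillis();
boolean success = job.waitForCompletion(true);
long elapsedMs = System.currentTimeMillis() - start;
System.out.println("Finished in " + (elapsedMs / 1000) + "s, success=" + success);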

Thanks,

Tom


questions regarding data storage and inputformat

2011-07-27 Thread Tom Melendez
Hi Folks,

I have a bunch of binary files which I've stored in a sequencefile.
The name of the file is the key, the data is the value and I've stored
them sorted by key.  (I'm not tied to using a sequencefile for this).
The current test data is only 50MB, but the real data will be 500MB -
1GB.

My M/R job requires that its input be several of these records in the
sequence file, which is determined by the key.  The sorting mentioned
above keeps these all packed together.

1. Any reason not to use a sequence file for this?  Perhaps a MapFile?
Since I've sorted it, I don't need random access, but I do need
to be aware of the keys, as I need to be sure that I get all of the
relevant keys sent to a given mapper.

2. Looks like I want a custom inputformat for this, extending
SequenceFileInputFormat.  Do you agree?  I'll gladly take some
opinions on this, as I ultimately want to split the data based on what's in
the file, which might be a little unorthodox.

3. Another idea might be to create separate seq files for each chunk of
records and make them non-splittable, ensuring that they go to a
single mapper.  Assuming I can get away with this, see any pros/cons
with that approach?
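
For idea 3, a minimal sketch of a non-splittable sequence-file input format, assuming
the old mapred API and Text/BytesWritable records (the class name is made up):

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

// Keeps each sequence file whole so a single mapper sees all of its records.
public class NonSplittableSequenceFileInputFormat
    extends SequenceFileInputFormat<Text, BytesWritable> {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}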

Thanks,

Tom

-- 
===
Skybox is hiring.
http://www.skyboximaging.com/careers/jobs


Re: questions regarding data storage and inputformat

2011-07-27 Thread Tom Melendez

 3. Another idea might be to create separate seq files for each chunk of
 records and make them non-splittable, ensuring that they go to a
 single mapper.  Assuming I can get away with this, see any pros/cons
 with that approach?

 Separate sequence files would require the least amount of custom code.


Thanks for the response, Joey.

So, if I were to do the above, I would still need a custom record
reader to put all the keys and values together, right?

Thanks,

Tom

-- 
===
Skybox is hiring.
http://www.skyboximaging.com/careers/jobs


Re: Custom FileOutputFormat / RecordWriter

2011-07-26 Thread Tom Melendez
Hi Harsh,

Cool, thanks for the details.  For anyone interested, with your tip
and description I was able to find an example in the Hadoop in
Action book (Chapter 7, p. 168).

Another question, though: it doesn't look like MultipleOutputs will
let me control the filename in a per-key (per-map) manner.  So,
basically, if my map receives a key of mykey, I want my file to be
mykey-someotherstuff.foo (this is a binary file).  Am I right about
this?

Thanks,

Tom

On Tue, Jul 26, 2011 at 1:34 AM, Harsh J ha...@cloudera.com wrote:
 Tom,

 What I meant to say was that doing this is well supported by the
 existing API/libraries:

 - The class MultipleOutputs supports providing a filename for an
 output. See MultipleOutputs.addNamedOutput usage [1].
 - The type 'NullWritable' is a special writable that doesn't do
 anything. So if it's configured into the above filename addition as a
 key-type, and you pass NullWritable.get() as the key in every write
 operation, you will end up just writing the value part of (key,
 value).
 - This way you do not have to write a custom OutputFormat for your use-case.

 [1] - 
 http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html
 (Also available for the new API, depending on which
 version/distribution of Hadoop you are on)
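
A minimal sketch of the setup Harsh describes, assuming the old mapred API; the named
output "binout", the key/value types, and the mapper class name are all illustrative:

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class ValueOnlyMapper extends MapReduceBase
    implements Mapper<Text, BytesWritable, NullWritable, BytesWritable> {

  // In the driver (illustrative):
  //   MultipleOutputs.addNamedOutput(conf, "binout",
  //       SequenceFileOutputFormat.class, NullWritable.class, BytesWritable.class);

  private MultipleOutputs mos;

  @Override
  public void configure(JobConf conf) {
    mos = new MultipleOutputs(conf);
  }

  @Override
  public void map(Text filename, BytesWritable data,
      OutputCollector<NullWritable, BytesWritable> output, Reporter reporter)
      throws IOException {
    // Only the value is written; the NullWritable key contributes nothing to the output.
    mos.getCollector("binout", reporter).collect(NullWritable.get(), data);
  }

  @Override
  public void close() throws IOException {
    mos.close();
  }
}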

 On Tue, Jul 26, 2011 at 3:36 AM, Tom Melendez t...@supertom.com wrote:
 Hi Harsh,

 Thanks for the response.  Unfortunately, I'm not following it.  :-)

 Could you elaborate a bit?

 Thanks,

 Tom

 On Mon, Jul 25, 2011 at 2:10 PM, Harsh J ha...@cloudera.com wrote:
 You can use MultipleOutputs (or MultipleTextOutputFormat for direct
 key-file mapping, but I'd still prefer the stable MultipleOutputs).
 Your sinking Key can be of NullWritable type, and you can keep passing
 an instance of NullWritable.get() to it in every cycle. This would
 write just the value, while the filenames are added/sourced from the
 key inside the mapper code.

 This, if you are not comfortable writing your own code and maintaining
 it, I s'pose. Your approach is correct as well, if the question was
 specifically that.

 On Tue, Jul 26, 2011 at 1:55 AM, Tom Melendez t...@supertom.com wrote:
 Hi Folks,

 Just doing a sanity check here.

 I have a map-only job, which produces a filename for a key and data as
 a value.  I want to write the value (data) into the key (filename) in
 the path specified when I run the job.

 The value (data) doesn't need any formatting, I can just write it to
 HDFS without modification.

 So, looking at this link (the Output Formats section):

 http://developer.yahoo.com/hadoop/tutorial/module5.html

 Looks like I want to:
 - create a new output format
 - override write, tell it not to call writekey as I don't want that written
 - new getRecordWriter method that uses the key as the filename and
 calls my outputformat

 Sound reasonable?

 Thanks,

 Tom

 --
 ===
 Skybox is hiring.
 http://www.skyboximaging.com/careers/jobs




 --
 Harsh J




 --
 ===
 Skybox is hiring.
 http://www.skyboximaging.com/careers/jobs




 --
 Harsh J




-- 
===
Skybox is hiring.
http://www.skyboximaging.com/careers/jobs


Re: Custom FileOutputFormat / RecordWriter

2011-07-25 Thread Tom Melendez
Hi Robert,

In this specific case, that's OK.  I'll never write to the same file
from two different mappers.  Otherwise, think it's cool?  I haven't
played with the outputformat before.

Thanks,

Tom
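
A rough, untested sketch of the plan quoted below (write only the value bytes into a file
named after the key), assuming the old mapred API; the class name is made up:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

public class KeyNamedBinaryOutputFormat extends FileOutputFormat<Text, BytesWritable> {

  @Override
  public RecordWriter<Text, BytesWritable> getRecordWriter(
      FileSystem ignored, final JobConf job, String name, Progressable progress)
      throws IOException {
    final Path outDir = FileOutputFormat.getOutputPath(job);
    return new RecordWriter<Text, BytesWritable>() {
      public void write(Text key, BytesWritable value) throws IOException {
        // The key is only used as the filename; it is never written into the file.
        Path file = new Path(outDir, key.toString());
        FileSystem fs = file.getFileSystem(job);
        FSDataOutputStream out = fs.create(file, false); // fail if the file already exists
        try {
          out.write(value.getBytes(), 0, value.getLength()); // raw value bytes only
        } finally {
          out.close();
        }
      }
      public void close(Reporter reporter) throws IOException {
        // Nothing is held open between records.
      }
    };
  }
}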

On Mon, Jul 25, 2011 at 1:30 PM, Robert Evans ev...@yahoo-inc.com wrote:
 Tom,

 That assumes that you will never write to the same file from two different 
 mappers or processes.  HDFS currently does not support writing to a single 
 file from multiple processes.

 --Bobby

 On 7/25/11 3:25 PM, Tom Melendez t...@supertom.com wrote:

 Hi Folks,

 Just doing a sanity check here.

 I have a map-only job, which produces a filename for a key and data as
 a value.  I want to write the value (data) into the key (filename) in
 the path specified when I run the job.

 The value (data) doesn't need any formatting, I can just write it to
 HDFS without modification.

 So, looking at this link (the Output Formats section):

 http://developer.yahoo.com/hadoop/tutorial/module5.html

 Looks like I want to:
 - create a new output format
 - override write, tell it not to call writekey as I don't want that written
 - new getRecordWriter method that uses the key as the filename and
 calls my outputformat

 Sound reasonable?

 Thanks,

 Tom

 --
 ===
 Skybox is hiring.
 http://www.skyboximaging.com/careers/jobs





-- 
===
Skybox is hiring.
http://www.skyboximaging.com/careers/jobs


Re: Custom FileOutputFormat / RecordWriter

2011-07-25 Thread Tom Melendez
Hi Bobby,

Yeah, that won't be a big deal in this case.  It will create about 40
files, each about 60MB.  This job is kind of an odd one that
won't be run very often.

Thanks,

Tom

On Mon, Jul 25, 2011 at 1:34 PM, Robert Evans ev...@yahoo-inc.com wrote:
 Tom,

 I also forgot to mention that if you are writing to lots of little files it 
 could cause issues too.  HDFS is designed to handle relatively few BIG files. 
  There is some work to improve this, but it is still a ways off.  So it is 
 likely going to be very slow and put a big load on the namenode if you are 
 going to create a lot of small files using this method.

 --Bobby


 On 7/25/11 3:30 PM, Robert Evans ev...@yahoo-inc.com wrote:

 Tom,

 That assumes that you will never write to the same file from two different 
 mappers or processes.  HDFS currently does not support writing to a single 
 file from multiple processes.

 --Bobby

 On 7/25/11 3:25 PM, Tom Melendez t...@supertom.com wrote:

 Hi Folks,

 Just doing a sanity check here.

 I have a map-only job, which produces a filename for a key and data as
 a value.  I want to write the value (data) into the key (filename) in
 the path specified when I run the job.

 The value (data) doesn't need any formatting, I can just write it to
 HDFS without modification.

 So, looking at this link (the Output Formats section):

 http://developer.yahoo.com/hadoop/tutorial/module5.html

 Looks like I want to:
 - create a new output format
 - override write, tell it not to call writekey as I don't want that written
 - new getRecordWriter method that uses the key as the filename and
 calls my outputformat

 Sound reasonable?

 Thanks,

 Tom

 --
 ===
 Skybox is hiring.
 http://www.skyboximaging.com/careers/jobs






-- 
===
Skybox is hiring.
http://www.skyboximaging.com/careers/jobs


Re: Custom FileOutputFormat / RecordWriter

2011-07-25 Thread Tom Melendez
Hi Harsh,

Thanks for the response.  Unfortunately, I'm not following it.  :-)

Could you elaborate a bit?

Thanks,

Tom

On Mon, Jul 25, 2011 at 2:10 PM, Harsh J ha...@cloudera.com wrote:
 You can use MultipleOutputs (or MultipleTextOutputFormat for direct
 key-file mapping, but I'd still prefer the stable MultipleOutputs).
 Your sinking Key can be of NullWritable type, and you can keep passing
 an instance of NullWritable.get() to it in every cycle. This would
 write just the value, while the filenames are added/sourced from the
 key inside the mapper code.

 This, if you are not comfortable writing your own code and maintaining
 it, I s'pose. Your approach is correct as well, if the question was
 specifically that.

 On Tue, Jul 26, 2011 at 1:55 AM, Tom Melendez t...@supertom.com wrote:
 Hi Folks,

 Just doing a sanity check here.

 I have a map-only job, which produces a filename for a key and data as
 a value.  I want to write the value (data) into the key (filename) in
 the path specified when I run the job.

 The value (data) doesn't need any formatting, I can just write it to
 HDFS without modification.

 So, looking at this link (the Output Formats section):

 http://developer.yahoo.com/hadoop/tutorial/module5.html

 Looks like I want to:
 - create a new output format
 - override write, tell it not to call writekey as I don't want that written
 - new getRecordWriter method that uses the key as the filename and
 calls my outputformat

 Sound reasonable?

 Thanks,

 Tom

 --
 ===
 Skybox is hiring.
 http://www.skyboximaging.com/careers/jobs




 --
 Harsh J




-- 
===
Skybox is hiring.
http://www.skyboximaging.com/careers/jobs


Re: tips and tools to optimize cluster

2011-05-24 Thread Tom Melendez
Thanks Chris, these are quite helpful.

Thanks,

Tom

On Tue, May 24, 2011 at 11:13 AM, Chris Smith csmi...@gmail.com wrote:
 Worth a look at OpenTSDB ( http://opentsdb.net/ ) as it doesn't lose
 precision on the historical data.
 It also has some neat tricks around the collection and display of data.

 Another useful tool is 'collectl' ( http://collectl.sourceforge.net/ )
 which is a lightweight Perl script that
 both captures and compresses the metrics, manages its metrics data
 files and then filters and presents
 the metrics as requested.

 I find collectl lightweight and useful enough that I set it up to
 capture everything and
 then leave it running in the background on most systems I build,
 because by the time you need the measurement
 data the event is usually in the past and difficult to reproduce.
 With collectl running I have a week to
 recognise the event and analyse/save the relevant data file(s); the
 data files are approx. 21MB/node/day gzipped.

 With a little bit of bash or awk or perl scripting you can convert the
 collectl output into a form easily
 loadable into Pig.  Pig also has User Defined Functions (UDFs) that
 can import the Hadoop job history so
 with some Pig Latin you can marry your infrastructure metrics with
 your job metrics; a bit like the cluster
 eating its own dog food.

 BTW, watch out for a little gotcha with Ganglia.  It doesn't seem to
 report the full jvm metrics via gmond
 although if you output the jvm metrics to file you get a record for
 each jvm on the node.  I haven't looked
 into it in detail yet, but it looks like Ganglia only reports the last
 jvm record in each batch. Anyone else seen
 this?

 Chris

 On 24 May 2011 01:48, Tom Melendez t...@supertom.com wrote:
 Hi Folks,

 I'm looking for tips, tricks and tools to get at node utilization to
 optimize our cluster.  I want to answer questions like:
 - what nodes ran a particular job?
 - how long did it take for those nodes to run the tasks for that job?
 - how/why did Hadoop pick those nodes to begin with?

 More detailed questions like
 - how much memory did the task for the job use on that node?
 - average CPU load on that node during the task run

 And more aggregate questions like:
 - are some nodes favored more than others?
 - utilization averages (generally, how many cores on that node are in use, 
 etc.)

 There are plenty more that I'm not asking, but you get the point?  So,
 what are you guys using for this?

 I see some mentions of Ganglia, so I'll definitely look into that.
 Anything else?  Anything you're using to monitor in real-time (like a
 'top' across the nodes or something like that)?

 Any info or war-stories greatly appreciated.

 Thanks,

 Tom




tips and tools to optimize cluster

2011-05-23 Thread Tom Melendez
Hi Folks,

I'm looking for tips, tricks and tools to get at node utilization to
optimize our cluster.  I want to answer questions like:
- what nodes ran a particular job?
- how long did it take for those nodes to run the tasks for that job?
- how/why did Hadoop pick those nodes to begin with?

More detailed questions like
- how much memory did the task for the job use on that node?
- average CPU load on that node during the task run

And more aggregate questions like:
- are some nodes favored more than others?
- utilization averages (generally, how many cores on that node are in use, etc.)

There are plenty more that I'm not asking, but you get the point?  So,
what are you guys using for this?

I see some mentions of Ganglia, so I'll definitely look into that.
Anything else?  Anything you're using to monitor in real-time (like a
'top' across the nodes or something like that)?

Any info or war-stories greatly appreciated.

Thanks,

Tom


Re: Linker errors with Hadoop pipes

2011-05-19 Thread Tom Melendez
I'm on Ubuntu and use pipes.  These are my ssl packages, notice libssl
and libssl-dev in particular:

supertom@hadoop-2:~/h-v8$ dpkg -l |grep -i ssl
ii  libopenssl-ruby      4.2                 OpenSSL interface for Ruby
ii  libopenssl-ruby1.8   1.8.7.249-2         OpenSSL interface for Ruby 1.8
ii  libssl-dev           0.9.8k-7ubuntu8.6   SSL development libraries, header files and
ii  libssl0.9.8          0.9.8k-7ubuntu8.6   SSL shared libraries
ii  openssl              0.9.8k-7ubuntu8     Secure Socket Layer (SSL) binary and related
ii  python-openssl       0.10-1              Python wrapper around the OpenSSL library
ii  ssl-cert             1.0.23ubuntu2       simple debconf wrapper for OpenSSL

Hope that helps,

Thanks,

Tom

On Thu, May 19, 2011 at 3:28 PM, tdp2110 thomas.d.pet...@gmail.com wrote:

 n00b here, just started playing around with pipes. I'm getting linker errors
 while compiling a simple WordCount example using hadoop-0.20.203 (current
 most recent version) that did not appear for the same code in hadoop-0.20.2

 Linker errors of the form: undefined reference to `EVP_sha1' in
 HadoopPipes.cc.

 EVP_sha1 (and all of the undefined references I get) are part of the openssl
 library which HadoopPipes.cc from hadoop-0.20.203 uses, but hadoop-0.20.2
 does not.

 I've tried adjusting my makefile to link to the ssl libraries, but I'm still
 out of luck. Any ideas would be greatly appreciated. Thanks!

 PS, here is my current makefile:
 CC = g++
 HADOOP_INSTALL = /usr/local/hadoop-0.20.203.0
 SSL_INSTALL = /usr/local/ssl
 PLATFORM = Linux-amd64-64
 CPPFLAGS = -m64 -I$(HADOOP_INSTALL)/c++/$(PLATFORM)/include \
            -I$(SSL_INSTALL)/include

 WordCount: WordCount.cc
    $(CC) $(CPPFLAGS) $< -Wall -Wextra -L$(SSL_INSTALL)/lib -lssl -lcrypto \
       -L$(HADOOP_INSTALL)/c++/$(PLATFORM)/lib -lhadooppipes -lhadooputils \
       -lpthread -g -O2 -o $@

 --
 View this message in context: 
 http://old.nabble.com/Linker-errors-with-Hadoop-pipes-tp31634596p31634596.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




passing classpath through to datanodes?

2011-05-06 Thread Tom Melendez
Hi Folks,

I'm having trouble getting a custom classpath through to the datanodes
in my cluster.

I'm using libhdfs and pipes, and the hdfsConnect call in libhdfs
requires that the classpath is set.  My code executes fine on a
standalone machine, but when I take it to the cluster, I can see that the
classpath is not set, as the error is emitted into the logs.

I've mucked around with the hadoop-env.sh file and restarted the
tasktracker and datanode, but since I'm new to tinkering with this
file, I'm hoping that someone here can help me with the steps for
getting my classpath set correctly.  Maybe hadoop-env.sh is NOT the
right way to do this?

Thanks,

Tom