Job Tracker/Name Node redundancy

2009-01-09 Thread Ryan LeCompte
Are there any plans to build redundancy/failover support for the Job
Tracker and Name Node components in Hadoop? Consider the following
scenario:

1) A data/cpu intensive job is submitted to a Hadoop cluster of 10 machines.
2) Half-way through the job execution, the Job Tracker or Name Node fails.
3) We bring up a new Job Tracker or Name Node manually.

-- Will the individual task trackers / data nodes reconnect to the
new masters? Or will the job have to be resubmitted? If we had
failover support, we could essentially set up 3 Job Tracker masters and
3 NameNode masters so that if one dies, another would gracefully take
over and start handling results from the child nodes.

Thanks!

Ryan


EC2 Usage?

2008-12-18 Thread Ryan LeCompte
Hello all,

Somewhat of an off-topic question, but I know there are
Hadoop + EC2 users here. Does anyone know if there is a programmatic
API to find out how many machine hours have been used by a
Hadoop cluster (or anything else) running on EC2? I know that you can log
into the EC2 web site and see this, but I'm wondering if there's a way
to access this data programmatically via web services?

Thanks,
Ryan


Re: EC2 Usage?

2008-12-18 Thread Ryan LeCompte
Thanks!

On Thu, Dec 18, 2008 at 12:17 PM, Tom White t...@cloudera.com wrote:
 Hi Ryan,

 The ec2-describe-instances command in the API tool reports the launch
 time for each instance, so you could work out the machine hours of
 your cluster using that information.

 Tom
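
As a rough illustration of Tom's suggestion (the timestamps below are made up, and real ec2-describe-instances output may need a different date format), you could total up the hours in a few lines of Java:

import java.time.Duration;
import java.time.Instant;

// Rough sketch: estimate billed machine-hours from launch times reported by
// ec2-describe-instances. The timestamps here are illustrative placeholders.
public class InstanceHours {
  public static void main(String[] args) {
    String[] launchTimes = { "2008-12-18T14:05:00Z", "2008-12-18T14:07:30Z" };
    Instant now = Instant.now();
    long totalHours = 0;
    for (String t : launchTimes) {
      Duration up = Duration.between(Instant.parse(t), now);
      totalHours += (up.toMinutes() + 59) / 60;   // EC2 bills whole instance-hours, rounded up
    }
    System.out.println("Approximate machine-hours so far: " + totalHours);
  }
}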

 On Thu, Dec 18, 2008 at 4:59 PM, Ryan LeCompte lecom...@gmail.com wrote:
 Hello all,

 Somewhat of an off-topic question, but I know there are
 Hadoop + EC2 users here. Does anyone know if there is a programmatic
 API to find out how many machine hours have been used by a
 Hadoop cluster (or anything else) running on EC2? I know that you can log
 into the EC2 web site and see this, but I'm wondering if there's a way
 to access this data programmatically via web services?

 Thanks,
 Ryan




Re: Streaming data into Hadoop

2008-12-09 Thread Ryan LeCompte

Even better! I'll try this out tomorrow.

Thanks,
Ryan




On Dec 9, 2008, at 10:36 PM, Aaron Kimball [EMAIL PROTECTED] wrote:

Note also that cat foo | bin/hadoop fs -put - some/hdfs/path will use stdin.
- Aaron
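
A programmatic equivalent of that stdin trick, as a rough sketch against the Hadoop FileSystem API (the local and HDFS paths are illustrative), is to open the compressed file locally and copy the decompressed stream straight into HDFS:

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Rough sketch: decompress a local .gz file and stream it into HDFS without
// ever writing the unpacked copy to local disk.
public class StreamIntoHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    InputStream in = new GZIPInputStream(new FileInputStream("/local/logs.gz"));
    FSDataOutputStream out = fs.create(new Path("/user/ryan/logs.txt"));
    IOUtils.copyBytes(in, out, 4096, true);   // true = close both streams when done
  }
}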

On Mon, Dec 8, 2008 at 5:56 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:



Just what I need -- thanks!

On Mon, Dec 8, 2008 at 7:31 PM, Alex Loddengaard [EMAIL PROTECTED] wrote:

This should answer your questions:

http://wiki.apache.org/hadoop/MountableHDFS

Alex

On Mon, Dec 8, 2008 at 2:19 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:



Hello all,

I normally upload files into hadoop via bin/hadoop fs -put file dest.


However, is there a way to somehow stream data into Hadoop?

For example, I'd love to do something like this:

zcat xxx > HADOOP_HDFS_DESTINATION

This would save me a ton of time since I don't have to first unpack
the .tgz file and upload the raw file into HDFS.

Is this possible with Hadoop 0.19?

Thanks,
Ryan







Streaming data into Hadoop

2008-12-08 Thread Ryan LeCompte
Hello all,

I normally upload files into hadoop via bin/hadoop fs -put file dest.

However, is there a way to somehow stream data into Hadoop?

For example, I'd love to do something like this:

zcat xxx > HADOOP_HDFS_DESTINATION

This would save me a ton of time since I don't have to first unpack
the .tgz file and upload the raw file into HDFS.

Is this possible with Hadoop 0.19?

Thanks,
Ryan


Re: Streaming data into Hadoop

2008-12-08 Thread Ryan LeCompte
Just what I need -- thanks!

On Mon, Dec 8, 2008 at 7:31 PM, Alex Loddengaard [EMAIL PROTECTED] wrote:
 This should answer your questions:

 http://wiki.apache.org/hadoop/MountableHDFS

 Alex

 On Mon, Dec 8, 2008 at 2:19 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:

 Hello all,

 I normally upload files into hadoop via bin/hadoop fs -put file dest.

 However, is there a way to somehow stream data into Hadoop?

 For example, I'd love to do something like this:

 zcat xxx > HADOOP_HDFS_DESTINATION

 This would save me a ton of time since I don't have to first unpack
 the .tgz file and upload the raw file into HDFS.

 Is this possible with Hadoop 0.19?

 Thanks,
 Ryan




Re: stack trace from hung task

2008-12-05 Thread Ryan LeCompte
For what it's worth, I started seeing these when I upgraded to 0.19. I
was using 10 reduces, but changed it to 30 reduces for my job and now
I don't see these errors any more.

Thanks,
Ryan
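
(For anyone hitting the same timeouts, a minimal sketch of that change in the job driver, assuming the old JobConf API; the equivalent config property is mapred.reduce.tasks.)

import org.apache.hadoop.mapred.JobConf;

// Minimal sketch: raise the number of reduce tasks for the job.
public class MoreReduces {
  public static void main(String[] args) {
    JobConf conf = new JobConf();       // the rest of the usual job setup goes here
    conf.setNumReduceTasks(30);         // was 10; same effect as setting mapred.reduce.tasks=30
  }
}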


On Fri, Dec 5, 2008 at 2:44 PM, Sriram Rao [EMAIL PROTECTED] wrote:
 Hi,

 When a task tracker kills a non-responsive task, it prints out a
 message "Task X not reported status for 600 seconds. Killing!".
 The stack trace it then dumps out is that of the task tracker itself.
 Is there a way to get the hung task to dump out its stack trace before
 exiting?  Would be nice if there was an easy way to send a kill -3 to
 the hung process and then kill it.

 Sriram



Hadoop balancer

2008-12-03 Thread Ryan LeCompte
I've tried running the bin/hadoop balancer command since I recently
added a new node to the Hadoop cluster. I noticed the following output
in the beginning:

08/12/03 10:26:35 INFO balancer.Balancer: Will move 10 GBbytes in this iteration
Dec 3, 2008 10:26:35 AM   0 0 KB
2.67 GB  10 GB
08/12/03 10:26:36 WARN balancer.Balancer: Error moving block
-6653850537520285828 from 10.30.2.31:50010 to 10.30.2.23:50010 through
10.30.2.31:50010: block move is failed

However, subsequent iterations have succeeded. Is this harmless or is
there a possibility that data was corrupted during the re-balancing
attempt?

Thanks,
Ryan


Hadoop and .tgz files

2008-12-01 Thread Ryan LeCompte
Hello all,

I'm using Hadoop 0.19 and just discovered that it has no problems
processing .tgz files that contain text files. I was under the
impression that it wouldn't be able to break a .tgz file up into
multiple maps, but instead just treat it as 1 map per .tgz file. Was
this a recent change or enhancement? I'm noticing that it is breaking
up the .tgz file into multiple maps.

Thanks,
Ryan


Re: Hadoop and .tgz files

2008-12-01 Thread Ryan LeCompte
I believe I spoke a little too soon. Looks like Hadoop supports .gz
files, not .tgz. :-)


On Mon, Dec 1, 2008 at 10:46 AM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello all,

 I'm using Hadoop 0.19 and just discovered that it has no problems
 processing .tgz files that contain text files. I was under the
 impression that it wouldn't be able to break a .tgz file up into
 multiple maps, but instead just treat it as 1 map per .tgz file. Was
 this a recent change or enhancement? I'm noticing that it is breaking
 up the .tgz file into multiple maps.

 Thanks,
 Ryan



Re: Question regarding reduce tasks

2008-11-03 Thread Ryan LeCompte
What happens when the reducer task gets invoked more than once? My
guess is once a reducer task finishes writing the data for a
particular key to HDFS, it won't somehow get re-executed again for the
same key, right?


On Mon, Nov 3, 2008 at 11:28 AM, Miles Osborne [EMAIL PROTECTED] wrote:
 you can't guarantee that a reducer (or mapper for that matter) will be
 executed exactly once unless you turn off speculative execution. But
 a distinct key gets sent to a single reducer, so yes, only one reducer
 will see a particular key + associated values.

 Miles
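
To make that concrete, here is a rough 0.18-style sketch (the key/value types are illustrative): reduce() is handed all values for a distinct key in one call, so code after the loop runs once per key, barring re-executed (speculative or failed) attempts.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Rough sketch: everything after the while loop executes once for this key (per attempt).
public class SumReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {
  public void reduce(Text key, Iterator<LongWritable> values,
                     OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new LongWritable(sum));   // the final, per-key action
  }
}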

 2008/11/3 Ryan LeCompte [EMAIL PROTECTED]:
 Hello,

 Is it safe to assume that only one reduce task will ever operate on
 values for a particular key? Or is it possible that more than one
 reduce task can work on values for the same key? The reason I ask is
 because I want to ensure that a piece of code that I write at the end
 of my reducer method will only ever be executed once after all values
 for a particular key are aggregated/summed.

 Thanks,
 Ryan




 --
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.



Re: NotYetReplicated exceptions when pushing large files into HDFS

2008-09-23 Thread Ryan LeCompte
Thanks. Is there a way to increase the retry amount?

Ryan
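
No definitive answer is archived here, but as far as I know the DFS client's write retry count is a configuration knob; a hedged sketch (the property name dfs.client.block.write.retries is from memory, so please verify it against your Hadoop version):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hedged sketch: raise the client-side block-write retry count (believed to default to 3)
// before doing the copy programmatically. Double-check the property name for your version.
public class MoreWriteRetries {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("dfs.client.block.write.retries", 10);
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("/local/bigfile"), new Path("/user/ryan/bigfile"));
  }
}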

On Mon, Sep 22, 2008 at 8:21 PM, lohit [EMAIL PROTECTED] wrote:
 Yes, these are warnings unless they fail 3 times, in which case your dfs
 -put command would fail with a stack trace.
 Thanks,
 Lohit



 - Original Message 
 From: Ryan LeCompte [EMAIL PROTECTED]
 To: core-user@hadoop.apache.org core-user@hadoop.apache.org
 Sent: Monday, September 22, 2008 5:18:01 PM
 Subject: Re: NotYetReplicated exceptions when pushing large files into HDFS

 I've noticed that although I get a few of these exceptions, the file
 is ultimately uploaded to the HDFS cluster. Does this mean that my
 file ended up getting there in 1 piece? The exceptions are just logged
 at the WARN level and indicate retry attempts.

 Thanks,
 Ryan


 On Mon, Sep 22, 2008 at 11:08 AM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello all,

 I'd love to be able to upload into HDFS very large files (e.g., 8 or
 10GB), but it seems like my only option is to chop up the file into
 smaller pieces. Otherwise, after a while I get NotYetReplication
 exceptions while the transfer is in progress. I'm using 0.18.1. Is
 there any way I can do this? Perhaps use something else besides
 bin/hadoop fs -put input output?

 Thanks,
 Ryan





Re: NotYetReplicated exceptions when pushing large files into HDFS

2008-09-22 Thread Ryan LeCompte
I've noticed that although I get a few of these exceptions, the file
is ultimately uploaded to the HDFS cluster. Does this mean that my
file ended up getting there in 1 piece? The exceptions are just logged
at the WARN level and indicate retry attempts.

Thanks,
Ryan


On Mon, Sep 22, 2008 at 11:08 AM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello all,

 I'd love to be able to upload into HDFS very large files (e.g., 8 or
 10GB), but it seems like my only option is to chop up the file into
 smaller pieces. Otherwise, after a while I get NotYetReplication
 exceptions while the transfer is in progress. I'm using 0.18.1. Is
 there any way I can do this? Perhaps use something else besides
 bin/hadoop fs -put input output?

 Thanks,
 Ryan



Reduce tasks running out of memory on small hadoop cluster

2008-09-20 Thread Ryan LeCompte
Hello all,

I'm setting up a small 3 node hadoop cluster (1 node for
namenode/jobtracker and the other two for datanode/tasktracker). The
map tasks finish fine, but the reduce tasks are failing at about 30%
with an out of memory error. My guess is because the amount of data
that I'm crunching through just won't be able to fit in memory during
the reduce tasks on two machines (max of 2 reduce tasks on each
machine). Is this expected? If I had a large hadoop cluster, then I
could increase the number of reduce tasks on each machine of the
cluster so that not all of the data is being processed in
just 4 JVMs on two machines like I currently have set up, correct? Is
there any way to get the reduce task to not try and hold all of the
data in memory, or is my only option to add more nodes to the cluster
to therefore increase the number of reduce tasks?

Thanks!

Ryan


Re: Reduce tasks running out of memory on small hadoop cluster

2008-09-20 Thread Ryan LeCompte
Yes I did, but that didn't solve my problem since I'm working with a
fairly large data set (8GB).


Thanks,
Ryan




On Sep 21, 2008, at 12:22 AM, Sandy [EMAIL PROTECTED] wrote:

Have you increased the heapsize in conf/hadoop-env.sh to 2000? This helped
me some, but eventually I had to upgrade to a system with more memory.

-SM
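
One hedged note on top of that: HADOOP_HEAPSIZE in hadoop-env.sh mainly sizes the daemons, while map and reduce tasks run in child JVMs whose heap comes from mapred.child.java.opts, so a sketch like the following may matter more for reduce-side OutOfMemoryErrors:

import org.apache.hadoop.mapred.JobConf;

// Rough sketch: give each task JVM a bigger heap (the default is only -Xmx200m).
public class BiggerChildHeap {
  public static void main(String[] args) {
    JobConf conf = new JobConf();       // the rest of the job setup goes here
    conf.set("mapred.child.java.opts", "-Xmx1024m");
  }
}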


On Sat, Sep 20, 2008 at 9:07 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:



Hello all,

I'm setting up a small 3 node hadoop cluster (1 node for
namenode/jobtracker and the other two for datanode/tasktracker). The
map tasks finish fine, but the reduce tasks are failing at about 30%
with an out of memory error. My guess is because the amount of data
that I'm crunching through just won't be able to fit in memory during
the reduce tasks on two machines (max of 2 reduce tasks on each
machine). Is this expected? If I had a large hadoop cluster, then I
could increase the number of reduce tasks on each machine of the
cluster so that not all of the data is being processed in
just 4 JVMs on two machines like I currently have set up, correct? Is
there any way to get the reduce task to not try and hold all of the
data in memory, or is my only option to add more nodes to the cluster
to therefore increase the number of reduce tasks?

Thanks!

Ryan



Re: Why can't Hadoop be used for online applications ?

2008-09-12 Thread Ryan LeCompte
Hadoop is best suited for distributed processing across many machines
of large data sets. Most people use Hadoop to plow through large data
sets in an offline fashion. One approach that you can use is to use
Hadoop to process your data, then put it in an optimized form in HBase
(i.e., similar to Google's Bigtable). Then, you can use HBase for
querying the data in an online-access fashion. Refer to
http://hadoop.apache.org/hbase/ for more information about HBase.

Ryan


On Fri, Sep 12, 2008 at 2:46 PM, souravm [EMAIL PROTECTED] wrote:
 Hi,

 Here is a basic doubt.

 I found it mentioned in various documentation that Hadoop is not
 recommended for online applications. Can anyone please elaborate on why?

 Regards,
 Sourav




Re: Why can't Hadoop be used for online applications ?

2008-09-12 Thread Ryan LeCompte
Hey Camilo,

HBase is not meant to be a replacement for MySQL or a traditional
RDBMS (HBase is not transactional, for instance). I'd recommend reading
the following article that describes what HBase/Bigtable really is:

http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable

Thanks,
Ryan


On Fri, Sep 12, 2008 at 3:25 PM, Camilo Gonzalez [EMAIL PROTECTED] wrote:
 Hi Ryan!

 Does this mean that HBase could be used for online applications, for
 example, replacing MySQL in database-driven applications?

 Does anyone have any kind of benchmarks about the comparison between MySQL
 queries/updates and HBase queries/updates?

 Have a nice day,

 Camilo.

 On Fri, Sep 12, 2008 at 1:55 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:

 Hadoop is best suited for distributed processing across many machines
 of large data sets. Most people use Hadoop to plow through large data
 sets in an offline fashion. One approach that you can use is to use
 Hadoop to process your data, then put it in an optimized form in HBase
 (i.e., similar to Google's Bigtable). Then, you can use HBase for
 querying the data in an online-access fashion. Refer to
 http://hadoop.apache.org/hbase/ for more information about HBase.

 Ryan


 On Fri, Sep 12, 2008 at 2:46 PM, souravm [EMAIL PROTECTED] wrote:
  Hi,
 
  Here is a basic doubt.
 
  I found it mentioned in various documentation that Hadoop is not
 recommended for online applications. Can anyone please elaborate on why?
 
  Regards,
  Sourav
 
 




Re: Issue in reduce phase with SortedMapWritable and custom Writables as values

2008-09-09 Thread Ryan LeCompte
Okay, I think I'm getting closer but now I'm running into another problem.

First off, I created my own CustomMapWritable that extends MapWritable
and invokes AbstractMapWritable.addToMap() to add my custom classes.
Now the map/reduce phases actually complete and the job as a whole
completes. However, when I try to use the SequenceFile API to later
read the output data, I'm getting a strange exception. First the code:

FileSystem fileSys = FileSystem.get(conf);
SequenceFile.Reader reader = new SequenceFile.Reader(fileSys, inFile,
conf);
Text key = new Text();
CustomWritable stats = new CustomWritable();
reader.next(key, stats);
reader.close();

And now the exception that's thrown:

java.io.IOException: can't find class: com.test.CustomStatsWritable
because com.test.CustomStatsWritable
at 
org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:210)
at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:145)
at com.test.CustomStatsWritable.readFields(UserStatsWritable.java:49)
at 
org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
...

Any ideas?

Thanks,
Ryan
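
For reference, a rough sketch of the registration subclass described above (the exact layout is illustrative):

import org.apache.hadoop.io.MapWritable;

// Rough sketch: registering the custom value class gives it a stable byte id so
// MapWritable can serialize and deserialize it.
public class CustomMapWritable extends MapWritable {
  public CustomMapWritable() {
    super();
    addToMap(CustomStatsWritable.class);   // the custom Writable from this thread
  }
}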


On Tue, Sep 9, 2008 at 12:36 AM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello,

 I'm attempting to use a SortedMapWritable with a LongWritable as the
 key and a custom implementation of org.apache.hadoop.io.Writable as
 the value. I notice that my program works fine when I use another
 primitive wrapper (e.g. Text) as the value, but fails with the
 following exception when I use my custom Writable instance:

 2008-09-08 23:25:02,072 INFO org.apache.hadoop.mapred.ReduceTask:
 Initiating in-memory merge with 1 segments...
 2008-09-08 23:25:02,077 INFO org.apache.hadoop.mapred.Merger: Merging
 1 sorted segments
 2008-09-08 23:25:02,077 INFO org.apache.hadoop.mapred.Merger: Down to
 the last merge-pass, with 1 segments left of total size: 5492 bytes
 2008-09-08 23:25:02,099 WARN org.apache.hadoop.mapred.ReduceTask:
 attempt_200809082247_0005_r_00_0 Merge of the inmemory files threw an
 exception: java.io.IOException: Intermedate merge failed
at 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2133)
at 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2064)
 Caused by: java.lang.RuntimeException: java.lang.NullPointerException
at 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:80)
at 
 org.apache.hadoop.io.SortedMapWritable.readFields(SortedMapWritable.java:179)
...

 I noticed that the AbstractMapWritable class has a protected
 addToMap(Class clazz) method. Do I somehow need to let my
 SortedMapWritable instance know about my custom Writable value? I've
 properly implemented the custom Writable object (it just contains a
 few primitives, like longs and ints).

 Any insight is appreciated.

 Thanks,
 Ryan



Re: Issue in reduce phase with SortedMapWritable and custom Writables as values

2008-09-09 Thread Ryan LeCompte
Based on some similar problems that I found others were having in the
mailing lists, it looks like the solution was to list my Map/Reduce
job JAR in the conf/hadoop-env.sh file under HADOOP_CLASSPATH. After
doing that and re-submitting the job, it all worked fine! I guess the
MapWritable class somehow doesn't share the same classpath as the
program that actually submits the job conf. Is this expected?

Thanks,
Ryan


On Tue, Sep 9, 2008 at 9:44 AM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Okay, I think I'm getting closer but now I'm running into another problem.

 First off, I created my own CustomMapWritable that extends MapWritable
 and invokes AbstractMapWritable.addToMap() to add my custom classes.
 Now the map/reduce phases actually complete and the job as a whole
 completes. However, when I try to use the SequenceFile API to later
 read the output data, I'm getting a strange exception. First the code:

 FileSystem fileSys = FileSystem.get(conf);
 SequenceFile.Reader reader = new SequenceFile.Reader(fileSys, inFile,
 conf);
 Text key = new Text();
 CustomWritable stats = new CustomWritable();
 reader.next(key, stats);
 reader.close();

 And now the exception that's thrown:

 java.io.IOException: can't find class: com.test.CustomStatsWritable
 because com.test.CustomStatsWritable
at 
 org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:210)
at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:145)
at com.test.CustomStatsWritable.readFields(UserStatsWritable.java:49)
at 
 org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
at 
 org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
 ...

 Any ideas?

 Thanks,
 Ryan


 On Tue, Sep 9, 2008 at 12:36 AM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello,

 I'm attempting to use a SortedMapWritable with a LongWritable as the
 key and a custom implementation of org.apache.hadoop.io.Writable as
 the value. I notice that my program works fine when I use another
 primitive wrapper (e.g. Text) as the value, but fails with the
 following exception when I use my custom Writable instance:

 2008-09-08 23:25:02,072 INFO org.apache.hadoop.mapred.ReduceTask:
 Initiating in-memory merge with 1 segments...
 2008-09-08 23:25:02,077 INFO org.apache.hadoop.mapred.Merger: Merging
 1 sorted segments
 2008-09-08 23:25:02,077 INFO org.apache.hadoop.mapred.Merger: Down to
 the last merge-pass, with 1 segments left of total size: 5492 bytes
 2008-09-08 23:25:02,099 WARN org.apache.hadoop.mapred.ReduceTask:
 attempt_200809082247_0005_r_00_0 Merge of the inmemory files threw an
 exception: java.io.IOException: Intermedate merge failed
at 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2133)
at 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2064)
 Caused by: java.lang.RuntimeException: java.lang.NullPointerException
at 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:80)
at 
 org.apache.hadoop.io.SortedMapWritable.readFields(SortedMapWritable.java:179)
...

 I noticed that the AbstractMapWritable class has a protected
 addToMap(Class clazz) method. Do I somehow need to let my
 SortedMapWritable instance know about my custom Writable value? I've
 properly implemented the custom Writable object (it just contains a
 few primitives, like longs and ints).

 Any insight is appreciated.

 Thanks,
 Ryan




Issue in reduce phase with SortedMapWritable and custom Writables as values

2008-09-08 Thread Ryan LeCompte
Hello,

I'm attempting to use a SortedMapWritable with a LongWritable as the
key and a custom implementation of org.apache.hadoop.io.Writable as
the value. I notice that my program works fine when I use another
primitive wrapper (e.g. Text) as the value, but fails with the
following exception when I use my custom Writable instance:

2008-09-08 23:25:02,072 INFO org.apache.hadoop.mapred.ReduceTask:
Initiating in-memory merge with 1 segments...
2008-09-08 23:25:02,077 INFO org.apache.hadoop.mapred.Merger: Merging
1 sorted segments
2008-09-08 23:25:02,077 INFO org.apache.hadoop.mapred.Merger: Down to
the last merge-pass, with 1 segments left of total size: 5492 bytes
2008-09-08 23:25:02,099 WARN org.apache.hadoop.mapred.ReduceTask:
attempt_200809082247_0005_r_00_0 Merge of the inmemory files threw an
exception: java.io.IOException: Intermedate merge failed
at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2133)
at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2064)
Caused by: java.lang.RuntimeException: java.lang.NullPointerException
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:80)
at 
org.apache.hadoop.io.SortedMapWritable.readFields(SortedMapWritable.java:179)
...

I noticed that the AbstractMapWritable class has a protected
addToMap(Class clazz) method. Do I somehow need to let my
SortedMapWritable instance know about my custom Writable value? I've
properly implemented the custom Writable object (it just contains a
few primitives, like longs and ints).

Any insight is appreciated.

Thanks,
Ryan


Re: Multiple input files

2008-09-06 Thread Ryan LeCompte
Hi Sayali,

Yes, you can submit a collection of files from HDFS as input to the
job. Please take a look at the WordCount example in the Map/Reduce
tutorial for an example:

http://hadoop.apache.org/core/docs/r0.18.0/mapred_tutorial.html#Example%3A+WordCount+v1.0

Ryan
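
A rough sketch of what that looks like with the 0.18-era API (the paths are illustrative); input paths can be a comma-separated list, a directory, or a glob like file*:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

// Rough sketch: a few ways to hand a job many input files at once.
public class ManyInputs {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    FileInputFormat.setInputPaths(conf, new Path("/user/sayali/input/file*"));   // glob
    // or: FileInputFormat.setInputPaths(conf, "/data/file000,/data/file001,/data/file002");
  }
}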


On Sat, Sep 6, 2008 at 9:03 AM, Sayali Kulkarni
[EMAIL PROTECTED] wrote:
 Hello,
 When starting a Hadoop job, I need to specify an input file and an output
 file. Can I instead specify a list of input files? For example, I have the
 input distributed across the files:
 file000,
 file001,
 file002,
 file003,
 ...
 So can I specify the input files as file*? I can add all my files to HDFS.

 Thanks in advance!
 --Sayali




Multiple output files

2008-09-06 Thread Ryan LeCompte
Hello,

I have a question regarding multiple output files that get produced as
a result of using multiple reduce tasks for a job (as opposed to only
one). If I'm using a custom writable and thus writing to a sequence
output, am I guaranteed that all of the data for a particular key will
appear in a single output file (e.g., part-), or is it possible
that the values could be split across multiple part- files? At the
end of the job I'm using the sequence file reader to read each custom
key/writable pair from each output file. Is it possible that the same
key could appear in multiple output files? If so, does Hadoop
automatically grab all of the values for a particular key in all of
the output files?

Thanks,
Ryan


Re: Multiple output files

2008-09-06 Thread Ryan LeCompte

This clears up my concerns. Thanks!

Ryan



On Sep 6, 2008, at 2:17 PM, Owen O'Malley [EMAIL PROTECTED] wrote:



On Sep 6, 2008, at 9:35 AM, Ryan LeCompte wrote:

I have a question regarding multiple output files that get produced as
a result of using multiple reduce tasks for a job (as opposed to only
one). If I'm using a custom writable and thus writing to a sequence
output, am I guaranteed that all of the data for a particular key will
appear in a single output file (e.g., part-), or is it possible
that the values could be split across multiple part- files?


Each key will be processed by exactly one reduce. All of the keys to  
each reduce will be sorted. The application can define a Partitioner  
that picks the reduce for each key. The default one uses  
key.hashCode() % numReduces, which is usually balanced. If your key  
had both a date and time and you wanted to have all of the  
transactions for a given day in the same reduce, you could do:


class MyKey {
 Date date;
 Time time;
}

and use a partitioner like:

public class MyPartitioner extends Partitioner<MyKey, MyValue> {
  public int getPartition(MyKey key, MyValue value,
                          int numReduceTasks) {
    return (key.date.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

Of course the risk is that you may have very unbalanced reduce  
sizes, depending on your data.



At the
end of the job I'm using the sequence file reader to read each custom
key/writable pair from each output file. Is it possible that the same
key could appear in multiple output files?


No.

-- Owen


Custom Writeables

2008-09-05 Thread Ryan LeCompte
Hello,

Can a custom Writable object used as a key/value contain other
Writables, like MapWritable?

Thanks,
Ryan
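
No reply is archived here, but the usual answer is yes: a Writable can delegate part of its serialization to nested Writables. A rough sketch (field names are illustrative):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Writable;

// Rough sketch: a custom Writable that nests a MapWritable by calling its
// write()/readFields() in the same order on both sides.
public class CompositeWritable implements Writable {
  private long count;
  private MapWritable extras = new MapWritable();

  public void write(DataOutput out) throws IOException {
    out.writeLong(count);
    extras.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    count = in.readLong();
    extras.readFields(in);
  }
}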


Hadoop + Elastic Block Stores

2008-09-05 Thread Ryan LeCompte
Hello,

I was wondering if anyone has gotten far at all with getting Hadoop up
and running with EC2 + EBS? Any luck getting this to work in a way
that the HDFS runs on the EBS so that it isn't blown away every time
you bring up/down the EC2 Hadoop cluster? I'd like to experiment with
this next, and was curious if anyone had any luck. :)

Thanks!

Ryan


Re: Hadoop + Elastic Block Stores

2008-09-05 Thread Ryan LeCompte
Good to know that you got it up and running. I'd really love to one day
see some scripts under src/contrib/ec2/bin that can set up and mount
the EBS volumes automatically. :-)




On Sep 5, 2008, at 11:38 PM, Jean-Daniel Cryans  
[EMAIL PROTECTED] wrote:



Ryan,

I currently have a Hadoop/HBase setup that uses EBS. It works, but using EBS
implies an additional overhead of configuration (too bad you can't spawn
instances with volumes already attached to them, though I'm sure that'll come).

Shutting down instances and bringing others up also requires more
micro-management, but I think Tom White wrote about it and there was a link
to it in another discussion you were part of.

Hope this helps,

J-D

On Fri, Sep 5, 2008 at 7:00 PM, Ryan LeCompte [EMAIL PROTECTED]  
wrote:



Hello,

I was wondering if anyone has gotten far at all with getting Hadoop up
and running with EC2 + EBS? Any luck getting this to work in a way
that the HDFS runs on the EBS so that it isn't blown away every time
you bring up/down the EC2 Hadoop cluster? I'd like to experiment with
this next, and was curious if anyone had any luck. :)

Thanks!

Ryan



Re: Hadoop EC2

2008-09-04 Thread Ryan LeCompte
Hi Tom,

This clears up my questions.

Thanks!

Ryan



On Thu, Sep 4, 2008 at 9:21 AM, Tom White [EMAIL PROTECTED] wrote:
 On Thu, Sep 4, 2008 at 1:46 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 I'm noticing that using bin/hadoop fs -put ... s3://... is uploading
 multi-gigabyte files in ~64MB chunks.

 That's because S3Filesystem stores files as 64MB blocks on S3.

 Then, when this is copied from
 S3 into HDFS using bin/hadoop distcp. Once the files are there and the
 job begins, it looks like it's breaking up the 4 multigigabyte text
 files into about 225 maps. Does this mean that each map is roughly
 processing 64MB of data each?

 Yes, HDFS stores files as 64MB blocks too, and map input is split by
 default so each map processes one block.

If so, is there any way to change this
 so that I can get my map tasks to process more data at a time? I'm
 curious if this will shorten the time it takes to run the program.

 You could try increasing the HDFS block size. 128MB is actually
 usually a better value, for this very reason.

 In the future https://issues.apache.org/jira/browse/HADOOP-2560 will
 help here too.
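
A hedged sketch of the block-size suggestion above: the larger block size has to be in effect when the files are written, since dfs.block.size is a client-side default picked up when files are created.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Rough sketch: write new files with 128MB blocks instead of the 64MB default.
public class BiggerBlocks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setLong("dfs.block.size", 128L * 1024 * 1024);
    FileSystem fs = FileSystem.get(conf);
    // files created through this FileSystem (or by jobs/distcp run with the
    // same setting) will use the larger block size
  }
}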


 Tom, in your article about Hadoop + EC2 you mention processing about
 100GB of logs in under 6 minutes or so.

 In this article:
 http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873,
 it took 35 minutes to run the job. I'm planning on doing some
 benchmarking on EC2 fairly soon, which should help us improve the
 performance of Hadoop on EC2. It's worth remarking that this was
 running on small instances. The larger instances perform a lot better
 in my experience.

 Do you remember how many EC2
 instances you had running, and also how many map tasks did you have to
 operate on the 100GB? Was each map task handling about 1GB each?

 I was running 20 nodes, and each map task was handling a HDFS block, 64MB.

 Hope this helps,

 Tom



Re: EC2 AMI for Hadoop 0.18.0

2008-09-04 Thread Ryan LeCompte
Works great!

My only suggestion would be to modify the
/usr/local/hadoop-0.18.0/conf/hadoop-site.xml file to use hdfs://...
for the namenode address. Otherwise I constantly get warnings saying
that the syntax is deprecated any time I submit a job for execution or
interact with HDFS via bin/hadoop fs -ls ...

Thanks,
Ryan


On Thu, Sep 4, 2008 at 10:25 AM, Stuart Sierra [EMAIL PROTECTED] wrote:
 Thanks, Tom!
 -Stuart

 On Wed, Sep 3, 2008 at 12:43 PM, Tom White [EMAIL PROTECTED] wrote:
 I've just created public AMIs for 0.18.0. Note that they are in the
 hadoop-images bucket.

 Tom

 On Fri, Aug 29, 2008 at 9:22 PM, Karl Anderson [EMAIL PROTECTED] wrote:

 On 29-Aug-08, at 6:49 AM, Stuart Sierra wrote:

 Anybody have one?  Any success building it with create-hadoop-image?
 Thanks,
 -Stuart

 I was able to build one following the instructions in the wiki.  You'll need
 to find the Java download url (see wiki) and put it and your own S3 bucket
 name in hadoop-ec2-env.sh.







Re: Hadoop EC2

2008-09-03 Thread Ryan LeCompte
Tom,

I noticed that you mentioned using Amazon's new elastic block store as
an alternative to using S3. Right now I'm testing pushing data to S3,
then moving it from S3 into HDFS once the Hadoop cluster is up and
running in EC2. It works pretty well -- moving data from S3 to HDFS is
fast when the data in S3 is broken up into multiple files, since
bin/hadoop distcp uses a Map/Reduce job to efficiently transfer the
data.

Are there any real advantages to using the new elastic block store? Is
moving data from the elastic block store into HDFS any faster than
doing it from S3? Or can HDFS essentially live inside of the elastic
block store?

Thanks!

Ryan


On Wed, Sep 3, 2008 at 9:54 AM, Tom White [EMAIL PROTECTED] wrote:
 There's a case study with some numbers in it from a presentation I
 gave on Hadoop and AWS in London last month, which you may find
 interesting: http://skillsmatter.com/custom/presentations/ec2-talk.pdf.

 tim robertson [EMAIL PROTECTED] wrote:
 For these small
 datasets, you might find it useful - let me know if I should spend
 time finishing it (Or submit help?) - it is really very simple.

 This sounds very useful. Please consider creating a Jira and
 submitting the code (even if it's not finished folks might like to
 see it). Thanks.

 Tom


 Cheers

 Tim



 On Tue, Sep 2, 2008 at 2:22 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hi Tim,

 Are you mostly just processing/parsing textual log files? How many
 maps/reduces did you configure in your hadoop-ec2-env.sh file? How
 many did you configure in your JobConf? Just trying to get an idea of
 what to expect in terms of performance. I'm noticing that it takes
 about 16 minutes to transfer about 15GB of textual uncompressed data
 from S3 into HDFS after the cluster has started with 15 nodes. I was
 expecting this to take a shorter amount of time, but maybe I'm
 incorrect in my assumptions. I am also noticing that it takes about 15
 minutes to parse through the 15GB of data with a 15 node cluster.

 Thanks,
 Ryan


 On Tue, Sep 2, 2008 at 3:29 AM, tim robertson [EMAIL PROTECTED] wrote:
 I have been processing only 100s GBs on EC2, not 1000's and using 20
 nodes and really only in exploration and testing phase right now.


 On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock [EMAIL PROTECTED] wrote:
 Hi Ryan,

 Just a heads up, if you require more than the 20 node limit, Amazon
 provides a form to request a higher limit:

 http://www.amazon.com/gp/html-forms-controller/ec2-request

 Andrew

 On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello all,

 I'm curious to see how many people are using EC2 to execute their
 Hadoop cluster and map/reduce programs, and how many are using
 home-grown datacenters. It seems like the 20 node limit with EC2 is a
 bit crippling when one wants to process many gigabytes of data. Has
 anyone found this to be the case? How much data are people processing
 with their 20 node limit on EC2? Curious what the thoughts are...

 Thanks,
 Ryan








Re: Hadoop EC2

2008-09-02 Thread Ryan LeCompte
Hi Tim,

Are you mostly just processing/parsing textual log files? How many
maps/reduces did you configure in your hadoop-ec2-env.sh file? How
many did you configure in your JobConf? Just trying to get an idea of
what to expect in terms of performance. I'm noticing that it takes
about 16 minutes to transfer about 15GB of textual uncompressed data
from S3 into HDFS after the cluster has started with 15 nodes. I was
expecting this to take a shorter amount of time, but maybe I'm
incorrect in my assumptions. I am also noticing that it takes about 15
minutes to parse through the 15GB of data with a 15 node cluster.

Thanks,
Ryan


On Tue, Sep 2, 2008 at 3:29 AM, tim robertson [EMAIL PROTECTED] wrote:
 I have been processing only 100s GBs on EC2, not 1000's and using 20
 nodes and really only in exploration and testing phase right now.


 On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock [EMAIL PROTECTED] wrote:
 Hi Ryan,

 Just a heads up, if you require more than the 20 node limit, Amazon
 provides a form to request a higher limit:

 http://www.amazon.com/gp/html-forms-controller/ec2-request

 Andrew

 On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello all,

 I'm curious to see how many people are using EC2 to execute their
 Hadoop cluster and map/reduce programs, and how many are using
 home-grown datacenters. It seems like the 20 node limit with EC2 is a
 bit crippling when one wants to process many gigabytes of data. Has
 anyone found this to be the case? How much data are people processing
 with their 20 node limit on EC2? Curious what the thoughts are...

 Thanks,
 Ryan





Re: Hadoop EC2

2008-09-02 Thread Ryan LeCompte
Hi Tim,

Thanks for responding -- I believe that I'll need the full power of
Hadoop since I'll want this to scale well beyond 100GB of data. Thanks
for sharing your experiences -- I'll definitely check out your blog.

Thanks!

Ryan


On Tue, Sep 2, 2008 at 8:47 AM, tim robertson [EMAIL PROTECTED] wrote:
 Hi Ryan,

 I actually blogged my experience as it was my first usage of EC2:
 http://biodivertido.blogspot.com/2008/06/hadoop-on-amazon-ec2-to-generate.html

 My input data was not log files but actually a dump of 150 million
 records from MySQL into about 13 columns of tab-delimited data, I believe.
 It was a couple of months ago, but I remember thinking S3 was very slow...

 I ran some simple operations like distinct values of one column based
 on another (species within a cell) and also did some polygon analysis,
 since asking "is this point in this polygon" does not really scale too
 well in PostGIS.

 Incidentally, I have most of the basics of a MapReduce-Lite which I
 aim to port to use the exact Hadoop API since I am *only* working on
 10's-100's GB of data and find that it is running really fine on my
 laptop and I don't need the distributed failover.  My goal for that
 code is for people like me who want to know that I can scale to
 terabyte processing, but don't need to take the plunge to full Hadoop
 deployment yet, but will know that I can migrate the processing in the
 future as  things grow.  It runs on the normal filesystem, and single
 node only (e.g. multithreaded), and performs very quickly since it is
 just doing java NIO bytebuffers in parallel on the underlying
 filesystem - on my laptop I Map+Sort+Combine about 130,000 jobs a
 second (simplest of simple map operations).  For these small
 datasets, you might find it useful - let me know if I should spend
 time finishing it (Or submit help?) - it is really very simple.

 Cheers

 Tim



 On Tue, Sep 2, 2008 at 2:22 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hi Tim,

 Are you mostly just processing/parsing textual log files? How many
 maps/reduces did you configure in your hadoop-ec2-env.sh file? How
 many did you configure in your JobConf? Just trying to get an idea of
 what to expect in terms of performance. I'm noticing that it takes
 about 16 minutes to transfer about 15GB of textual uncompressed data
 from S3 into HDFS after the cluster has started with 15 nodes. I was
 expecting this to take a shorter amount of time, but maybe I'm
 incorrect in my assumptions. I am also noticing that it takes about 15
 minutes to parse through the 15GB of data with a 15 node cluster.

 Thanks,
 Ryan


 On Tue, Sep 2, 2008 at 3:29 AM, tim robertson [EMAIL PROTECTED] wrote:
 I have been processing only 100s GBs on EC2, not 1000's and using 20
 nodes and really only in exploration and testing phase right now.


 On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock [EMAIL PROTECTED] wrote:
 Hi Ryan,

 Just a heads up, if you require more than the 20 node limit, Amazon
 provides a form to request a higher limit:

 http://www.amazon.com/gp/html-forms-controller/ec2-request

 Andrew

 On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello all,

 I'm curious to see how many people are using EC2 to execute their
 Hadoop cluster and map/reduce programs, and how many are using
 home-grown datacenters. It seems like the 20 node limit with EC2 is a
 bit crippling when one wants to process many gigabytes of data. Has
 anyone found this to be the case? How much data are people processing
 with their 20 node limit on EC2? Curious what the thoughts are...

 Thanks,
 Ryan







Re: Error while uploading large file to S3 via Hadoop 0.18

2008-09-02 Thread Ryan LeCompte
Actually not if you're using the s3:// as opposed to s3n:// ...

Thanks,
Ryan


On Tue, Sep 2, 2008 at 11:21 AM, James Moore [EMAIL PROTECTED] wrote:
 On Mon, Sep 1, 2008 at 1:32 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello,

 I'm trying to upload a fairly large file (18GB or so) to my AWS S3
 account via bin/hadoop fs -put ... s3://...

 Isn't the maximum size of a file on s3 5GB?

 --
 James Moore | [EMAIL PROTECTED]
 Ruby and Ruby on Rails consulting
 blog.restphone.com



Re: Hadoop EC2

2008-09-02 Thread Ryan LeCompte
How can you ensure that the S3 buckets and EC2 instances belong to a
certain zone?

Ryan


On Tue, Sep 2, 2008 at 2:38 PM, Karl Anderson [EMAIL PROTECTED] wrote:

 On 2-Sep-08, at 5:22 AM, Ryan LeCompte wrote:

 Hi Tim,

 Are you mostly just processing/parsing textual log files? How many
 maps/reduces did you configure in your hadoop-ec2-env.sh file? How
 many did you configure in your JobConf? Just trying to get an idea of
 what to expect in terms of performance. I'm noticing that it takes
 about 16 minutes to transfer about 15GB of textual uncompressed data
 from S3 into HDFS after the cluster has started with 15 nodes. I was
 expecting this to take a shorter amount of time, but maybe I'm
 incorrect in my assumptions. I am also noticing that it takes about 15
 minutes to parse through the 15GB of data with a 15 node cluster.

 I'm seeing much faster speeds.  With 128 nodes running a mapper-only
 downloading job, downloading 30 GB takes roughly a minute, less time than
 the end-of-job work (which I assume is HDFS replication and bookkeeping).
 More mappers give you more parallel downloads, of course.  I'm using a
 Python REST client for S3, and only move data to or from S3 when Hadoop is
 done with it.

 Make sure your S3 buckets and EC2 instances are in the same zone.




Re: JVM Spawning

2008-09-02 Thread Ryan LeCompte
I see... so there really isn't a way for me to test a map/reduce
program using a single node without incurring the overhead of
starting and stopping JVMs... My input is broken up into 5 text files;
is there a way I could start the job such that it only uses 1 map to
process the whole thing? I guess I'd have to concatenate the files
into 1 file and somehow turn off splitting?

Ryan
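
A rough sketch of the "turn off splitting" idea (0.18-style API; the class name is illustrative): an input format that refuses to split, so each input file becomes exactly one map.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Rough sketch: disable splitting so each file is handled by a single map task.
public class WholeFileTextInputFormat extends TextInputFormat {
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}

It would then be registered on the job with conf.setInputFormat(WholeFileTextInputFormat.class).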


On Wed, Sep 3, 2008 at 12:09 AM, Owen O'Malley [EMAIL PROTECTED] wrote:

 On Sep 2, 2008, at 9:00 PM, Ryan LeCompte wrote:

 Beginner's question:

 If I have a cluster with a single node that has a max of 1 map/1
 reduce, and the job submitted has 50 maps... Then it will process only
 1 map at a time. Does that mean that it's spawning 1 new JVM for each
 map processed? Or re-using the same JVM when a new map can be
 processed?

 It creates a new JVM for each task. Devaraj is working on
 https://issues.apache.org/jira/browse/HADOOP-249
 which will allow the JVMs to run multiple tasks sequentially.

 -- Owen



Error while uploading large file to S3 via Hadoop 0.18

2008-09-01 Thread Ryan LeCompte
Hello,

I'm trying to upload a fairly large file (18GB or so) to my AWS S3
account via bin/hadoop fs -put ... s3://...

It copies for a good 15 or 20 minutes, and then eventually errors out
with a failed retry attempt (saying that it can't retry since it has
already written a certain number of bytes, etc. sorry don't have the
original error message at the moment). Has anyone experienced anything
similar? Can anyone suggest a workaround or a way to specify retries?
Should I use another tool for uploading large files to s3?

Thanks,
Ryan


Re: Error while uploading large file to S3 via Hadoop 0.18

2008-09-01 Thread Ryan LeCompte
Thanks, trying it now!

Ryan


On Mon, Sep 1, 2008 at 6:04 PM, Albert Chern [EMAIL PROTECTED] wrote:
 Increase the retry buffer size in jets3t.properties and maybe up the number
 of retries while you're at it.  If there is no template file included in
 Hadoop's conf dir you can find it at the jets3t web site.  Make sure that
 it's from the same version that your copy of Hadoop is using.

 On Mon, Sep 1, 2008 at 1:32 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:

 Hello,

 I'm trying to upload a fairly large file (18GB or so) to my AWS S3
 account via bin/hadoop fs -put ... s3://...

 It copies for a good 15 or 20 minutes, and then eventually errors out
 with a failed retry attempt (saying that it can't retry since it has
 already written a certain number of bytes, etc. sorry don't have the
 original error message at the moment). Has anyone experienced anything
 similar? Can anyone suggest a workaround or a way to specify retries?
 Should I use another tool for uploading large files to s3?

 Thanks,
 Ryan




Hadoop EC2

2008-09-01 Thread Ryan LeCompte
Hello all,

I'm curious to see how many people are using EC2 to execute their
Hadoop cluster and map/reduce programs, and how many are using
home-grown datacenters. It seems like the 20 node limit with EC2 is a
bit crippling when one wants to process many gigabytes of data. Has
anyone found this to be the case? How much data are people processing
with their 20 node limit on EC2? Curious what the thoughts are...

Thanks,
Ryan


Reduce hanging with custom value objects?

2008-08-30 Thread Ryan LeCompte
Hello all,

I'm new to Hadoop. I'm trying to write a small Hadoop map/reduce
program that, instead of reading/writing the primitive
LongWritable, IntWritable, etc. classes, uses a custom object that
I wrote that implements the Writable interface. I'm still using a
LongWritable for the keys, but using my CustomWritable for the values.
My input is still <LongWritable, Text> because I'm parsing a raw log
file. However, the output of the map is <LongWritable, CustomWritable>,
the input to reduce is <LongWritable, CustomWritable>, and the output
from reduce is <LongWritable, CustomWritable>.

I'm noticing that the map part processes fine; however, when it gets to
reduce it just hangs at 0%. I'm not seeing any useful output in the
logs either. Here's my job configuration:

  conf.setJobName("test");
  conf.setOutputKeyClass(LongWritable.class);
  conf.setOutputValueClass(CustomWritable.class);
  conf.setMapperClass(MapClass.class);
  conf.setCombinerClass(Reduce.class);
  conf.setReducerClass(Reduce.class);
  FileInputFormat.setInputPaths(conf, remainingArguments.get(0));
  FileOutputFormat.setOutputPath(conf, new Path(remainingArguments.get(1)));

Am I missing something here? Do I need to specify an output format via conf.setOutputFormat()?

Thanks,
Ryan
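
No resolution is archived here, but for a value class holding "a few primitives" the usual checklist is a public no-argument constructor (Hadoop instantiates the class reflectively) and a readFields() that consumes exactly the bytes write() produced, in the same order. A rough sketch (field names are illustrative):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Rough sketch of a well-formed custom value class.
public class CustomWritable implements Writable {
  private long total;
  private int count;

  public CustomWritable() {}                    // required: Hadoop creates instances reflectively

  public CustomWritable(long total, int count) {
    this.total = total;
    this.count = count;
  }

  public void write(DataOutput out) throws IOException {
    out.writeLong(total);
    out.writeInt(count);
  }

  public void readFields(DataInput in) throws IOException {
    total = in.readLong();                      // same order as write()
    count = in.readInt();
  }
}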