Re: Thinking about retrieving DFS metadata from datanodes!!!

2008-09-10 Thread 叶双明
Thanks Ari Rabkin!

1. I think the cost is very low: if the block size is 10 MB, 1 KB per block is
roughly 0.01% of the disk space.

2. Actually, if two racks are lost and replication <= 3, it seems we cannot
recover all of the data. But if we lose one of two racks and replication >= 2,
we can recover all of the data.

3. Suppose we recover 87.5% of the data. I am not sure whether a random 87.5%
of the data is useful for every user. But when most files are smaller than the
block size, we can recover that much data, and any recovered data may be
valuable to some users.

4. I guess most small companies or organizations have just one cluster with
10-100 nodes, and they cannot afford a second HDFS cluster in a different
place, or a SAN. This would be a simple way to improve data safety, and I
think they would be pleased to have it.

5. We could make it a configuration option: turn it on when someone needs it, and off otherwise.

Glad to discuss with you!


2008/9/11 Ariel Rabkin <[EMAIL PROTECTED]>

> I don't understand this use case.
>
> Suppose that you lose half the nodes in the cluster.  On average,
> 12.5% of your blocks were exclusively stored on the half the cluster
> that's dead.  For many (most?) applications, a random 87.5% of the
> data isn't really useful.  Storing metadata in more places would let
> you turn a dead cluster into a corrupt cluster, but not into a working
> one.   If you need to survive major disasters, you want a second HDFS
> cluster in a different place.
>
> The thing that might be useful to you, if you're worried about
> simultaneous namenode and secondary NN failure, is to store the edit
> log and fsimage on a SAN, and get fault tolerance that way.
>
> --Ari
>
> On Tue, Sep 9, 2008 at 6:38 PM, 叶双明 <[EMAIL PROTECTED]> wrote:
> > Thanks for paying attention  to my tentative idea!
> >
> > What I thought isn't how to store the meradata, but the final (or last)
> way
> > to recover valuable data in the cluster when something worst (which
> destroy
> > the metadata in all multiple NameNode) happen. i.e. terrorist attack  or
> > natural disasters destroy half of cluster nodes within all NameNode, we
> can
> > recover as much data as possible by this mechanism, and hava big chance
> to
> > recover entire data of cluster because fo original replication.
> >
> > Any suggestion is appreciate!
> >
> > 2008/9/10 Pete Wyckoff <[EMAIL PROTECTED]>
> >
> >> +1 -
> >>
> >> from the perspective of the data nodes, dfs is just a block-level store
> and
> >> is thus much more robust and scalable.
> >>
> >>
> >>
> >> On 9/9/08 9:14 AM, "Owen O'Malley" <[EMAIL PROTECTED]> wrote:
> >>
> >> > This isn't a very stable direction. You really don't want multiple
> >> distinct
> >> > methods for storing the metadata, because discrepancies are very bad.
> >> High
> >> > Availability (HA) is a very important medium term goal for HDFS, but
> it
> >> will
> >> > likely be done using multiple NameNodes and ZooKeeper.
> >> >
> >> > -- Owen
> >>
>
> --
> Ari Rabkin [EMAIL PROTECTED]
> UC Berkeley Computer Science Department
>



-- 
Sorry for my english!!  明
Please help me to correct my english expression and error in syntax


Re: Ruby hadoop client alpha available

2008-09-10 Thread David Richards
Very cool!!  This saves me a LOT of time on my Tegu project.  I'll be  
reviewing this tonight.


Thanks James!

David


On Sep 10, 2008, at 7:28 PM, James Moore wrote:


I'm looking for some feedback on my new JRuby interface for Hadoop
(currently called "radoop").  It's definitely in alpha state - it's
probably most interesting for people who have already tried messing
around with JRuby/Hadoop and would like a nicer interface.

I've put the source for radoop up on github at

http://github.com/banshee/radoop/tree/master

(and you can pull from git://github.com/banshee/radoop.git)

Documentation is limited - working on that.

The gem should be there shortly - there's a gemspec file, but it looks
like github builds gems in batches and it may not be there yet.

To use, you subclass Radoop like this:


require 'rubygems'
require 'radoop'

class WordCount < Radoop
  include_package "org.apache.hadoop.io"

  output_key_class Text
  output_value_class IntWritable

  def map(k, v, output, reporter)
    values = v.to_s.split(/\W+/)
    values.each do |v|
      output.collect(v.to_hadoop_text, 1.to_int_writable)
    end
  end
end


And then run the radoop command (here with --verbose turned on).  No
compiling, no building jars, it should just feel something like
running a normal Ruby script.

Options are things like:

--radoop_file word_count.rb # The name of the file containing the
Radoop subclass
--radoop_class WordCount # The name of the Radoop subclass
--output_path /tmp/j1
--input_path /user/james/tmp/tale.txt,/user/james/tmp/tale1.txt -v

Radoop handles zipping up your jruby install directory, your gem
directories, and your ruby files, and puts them on the machines
running tasks using the DistributedCache mechanism.


James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com




Re: Could not obtain block: blk_-2634319951074439134_1129 file=/user/root/crawl_debug/segments/20080825053518/content/part-00002/data

2008-09-10 Thread Raghu Angadi

Thanks Stefan.

What you are seeing is fixed in HADOOP-3232. It is different from the main
problems reported in this thread. Please try 0.18.1 and see how it works.


Raghu.

Stefan Will wrote:

I'll add a comment to Jira. I haven't tried the latest version of the patch
yet, but since it only changes the dfs client, not the datanode, I don't see
how it would help with this.

Two more things I noticed that happen when the datanodes become unresponsive
(i.e. the "Last Contact" field on the namenode keeps increasing):

1. The datanode process seems to be completely hung for a while, including
its Jetty web interface, sometimes for over 10 minutes.

2. The task tracker on the same machine keeps humming along, sending regular
heartbeats.

To me this looks like there is some sort of temporary deadlock in the
datanode that keeps it from responding to requests. Perhaps it's the block
report being generated?

-- Stefan



Re: Ruby hadoop client alpha available

2008-09-10 Thread James Moore
Meant to include the command to actually execute that word count example:

radoop --radoop_file word_count.rb --radoop_class WordCount
--output_path /tmp/j1 --input_path
/user/james/tmp/tale.txt,/user/james/tmp/tale1.txt -v

Where there are files on DFS like this:

-rw-r--r--   1 james supergroup 87 2008-09-07 10:38
/user/james/tmp/tale.txt
-rw-r--r--   1 james supergroup 87 2008-09-07 10:48
/user/james/tmp/tale1.txt

-- 
James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com


Ruby hadoop client alpha available

2008-09-10 Thread James Moore
I'm looking for some feedback on my new JRuby interface for Hadoop
(currently called "radoop").  It's definitely in alpha state - it's
probably most interesting for people who have already tried messing
around with JRuby/Hadoop and would like a nicer interface.

I've put the source for radoop up on github at

http://github.com/banshee/radoop/tree/master

(and you can pull from git://github.com/banshee/radoop.git)

Documentation is limited - working on that.

The gem should be there shortly - there's a gemspec file, but it looks
like github builds gems in batches and it may not be there yet.

To use, you subclass Radoop like this:


require 'rubygems'
require 'radoop'

class WordCount < Radoop
  include_package "org.apache.hadoop.io"

  output_key_class Text
  output_value_class IntWritable

  def map(k, v, output, reporter)
    values = v.to_s.split(/\W+/)
    values.each do |v|
      output.collect(v.to_hadoop_text, 1.to_int_writable)
    end
  end
end


And then run the radoop command (here with --verbose turned on).  No
compiling, no building jars, it should just feel something like
running a normal Ruby script.

Options are things like:

--radoop_file word_count.rb # The name of the file containing the
Radoop subclass
--radoop_class WordCount # The name of the Radoop subclass
--output_path /tmp/j1
--input_path /user/james/tmp/tale.txt,/user/james/tmp/tale1.txt -v

Radoop handles zipping up your jruby install directory, your gem
directories, and your ruby files, and puts them on the machines
running tasks using the DistributedCache mechanism.
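
For anyone who hasn't used that mechanism, roughly the calls involved look
like this. A minimal Java sketch against the 0.17/0.18
org.apache.hadoop.filecache API; it is not Radoop's actual code, and the
archive path is made up:

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class CacheSketch {
  // Driver side: register an archive so the framework ships it to every
  // node that runs a task and unpacks it locally.
  public static void shipBundle(JobConf conf) {
    DistributedCache.addCacheArchive(
        URI.create("/user/james/radoop-bundle.zip"), conf);
  }

  // Task side (e.g. in a Mapper's configure()): find where the archive
  // was unpacked on the local disk.
  public static void locateBundle(JobConf conf) throws IOException {
    Path[] archives = DistributedCache.getLocalCacheArchives(conf);
    for (Path p : archives) {
      System.out.println("cached archive unpacked at: " + p);
    }
  }
}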


James Moore | [EMAIL PROTECTED]
Ruby and Ruby on Rails consulting
blog.restphone.com


Re: Could not obtain block: blk_-2634319951074439134_1129 file=/user/root/crawl_debug/segments/20080825053518/content/part-00002/data

2008-09-10 Thread Stefan Will
I'll add a comment to Jira. I haven't tried the latest version of the patch
yet, but since it only changes the dfs client, not the datanode, I don't see
how it would help with this.

Two more things I noticed that happen when the datanodes become unresponsive
(i.e. the "Last Contact" field on the namenode keeps increasing):

1. The datanode process seems to be completely hung for a while, including
its Jetty web interface, sometimes for over 10 minutes.

2. The task tracker on the same machine keeps humming along, sending regular
heartbeats.

To me this looks like there is some sort of temporary deadlock in the
datanode that keeps it from responding to requests. Perhaps it's the block
report being generated?

-- Stefan

> From: Raghu Angadi <[EMAIL PROTECTED]>
> Reply-To: 
> Date: Tue, 09 Sep 2008 16:40:02 -0700
> To: 
> Subject: Re: Could not obtain block: blk_-2634319951074439134_1129
> file=/user/root/crawl_debug/segments/20080825053518/content/part-2/data
> 
> Espen Amble Kolstad wrote:
>> There's a JIRA on this already:
>> https://issues.apache.org/jira/browse/HADOOP-3831
>> Setting dfs.datanode.socket.write.timeout=0 in hadoop-site.xml seems
>> to do the trick for now.
> 
> Please comment on HADOOP-3831 that you are seeing this error.. so that
> it gets committed. Did you try the patch for HADOOP-3831?
> 
> thanks,
> Raghu.
> 
>> Espen
>> 
>> On Mon, Sep 8, 2008 at 11:24 AM, Espen Amble Kolstad <[EMAIL PROTECTED]> 
>> wrote:
>>> Hi,
>>> 
>>> Thanks for the tip!
>>> 
>>> I tried revision 692572 of the 0.18 branch, but I still get the same errors.
>>> 
>>> On Sunday 07 September 2008 09:42:43 Dhruba Borthakur wrote:
 The DFS errors might have been caused by
 
 http://issues.apache.org/jira/browse/HADOOP-4040
 
 thanks,
 dhruba
 
 On Sat, Sep 6, 2008 at 6:59 AM, Devaraj Das <[EMAIL PROTECTED]> wrote:
> These exceptions are apparently coming from the dfs side of things. Could
> someone from the dfs side please look at these?
> 
> On 9/5/08 3:04 PM, "Espen Amble Kolstad" <[EMAIL PROTECTED]> wrote:
>> Hi,
>> 
>> Thanks!
>> The patch applies without change to hadoop-0.18.0, and should be
>> included in a 0.18.1.
>> 
>> However, I'm still seeing:
>> in hadoop.log:
>> 2008-09-05 11:13:54,805 WARN  dfs.DFSClient - Exception while reading
>> from blk_3428404120239503595_2664 of
>> /user/trank/segments/20080905102650/crawl_generate/part-00010 from
>> somehost:50010: java.io.IOException: Premeture EOF from in
>> putStream
>> 
>> in datanode.log:
>> 2008-09-05 11:15:09,554 WARN  dfs.DataNode -
>> DatanodeRegistration(somehost:50010,
>> storageID=DS-751763840-somehost-50010-1219931304453, infoPort=50075,
>> ipcPort=50020):Got exception while serving
>> blk_-4682098638573619471_2662 to
>> /somehost:
>> java.net.SocketTimeoutException: 48 millis timeout while waiting
>> for channel to be ready for write. ch :
>> java.nio.channels.SocketChannel[connected local=/somehost:50010
>> remote=/somehost:45244]
>> 
>> These entries in datanode.log happens a few minutes apart repeatedly.
>> I've reduced # map-tasks so load on this node is below 1.0 with 5GB of
>> free memory (so it's not resource starvation).
>> 
>> Espen
>> 
>> On Thu, Sep 4, 2008 at 3:33 PM, Devaraj Das <[EMAIL PROTECTED]> wrote:
 I started a profile of the reduce-task. I've attached the profiling
 output. It seems from the samples that ramManager.waitForDataToMerge()
 doesn't actually wait.
 Has anybody seen this behavior.
>>> This has been fixed in HADOOP-3940
>>> 
>>> On 9/4/08 6:36 PM, "Espen Amble Kolstad" <[EMAIL PROTECTED]> wrote:
 I have the same problem on our cluster.
 
 It seems the reducer-tasks are using all cpu, long before there's
 anything to
 shuffle.
 
 I started a profile of the reduce-task. I've attached the profiling
 output. It seems from the samples that ramManager.waitForDataToMerge()
 doesn't actually wait.
 Has anybody seen this behavior.
 
 Espen
 
 On Thursday 28 August 2008 06:11:42 wangxu wrote:
> Hi,all
> I am using hadoop-0.18.0-core.jar and nutch-2008-08-18_04-01-55.jar,
> and running hadoop on one namenode and 4 slaves.
> attached is my hadoop-site.xml, and I didn't change the file
> hadoop-default.xml
> 
> when data in segments are large,this kind of errors occure:
> 
> java.io.IOException: Could not obtain block:
> blk_-2634319951074439134_1129
> file=/user/root/crawl_debug/segments/20080825053518/content/part-
> 2/data at
> org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClie
> nt.jav a:1462) at
> org.apache.hadoop.dfs.DFSClient$DFSInputStream

Re: installing hadoop on an OS X cluster

2008-09-10 Thread Sandy
Thanks for the swift response.

I have 4 disk drives (please see the specs), so I'm not sure if the hard disk
will still be a bottleneck. Would you agree?

I think we are dealing with data-intensive jobs... my input data can be as
large as a few gigabytes (though theoretically it could be larger). I
understand that in comparison to what some people run this may seem small.
I tried running something on my old machine, and it took several hours to
complete the reduce in the first map-reduce phase before running out of
memory (and this was after I increased the heap size).

I'm trying to increase the max heap size on this machine in hadoop-env.sh
past 2000, but Hadoop gives me errors. Is this normal? I'm running
hadoop-0.17.2. Is there anywhere else I need to specify a heap increase?

Lastly, I think one more modification I will need to make is increasing the
maximum number of map/reduce tasks to 8 (one per core). I made that change
in hadoop-site.xml by adding an additional property:


<property>
  <name>mapred.tasktracker.tasks.maximum</name>
  <value>8</value>
  <description>The maximum number of tasks that will be run simultaneously by
  a task tracker.</description>
</property>

I don't see a mapred-default.xml file in the conf folder. I'm guessing this
was removed in later versions? Is there anywhere else I would need to
specify an increase in map and reduce tasks aside from

JobConf.setNumMapTasks and JobConf.setNumReduceTasks?
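
For what it's worth, here is a minimal sketch of the job-side knobs in Java,
assuming the 0.17-era mapred API. The tasktracker-side maximum above is a
daemon setting read from the cluster's config (the exact property names have
changed between releases, so check the hadoop-default.xml that ships with
your version), while the per-job task counts and the child JVM heap go
through JobConf; the values below are only examples:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class JobTuningSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(JobTuningSketch.class);

    // ... set mapper, reducer, input and output paths here ...

    // Job-level hints for the number of tasks. The map count is only a
    // hint; the framework may still adjust it based on the input splits.
    conf.setNumMapTasks(8);
    conf.setNumReduceTasks(8);

    // Heap for each child task JVM. This is separate from HADOOP_HEAPSIZE
    // in hadoop-env.sh, which sizes the daemons, not the tasks.
    conf.set("mapred.child.java.opts", "-Xmx2048m");

    JobClient.runJob(conf);
  }
}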

Thanks again for your time.

-SM
PS - I'm going to update the wiki with installation instructions for OS X as
soon as I get everything finished up :-)



On Wed, Sep 10, 2008 at 5:23 PM, Jim Twensky <[EMAIL PROTECTED]> wrote:

> Apparently you have one node with 2 processors where each processor has 4
> cores. What do you want to use Hadoop for? If you have a single disk drive
> and multiple cores on one node then pseudo distributed environment seems
> like the best approach to me as long as you are not dealing with large
> amounts of data. If you have a single disk drive and huge amount of data to
> process, then the disk drive might be a bottleneck for your applications.
> Hadoop is usually used for data intensive applications whereas your
> hardware
> seems more like to be designed for cpu intensive job considering 8 cores on
> a single node.
>
> Tim
>
> On Wed, Sep 10, 2008 at 4:59 PM, Sandy <[EMAIL PROTECTED]> wrote:
>
> > I am starting an install of hadoop on a new cluster. However, I am a
> little
> > confused what set of instructions I should follow, having only installed
> > and
> > played around with hadoop on a single node ubuntu box with 2 cores (on a
> > single board) and 2 GB of RAM.
> > The new machine has 2 internal nodes, each with 4 cores. I would like to
> > run
> > Hadoop to run in a distributed context over these 8 cores. One of my
> > biggest
> > issues is the definition of the word "node". From the Hadoop wiki and
> > documentation, it seems that "node" means "machine", and not a board. So,
> > by
> > this definition, our cluster is really one "node". Is this correct?
> >
> > If this is the case, then I shouldn't be using the "cluster setup"
> > instructions, located here:
> > http://hadoop.apache.org/core/docs/r0.17.2/cluster_setup.html
> >
> > But this one:
> > http://hadoop.apache.org/core/docs/r0.17.2/quickstart.html
> >
> > Which is what I've been doing. But what should the operation be? I don't
> > think it should be standalone. Should it be Psuedo-distributed? If so,
> how
> > can I guarantee that it will be spread over all the 8 processors? What is
> > necessary for the hadoop-site.xml file?
> >
> > Here are the specs of the machine.
> >-Mac Pro RAID Card  065-7214
> >-Two 3.0GHz Quad-Core Intel Xeon (8-core)   065-7534
> >
> >-16GB RAM (4 x 4GB) 065-7179
> >-1TB 7200-rpm Serial ATA 3Gb/s  065-7544
> >
> >-1TB 7200-rpm Serial ATA 3Gb/s  065-7546
> >
> >-1TB 7200-rpm Serial ATA 3Gb/s  065-7193
> >
> >-1TB 7200-rpm Serial ATA 3Gb/s  065-7548
> >
> >
> > Could someone please point me to the correct mode of
> operation/instructions
> > to install things correctly on this machine? I found some information how
> > to
> > install on a OS X machine in the archives, but they are a touch outdated
> > and
> > seems to be missing some things.
> >
> > Thank you very much for you time.
> >
> > -SM
> >
>


Re: Hadoop (0.18) Spill Failed, out of Heap Space error

2008-09-10 Thread Chris Douglas
It should be in the task logs for the maps. You're not seeing "bufstart",
"bufvoid", etc.? The logging is at INFO level, so if you've set your log
level higher than that, you won't see these messages. Are you seeing any
logging from MapTask? How large are your serialized values? -C


On Sep 10, 2008, at 1:50 PM, Florian Leibert wrote:


Hi Chris,
where do you find those values? I don't seem to see them in the
userlogs or the tasktracker logs...


Thanks!

Florian
On Sep 3, 2008, at 2:50 PM, Chris Douglas wrote:

InMemValBytes::reset can perform an allocation, but it should be  
only as large as the value. When you look at the log for the failed  
task, what does it report as the values of bufstart, bufend,  
kvstart, kvend, etc. before the spill? -C


On Sep 3, 2008, at 9:49 AM, Paco NATHAN wrote:


Also, that almost always happens early in the map phase of the first
MR job which runs on our cluster.

Hadoop 0.18.1 on EC2 m1.xl instances.

We run 10 MR jobs in sequence, 6hr duration, not seeing the problem
repeated after that 1 heap space exception.

Paco


On Wed, Sep 3, 2008 at 11:42 AM, Florian Leibert <[EMAIL PROTECTED]>  
wrote:

Hi,
we're running 100 XLarge instances (ec2), with a gig of heap  
space for each
task - and are seeing the following error frequently (but not  
always):

# BEGIN PASTE #
 [exec] 08/09/03 11:21:09 INFO mapred.JobClient:  map 43% reduce 5%
 [exec] 08/09/03 11:21:16 INFO mapred.JobClient: Task Id : attempt_200809031101_0001_m_000220_0, Status : FAILED
 [exec] java.io.IOException: Spill failed
 [exec]     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:688)
 [exec]     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:228)
 [exec]     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
 [exec] Caused by: java.lang.OutOfMemoryError: Java heap space
 [exec]     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$InMemValBytes.reset(MapTask.java:928)
 [exec]     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.getVBytesForOffset(MapTask.java:891)
 [exec]     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:765)
 [exec]     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1600(MapTask.java:286)
 [exec]     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:712)
# END #

Has anyone seen this? Thanks,

Florian Leibert
Sr. Software Engineer
Adknowledge Inc.









Re: installing hadoop on an OS X cluster

2008-09-10 Thread Jim Twensky
Apparently you have one node with 2 processors, where each processor has 4
cores. What do you want to use Hadoop for? If you have a single disk drive
and multiple cores on one node, then a pseudo-distributed setup seems like
the best approach to me, as long as you are not dealing with large amounts
of data. If you have a single disk drive and a huge amount of data to
process, then the disk drive might become a bottleneck for your
applications. Hadoop is usually used for data-intensive applications,
whereas your hardware seems more like it is designed for CPU-intensive
jobs, considering the 8 cores on a single node.

Tim

On Wed, Sep 10, 2008 at 4:59 PM, Sandy <[EMAIL PROTECTED]> wrote:

> I am starting an install of hadoop on a new cluster. However, I am a little
> confused what set of instructions I should follow, having only installed
> and
> played around with hadoop on a single node ubuntu box with 2 cores (on a
> single board) and 2 GB of RAM.
> The new machine has 2 internal nodes, each with 4 cores. I would like to
> run
> Hadoop to run in a distributed context over these 8 cores. One of my
> biggest
> issues is the definition of the word "node". From the Hadoop wiki and
> documentation, it seems that "node" means "machine", and not a board. So,
> by
> this definition, our cluster is really one "node". Is this correct?
>
> If this is the case, then I shouldn't be using the "cluster setup"
> instructions, located here:
> http://hadoop.apache.org/core/docs/r0.17.2/cluster_setup.html
>
> But this one:
> http://hadoop.apache.org/core/docs/r0.17.2/quickstart.html
>
> Which is what I've been doing. But what should the operation be? I don't
> think it should be standalone. Should it be Psuedo-distributed? If so, how
> can I guarantee that it will be spread over all the 8 processors? What is
> necessary for the hadoop-site.xml file?
>
> Here are the specs of the machine.
>-Mac Pro RAID Card  065-7214
>-Two 3.0GHz Quad-Core Intel Xeon (8-core)   065-7534
>
>-16GB RAM (4 x 4GB) 065-7179
>-1TB 7200-rpm Serial ATA 3Gb/s  065-7544
>
>-1TB 7200-rpm Serial ATA 3Gb/s  065-7546
>
>-1TB 7200-rpm Serial ATA 3Gb/s  065-7193
>
>-1TB 7200-rpm Serial ATA 3Gb/s  065-7548
>
>
> Could someone please point me to the correct mode of operation/instructions
> to install things correctly on this machine? I found some information how
> to
> install on a OS X machine in the archives, but they are a touch outdated
> and
> seems to be missing some things.
>
> Thank you very much for you time.
>
> -SM
>


Re: specifying number of nodes for job

2008-09-10 Thread Sandy
Thank you very much for your response :)
-Suzanne

On Tue, Sep 9, 2008 at 12:13 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote:

> On Mon, Sep 8, 2008 at 4:26 PM, Sandy <[EMAIL PROTECTED]> wrote:
>
> > In all seriousness though, why is this not possible? Is there something
> > about the MapReduce model of parallel computation that I am not
> > understanding? Or this more of an arbitrary implementation choice made by
> > the Hadoop framework? If so, I am curious why this is the case. What are
> > the
> > benefits?
>
>
> It is possible to do with changes to Hadoop. There was a jira filed for it,
> but I don't think anyone has worked on it. (HADOOP-2573) For Map/Reduce it
> is a design goal that number of tasks not nodes are the important metric.
> You want a job to be able to run with any given cluster size. For
> scalability testing, you could just remove task trackers...
>
> -- Owen
>


installing hadoop on an OS X cluster

2008-09-10 Thread Sandy
I am starting an install of Hadoop on a new cluster. However, I am a little
confused about which set of instructions I should follow, having only
installed and played around with Hadoop on a single-node Ubuntu box with 2
cores (on a single board) and 2 GB of RAM.

The new machine has 2 internal nodes, each with 4 cores. I would like Hadoop
to run in a distributed context over these 8 cores. One of my biggest issues
is the definition of the word "node". From the Hadoop wiki and
documentation, it seems that "node" means "machine", and not a board. So, by
this definition, our cluster is really one "node". Is this correct?

If this is the case, then I shouldn't be using the "cluster setup"
instructions, located here:
http://hadoop.apache.org/core/docs/r0.17.2/cluster_setup.html

But this one:
http://hadoop.apache.org/core/docs/r0.17.2/quickstart.html

Which is what I've been doing. But what should the mode of operation be? I
don't think it should be standalone. Should it be pseudo-distributed? If so,
how can I guarantee that it will be spread across all 8 cores? What is
necessary for the hadoop-site.xml file?

Here are the specs of the machine.
-Mac Pro RAID Card  065-7214
-Two 3.0GHz Quad-Core Intel Xeon (8-core)   065-7534

-16GB RAM (4 x 4GB) 065-7179
-1TB 7200-rpm Serial ATA 3Gb/s  065-7544

-1TB 7200-rpm Serial ATA 3Gb/s  065-7546

-1TB 7200-rpm Serial ATA 3Gb/s  065-7193

-1TB 7200-rpm Serial ATA 3Gb/s  065-7548


Could someone please point me to the correct mode of operation/instructions
to install things correctly on this machine? I found some information on how
to install on an OS X machine in the archives, but it is a touch outdated
and seems to be missing some things.

Thank you very much for your time.

-SM


Re: Hadoop (0.18) Spill Failed, out of Heap Space error

2008-09-10 Thread Florian Leibert

Hi Chris,
where do you find those values? I don't seem to see them in the
userlogs or the tasktracker logs...


Thanks!

Florian
On Sep 3, 2008, at 2:50 PM, Chris Douglas wrote:

InMemValBytes::reset can perform an allocation, but it should be  
only as large as the value. When you look at the log for the failed  
task, what does it report as the values of bufstart, bufend,  
kvstart, kvend, etc. before the spill? -C


On Sep 3, 2008, at 9:49 AM, Paco NATHAN wrote:


Also, that almost always happens early in the map phase of the first
MR job which runs on our cluster.

Hadoop 0.18.1 on EC2 m1.xl instances.

We run 10 MR jobs in sequence, 6hr duration, not seeing the problem
repeated after that 1 heap space exception.

Paco


On Wed, Sep 3, 2008 at 11:42 AM, Florian Leibert <[EMAIL PROTECTED]>  
wrote:

Hi,
we're running 100 XLarge instances (ec2), with a gig of heap space  
for each
task - and are seeing the following error frequently (but not  
always):

# BEGIN PASTE #
 [exec] 08/09/03 11:21:09 INFO mapred.JobClient:  map 43% reduce 5%
 [exec] 08/09/03 11:21:16 INFO mapred.JobClient: Task Id : attempt_200809031101_0001_m_000220_0, Status : FAILED
 [exec] java.io.IOException: Spill failed
 [exec]     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:688)
 [exec]     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:228)
 [exec]     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
 [exec] Caused by: java.lang.OutOfMemoryError: Java heap space
 [exec]     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$InMemValBytes.reset(MapTask.java:928)
 [exec]     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.getVBytesForOffset(MapTask.java:891)
 [exec]     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:765)
 [exec]     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1600(MapTask.java:286)
 [exec]     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:712)
# END #

Has anyone seen this? Thanks,

Florian Leibert
Sr. Software Engineer
Adknowledge Inc.







Re: Thinking about retrieving DFS metadata from datanodes!!!

2008-09-10 Thread Ariel Rabkin
I don't understand this use case.

Suppose that you lose half the nodes in the cluster.  On average,
12.5% of your blocks were exclusively stored on the half the cluster
that's dead.  For many (most?) applications, a random 87.5% of the
data isn't really useful.  Storing metadata in more places would let
you turn a dead cluster into a corrupt cluster, but not into a working
one.   If you need to survive major disasters, you want a second HDFS
cluster in a different place.

The thing that might be useful to you, if you're worried about
simultaneous namenode and secondary NN failure, is to store the edit
log and fsimage on a SAN, and get fault tolerance that way.

--Ari

On Tue, Sep 9, 2008 at 6:38 PM, 叶双明 <[EMAIL PROTECTED]> wrote:
> Thanks for paying attention  to my tentative idea!
>
> What I thought isn't how to store the meradata, but the final (or last) way
> to recover valuable data in the cluster when something worst (which destroy
> the metadata in all multiple NameNode) happen. i.e. terrorist attack  or
> natural disasters destroy half of cluster nodes within all NameNode, we can
> recover as much data as possible by this mechanism, and hava big chance to
> recover entire data of cluster because fo original replication.
>
> Any suggestion is appreciate!
>
> 2008/9/10 Pete Wyckoff <[EMAIL PROTECTED]>
>
>> +1 -
>>
>> from the perspective of the data nodes, dfs is just a block-level store and
>> is thus much more robust and scalable.
>>
>>
>>
>> On 9/9/08 9:14 AM, "Owen O'Malley" <[EMAIL PROTECTED]> wrote:
>>
>> > This isn't a very stable direction. You really don't want multiple
>> distinct
>> > methods for storing the metadata, because discrepancies are very bad.
>> High
>> > Availability (HA) is a very important medium term goal for HDFS, but it
>> will
>> > likely be done using multiple NameNodes and ZooKeeper.
>> >
>> > -- Owen
>>

-- 
Ari Rabkin [EMAIL PROTECTED]
UC Berkeley Computer Science Department


Ordering of records in output files?

2008-09-10 Thread Joel Welling
Hi folks;
  I have a simple Streaming job where the mapper produces output records
beginning with a 16-character ASCII string and passes them to
IdentityReducer.  When I run it, I get the same number of output files
as I have mapred.reduce.tasks.  Each one contains some of the strings,
and within each file the strings are in sorted order.
  But there is no obvious ordering *across* the files.  For example, I
can see where the first few strings in the output went to files 0,1,3,4,
and then back to 0, but none of them ended up in file 2.
  What's the algorithm that determines which strings end up in which
files?  Is there a way I can change it so that sequentially ordered
strings end up in the same file rather than spraying off across all the
files?
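
(For context, and only as a sketch of the mechanism: by default the key of
each map output record is hashed and taken modulo the number of reduces,
and that hash decides which part-NNNNN file the record lands in, which is
why consecutive strings scatter across the files. A custom Partitioner can
route keys differently. A rough Java sketch, assuming the old
org.apache.hadoop.mapred API; the class name is made up:)

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Illustrative only: bucket keys by their first character so that keys
// that sort near each other tend to land in the same output file.
public class FirstCharPartitioner implements Partitioner<Text, Text> {
  public void configure(JobConf job) {
    // nothing to configure in this sketch
  }

  public int getPartition(Text key, Text value, int numPartitions) {
    String k = key.toString();
    int bucket = (k.length() == 0) ? 0 : k.charAt(0);
    return bucket % numPartitions;
  }
}

Such a class would be wired in with JobConf.setPartitionerClass (or, for a
Streaming job, via the streaming partitioner option; check the Streaming
docs for your release for the exact flag).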

Thanks,
-Joel
 [EMAIL PROTECTED]



Re: Simple Survey

2008-09-10 Thread Chris K Wensel

They say its fixed now.

On Sep 9, 2008, at 4:22 PM, John Kane wrote:

Unfortunately there is a problem with the survey. I was unable to answer
question 9 ("How much data is stored on your Hadoop cluster (in GB)?")
correctly. It would not let me enter more than 10 TB; we currently have
45 TB of data in our cluster (actual unique data, not the sum of disk
space used across all replicas).

Other than that, I tried :-)

On Tue, Sep 9, 2008 at 4:01 PM, Chris K Wensel <[EMAIL PROTECTED]>  
wrote:




Quick reminder to take the survey. We know more than a dozen  
companies are

using Hadoop. heh

http://www.scaleunlimited.com/survey.html

thanks!
chris


On Sep 8, 2008, at 10:43 AM, Chris K Wensel wrote:

Hey all


Scale Unlimited is putting together some case studies for an  
upcoming
class and wants to get a snapshot of what the Hadoop user  
community looks

like.

If you have 2 minutes, please feel free to take the short  
anonymous survey

below:

http://www.scaleunlimited.com/survey.html

All results will be public.

cheers,
chris

--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/




--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/



Re: output multiple values?

2008-09-10 Thread Shirley Cohen
Thanks Owen! I found the bug in my code: Doing collect twice does  
work now :))


Shirley

On Sep 9, 2008, at 4:19 PM, Owen O'Malley wrote:



On Sep 9, 2008, at 12:20 PM, Shirley Cohen wrote:

I have a simple reducer that computes the average by doing a sum/ 
count. But I want to output both the average and the count for a  
given key, not just the average. Is it possible to output both  
values from the same invocation of the reducer? Or do I need two  
reducer invocations? If I try to call output.collect() twice from  
the reducer and label the key with "type=avg" or "type=count", I  
get a bunch of garbage out. Please let me know if you have any  
suggestions.


I'd be tempted to define a type like:

class AverageAndCount implements Writable {
  private long sum;
  private long count;
  ...
  public String toString() {
 return "avg = " + (sum / (double) count) + ", count = " + count);
  }
}

Then you could use your reducer as both a combiner and reducer and  
you would get both values out if you use TextOutputFormat. That  
said, it should absolutely work to do collect twice.
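
Filling in the elided parts, a minimal sketch of what the complete type
might look like (the write/readFields bodies and the choice to serialize
the two longs directly are an assumption, not something specified above):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

class AverageAndCount implements Writable {
  private long sum;
  private long count;

  // Serialize the two running totals.
  public void write(DataOutput out) throws IOException {
    out.writeLong(sum);
    out.writeLong(count);
  }

  // Restore them in the same order they were written.
  public void readFields(DataInput in) throws IOException {
    sum = in.readLong();
    count = in.readLong();
  }

  public String toString() {
    return "avg = " + (sum / (double) count) + ", count = " + count;
  }
}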


-- Owen




Re: Thinking about retrieving DFS metadata from datanodes!!!

2008-09-10 Thread 叶双明
There is no need to update all blocks when you play with the metadata; the
additional information is just stored when the block itself is stored.

2008/9/10 Dmitry Pushkarev <[EMAIL PROTECTED]>

> This will effectively ruin system on large scale. Since you will have to
> update all blocks when you play with metadata...
>
> -Original Message-
> From: 叶双明 [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 10, 2008 12:06 AM
> To: core-user@hadoop.apache.org
> Subject: Re: Thinking about retrieving DFS metadata from datanodes!!!
>
> I think let each block carry three simple additional information which
> doesn't use in normal situation:
>   1. which file that it belong to
>   2. which block is it in the file
>   3. how many blocks of the file
> After the cluster system has been destroy, we can set up new NameNode , and
> then , rebuild metadata from the information reported from datanodes.
>
> And the cost is  a little disk space, indeed less than 1k each block I
> think.  I don't think it replace of multiple NameNodes or compare to , but
> just a  possible mechanism to recover data, the point is '"recover".
>
> hehe~~ thanks.
>
> 2008/9/10 Raghu Angadi <[EMAIL PROTECTED]>
>
> >
> > The main problem is the complexity of maintaining accuracy of the
> metadata.
> > In other words, what you think is the cost?
> >
> > Do you think writing fsimage to multiple places helps with the terrorist
> > attack? It is supported even now.
> >
> > Raghu.
> >
> > 叶双明 wrote:
> >
> >> Thanks for paying attention  to my tentative idea!
> >>
> >> What I thought isn't how to store the meradata, but the final (or last)
> >> way
> >> to recover valuable data in the cluster when something worst (which
> >> destroy
> >> the metadata in all multiple NameNode) happen. i.e. terrorist attack  or
> >> natural disasters destroy half of cluster nodes within all NameNode, we
> >> can
> >> recover as much data as possible by this mechanism, and hava big chance
> to
> >> recover entire data of cluster because fo original replication.
> >>
> >> Any suggestion is appreciate!
> >>
> >
> >
> >
>
>
> --
> Sorry for my english!! 明
>
>


-- 
Sorry for my english!! 明


RE: Thinking about retrieving DFS metadata from datanodes!!!

2008-09-10 Thread Dmitry Pushkarev
This will effectively ruin the system at large scale, since you will have to
update all blocks when you play with the metadata...

-Original Message-
From: 叶双明 [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 10, 2008 12:06 AM
To: core-user@hadoop.apache.org
Subject: Re: Thinking about retrieving DFS metadata from datanodes!!!

I think each block could carry three simple pieces of additional information
which are not used in normal operation:
   1. which file the block belongs to
   2. which block it is within that file
   3. how many blocks the file has
After the cluster has been destroyed, we can set up a new NameNode and then
rebuild the metadata from the information reported by the datanodes.

And the cost is only a little disk space, indeed less than 1 KB per block, I
think. I don't see it as a replacement for multiple NameNodes, or even
comparable to them; it is just a possible mechanism for recovering data. The
point is "recover".

hehe~~ thanks.

2008/9/10 Raghu Angadi <[EMAIL PROTECTED]>

>
> The main problem is the complexity of maintaining accuracy of the metadata.
> In other words, what you think is the cost?
>
> Do you think writing fsimage to multiple places helps with the terrorist
> attack? It is supported even now.
>
> Raghu.
>
> 叶双明 wrote:
>
>> Thanks for paying attention  to my tentative idea!
>>
>> What I thought isn't how to store the meradata, but the final (or last)
>> way
>> to recover valuable data in the cluster when something worst (which
>> destroy
>> the metadata in all multiple NameNode) happen. i.e. terrorist attack  or
>> natural disasters destroy half of cluster nodes within all NameNode, we
>> can
>> recover as much data as possible by this mechanism, and hava big chance to
>> recover entire data of cluster because fo original replication.
>>
>> Any suggestion is appreciate!
>>
>
>
>


-- 
Sorry for my english!! 明



Re: Thinking about retrieving DFS metadata from datanodes!!!

2008-09-10 Thread 叶双明
I think each block could carry three simple pieces of additional information
which are not used in normal operation:
   1. which file the block belongs to
   2. which block it is within that file
   3. how many blocks the file has
After the cluster has been destroyed, we can set up a new NameNode and then
rebuild the metadata from the information reported by the datanodes.
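
As a very rough sketch of the kind of per-block sidecar record this implies
(the class and field names are hypothetical; nothing like this exists in
HDFS):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical record a datanode could persist alongside each block.
class BlockOriginInfo implements Writable {
  private String filePath;  // 1. which file the block belongs to
  private int blockIndex;   // 2. which block it is within that file
  private int totalBlocks;  // 3. how many blocks the file has

  public void write(DataOutput out) throws IOException {
    out.writeUTF(filePath);
    out.writeInt(blockIndex);
    out.writeInt(totalBlocks);
  }

  public void readFields(DataInput in) throws IOException {
    filePath = in.readUTF();
    blockIndex = in.readInt();
    totalBlocks = in.readInt();
  }
}

A recovery tool could then scan every datanode's block directories, read
these records, and reassemble a namespace from them.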

And the cost is only a little disk space, indeed less than 1 KB per block, I
think. I don't see it as a replacement for multiple NameNodes, or even
comparable to them; it is just a possible mechanism for recovering data. The
point is "recover".

hehe~~ thanks.

2008/9/10 Raghu Angadi <[EMAIL PROTECTED]>

>
> The main problem is the complexity of maintaining accuracy of the metadata.
> In other words, what you think is the cost?
>
> Do you think writing fsimage to multiple places helps with the terrorist
> attack? It is supported even now.
>
> Raghu.
>
> 叶双明 wrote:
>
>> Thanks for paying attention  to my tentative idea!
>>
>> What I thought isn't how to store the meradata, but the final (or last)
>> way
>> to recover valuable data in the cluster when something worst (which
>> destroy
>> the metadata in all multiple NameNode) happen. i.e. terrorist attack  or
>> natural disasters destroy half of cluster nodes within all NameNode, we
>> can
>> recover as much data as possible by this mechanism, and hava big chance to
>> recover entire data of cluster because fo original replication.
>>
>> Any suggestion is appreciate!
>>
>
>
>


-- 
Sorry for my english!! 明