Re: Why is scaling HBase much simpler than scaling a relational db?

2008-08-27 Thread Mork0075
Thank you very much for your effort!

> So it really depends on what you want to use it for.  If you're
> thinking about it, you probably have some kind of scale issues.

Not at the moment. Actually our software runs on a single server, web
server/database/file storage/Lucene side by side. But we're planning to
create a structured approach for how to scale in the future. For the
database layer there are options like Master/Slave, Master/Master,
Clustering, Sharding AND HBase. HBase offers unlimited scaling
opportunities, with some drawbacks compared to relational dbs at the
moment. One of our considerations is "stay with an RDBMS and perhaps run
into some huge scaling issues in the future, or decide for HBase and PERHAPS
benefit from it in the future". What do you think, is this the right way
to think about the whole topic?

As I read, companies like Flickr, Digg and so on put huge effort into
sharding their databases. Would HBase be suitable for this scenario, and
would it have saved them the trouble?

> Again, typical small web apps are far simpler to write using SQL
> When building large scale web applications, this is what you want.

and that's the problem :) You don't know when, or if, your web
application becomes popular/large. As a computer scientist you're
always anxious to keep future growth in mind (and even more so if
it "costs" as little as HBase)

Jonathan Gray schrieb:
> Discussion inline.
> 
>> Your example with the friends makes perfect sense. Can you imagine a
>> scenario where storing the data in a column-oriented instead of a row-
>> oriented db (so, if you will, a counterexample) causes such a huge
>> performance mismatch, like the friends one in the row/column comparison?
> 
> There's quite a few advantages in a typical row-oriented database.  For one, 
> normalization is more space efficient, though this is increasingly less 
> important, especially in distributed file systems and with drives as cheap as 
> they are.
> 
> A big difference you'll find between organizations running large relational 
> clusters compared to large hbase/hadoop clusters is the type of hardware 
> used.  Relational databases can (and should) be memory hogs but worst of all 
> they need fast disk i/o.  Large 15k rpm SAS RAID-10 arrays are often used 
> costing tens of thousands of dollars.  Each server might have 15 drives or 
> more, 8 or more cores, and 32 gigs of ram.
> 
> In contrast, HBase/Hadoop clusters are most often made with "commodity" 
> hardware, a decent processor (quad core xeon) and 4+ gb of memory in a 1U 
> runs about $1000.  The i/o matters much less with this architecture and 
> that's where most cost can be associated when purchasing relational database 
> hardware.
> 
> One tradeoff of course is that you will have far more actual machines with 
> HBase/Hadoop but a node going down is not a big deal whereas it can be a much 
> bigger emergency if the number of total nodes is low.
> 
> 
> From a performance standpoint, there's no contest for relational databases 
> and their ability to index on different columns and randomly access records.  
> If you have dense data and you want to query it randomly or with ordering and 
> limiting/offsetting (in soft realtime), you're going to be in for some tough 
> times with HBase.  This is definitely where relational DBs shine.
> 
> In HBase things like this require lots of denormalization and cleverness 
> (though many times table scans are the only way), or in my case writing a 
> separate logic/application/caching layer on top of HBase.  For us, HBase is a 
> store that backs our caches and indexes.  Those are allowed to fail as they 
> can be fully recovered from HBase and also implement LRU-like algorithms as 
> our total dataset is on the order of terabytes.  Some of this caching can be 
> compared to Memcache for SQL, but indexing/joining is unique.
> 
> So it really depends on what you want to use it for.  If you're thinking 
> about it, you probably have some kind of scale issues.  Either in the size of 
> your dataset, your need to process a very large dataset in batch and in 
> parallel, or strong need for replication/fault-tolerance.  Though it's also 
> good to just explore because many of us think this is the direction things 
> are going :)
> 
> And things will get better!  Currently, there is already work being done with 
> indexing.  I have personally created join/merge logic that will be 
> contributed in the future.  But the larger issue at hand is the relatively 
> poor random read performance of HBase compared to that of relational 
> databases.  This is inextricably linked to HDFS, which is most often tuned 
> for batch processing rather than random access (a vast majority of Hadoop 
> users are using it in this way).  However, improvements are definitely being 
> made within both HBase and Hadoop.  You can follow the performance statistics 
> here:  http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation
> 
> Between 0.17 and 0.18, random read performance jumped 60%.  [...]

Could not obtain block: blk_-2634319951074439134_1129 file=/user/root/crawl_debug/segments/20080825053518/content/part-00002/data

2008-08-27 Thread wangxu
Hi, all
I am using hadoop-0.18.0-core.jar and nutch-2008-08-18_04-01-55.jar,
and running hadoop on one namenode and 4 slaves.
attached is my hadoop-site.xml, and I didn't change the file
hadoop-default.xml

when the data in the segments is large, this kind of error occurs:

java.io.IOException: Could not obtain block: blk_-2634319951074439134_1129 
file=/user/root/crawl_debug/segments/20080825053518/content/part-00002/data
at 
org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1462)
at 
org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1312)
at 
org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1417)
at java.io.DataInputStream.readFully(DataInputStream.java:178)
at 
org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:64)
at 
org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:102)
at 
org.apache.hadoop.io.SequenceFile$Reader.readBuffer(SequenceFile.java:1646)
at 
org.apache.hadoop.io.SequenceFile$Reader.seekToCurrentValue(SequenceFile.java:1712)
at 
org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1787)
at 
org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:104)
at 
org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:79)
at 
org.apache.hadoop.mapred.join.WrappedRecordReader.next(WrappedRecordReader.java:112)
at 
org.apache.hadoop.mapred.join.WrappedRecordReader.accept(WrappedRecordReader.java:130)
at 
org.apache.hadoop.mapred.join.CompositeRecordReader.fillJoinCollector(CompositeRecordReader.java:398)
at 
org.apache.hadoop.mapred.join.JoinRecordReader.next(JoinRecordReader.java:56)
at 
org.apache.hadoop.mapred.join.JoinRecordReader.next(JoinRecordReader.java:33)
at 
org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:165)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)


how can I correct this?
thanks.
Xu








<configuration>

<property>
  <name>mapred.map.tasks</name>
  <value>41</value>
  <description>The default number of map tasks per job.  Typically set
  to a prime several times greater than number of available hosts.
  Ignored when mapred.job.tracker is "local".
  </description>
</property>

<property>
  <name>mapred.reduce.tasks</name>
  <value>8</value>
  <description>The default number of reduce tasks per job.  Typically set
  to a prime close to the number of available hosts.  Ignored when
  mapred.job.tracker is "local".
  </description>
</property>

<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/nutch</value>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode:50001/</value>
</property>

<property>
  <name>mapred.job.tracker</name>
  <value>namenode:50002</value>
</property>

<property>
  <name>tasktracker.http.threads</name>
  <value>80</value>
</property>

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>

<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>

<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>

<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>

<property>
  <name>dfs.client.block.write.retries</name>
  <value>3</value>
</property>

</configuration>







RE: Optimizations

2008-08-27 Thread Ryan Lynch
More than likely you would be best served by outputting a file from
Hadoop in your map/reduce job and importing that directly into the
database. Databases typically support a way to do bulk loading of data
from a file like sqlldr for Oracle. Using reduce to insert data may
cause many short connections to the database that could result in many
small transactions instead of a single large transaction using a file.
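
To make that concrete, here is a rough, untested sketch (old 0.18-style API;
ExportReducer and the file/table names are made up) of a reduce that just
writes delimited lines, which TextOutputFormat turns into "key<TAB>value"
records a bulk loader can consume:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ExportReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      // one tab-separated line per record; no database connection involved
      output.collect(key, values.next());
    }
  }
}

Then copy the part-* files out of HDFS and hand them to the database's bulk
loader (sqlldr, LOAD DATA INFILE, COPY, etc.) in one large transaction.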

Regards,
Ryan

-Original Message-
From: Yih Sun Khoo [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 27, 2008 2:28 PM
To: core-user@hadoop.apache.org
Subject: Optimizations

Optimizations



Right now I have a job whose reducer phase outputs the key-value pairs
as
records into a database.  Is this the best way to be loading the
database?  What
are some alternatives?


RE: Design of the new job tracker

2008-08-27 Thread Vivek Ratan
There are a number of Jiras that modify the scheduling piece of the
JobTracker. 

- 3412 refactors the scheduler code out of the JT to make schedulers
more pluggable
- 3445 and 3746 are a couple of new schedulers that do more than the
default JT scheduler. Both these can be good alternatives to HOD. 3445
implements the requirements in 3421. 
- There are other Jiras, especially some linked off of 3444, that deal
with other aspects of resource management and scheduling. 

> -Original Message-
> From: Yiping Han [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, August 28, 2008 12:12 AM
> To: core-user@hadoop.apache.org
> Subject: Design of the new job tracker
> 
> Hi,
> 
> I want to know where is the detailed description of the next 
> gen. job tracker, which replaces hod? Thanks~
> 
> 
> --Yiping Han
> 
> 


Re: JobTracker Web interface sometimes does not display in IE7

2008-08-27 Thread Edward J. Yoon
Hi owen,

There is no solution on the server. Instead, I think we can add some
guide on how to fix this, or to explain this phenomenon.

On Thu, Aug 28, 2008 at 1:16 AM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
> Also consider making a patch to fix the behavior and submit it.
>

-- 
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org


Re: Optimizations

2008-08-27 Thread Edward J. Yoon
Most DBMSs these days have a bulk insert/load mechanism that is much faster
than one transaction per record. Check it out.

-Edward

On Thu, Aug 28, 2008 at 6:27 AM, Yih Sun Khoo <[EMAIL PROTECTED]> wrote:
> Optimizations
>
>
>
> Right now I have a job whose reducer phase outputs the key-value pairs as
> records into a database.  Is this the best way to be loading the
> database?  What
> are some alternatives?
>



-- 
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org


MultipleOutputFormat versus MultipleOutputs

2008-08-27 Thread Shirley Cohen

Hi,

I would like the reducer to output to different files based upon the  
value of the key. I understand that both MultipleOutputs and  
MultipleOutputFormat can do this. Is that correct? However, I don't  
understand the differences between these two classes. Can someone  
explain the differences and provide an example to illustrate these  
differences? I found a snippet of code on how to use MultipleOutputs  
in the documentation, but could not find an example for using  
MultipleOutputFormat.


Thanks in advance,

Shirley




RE: how use only a reducer without a mapper

2008-08-27 Thread Joydeep Sen Sarma
It would be useful to have a no-sort option in the map stage (ideally on a
per-file basis - perhaps using a regex).

With sorted data sets - the re-sorting is often unnecessary. As well -
one can have operations that deal with a mix of sorted and unsorted data
(a merge of a sorted table with new unsorted entries would be a good
example - where part of the data set needs to be sorted and then merged
with previously sorted data).

Although - I am not sure of the typical cost of the map-side sort
relative to the overall job.


-Original Message-
From: Jason Venner [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, August 27, 2008 9:28 AM
To: core-user@hadoop.apache.org
Subject: Re: how use only a reducer without a mapper

The down side of this (which appears to be the only way) is that your 
entire input data set has to pass through the identity mapper and then 
go through shuffle and sort before it gets to the reducer.
If you have a large input data set, this takes real resources - cpu, 
disk, network and wall clock time.

What we have been doing is making map files of our data sets, and 
running the Join code on them, then we have reduce equivalent capability

in the mapper.

Richard Tomsett wrote:
> Leandro Alvim wrote:
>> How can i use only a reduce without map?
>>   
>
> I don't know if there's a way to run just a reduce task without a map 
> stage, but you could do it by having a map stage just using the 
> IdentityMapper class (which passes the data through to the reducers 
> unchanged), so effectively just doing a reduce.
-- 
Jason Venner
Attributor - Program the Web 
Attributor is hiring Hadoop Wranglers and coding wizards, contact if 
interested


Optimizations

2008-08-27 Thread Yih Sun Khoo
Optimizations



Right now I have a job whose reducer phase outputs the key-value pairs as
records into a database.  Is this the best way to be loading the
database?  What
are some alternatives?


Design of the new job tracker

2008-08-27 Thread Yiping Han
Hi,

I want to know where is the detailed description of the next gen. job
tracker, which replaces hod? Thanks~


--Yiping Han



Re: questions on sorting big files and sorting order

2008-08-27 Thread Tarandeep Singh
On Tue, Aug 26, 2008 at 7:50 AM, Owen O'Malley <[EMAIL PROTECTED]> wrote:

> On Tue, Aug 26, 2008 at 12:39 AM, charles du <[EMAIL PROTECTED]> wrote:
>
> > I would like to sort a large number of records in a big file based on a
> > given field (key).
>
>
> The property you are looking for is a "total order" and you need to define
> your own partitioner class to do it. Look at the terasort example and how I
> did it in that program. Roughly, before the job the input is sampled and
> the
> proper split points are chosen.  When each partitioner picks where each key
> should go, it looks at the split points and sends it to the right reduce.
>
> http://tinyurl.com/5ltb2a


How do I sort if the key is not text? Say my records are -
abc 10 30.5 lmn

and I want to sort on fields 2 and 3.

Can you please give some pointers on how to modify your original partitioner
class to handle this case?

thanks,
Taran


>
> -- Owen
> *
>


Re: Real use-case

2008-08-27 Thread Jeff Payne
In order to do anything other than a tar transfer (which is a kludge, of
course), you'll need to open up the relevant ports between the client and
the hadoop cluster.  I may miss a few here, but I believe these would
include port 50010 for the datanodes and whatever port the namenode is
listening on.

Once you've done this, install hadoop on your client machine (but don't
start it) and use the command line tools directly from there.
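
If you'd rather do the copy programmatically from that client instead of using
the shell, a rough (untested) sketch with the FileSystem API looks like this
-- the namenode URI and paths below are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PushToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // point the client at the remote namenode (host:port are placeholders)
    conf.set("fs.default.name", "hdfs://namenode:50001/");
    FileSystem fs = FileSystem.get(conf);
    // recursively copies a local directory into HDFS over the open ports
    fs.copyFromLocalFile(new Path("/local/data"), new Path("/dfs-path"));
    fs.close();
  }
}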

On Wed, Aug 27, 2008 at 12:59 AM, Victor Samoylov <
[EMAIL PROTECTED]> wrote:

> Jeff,
>
> Thanks for help, I want to clarify several details:
>
> 1. I know this way to import files to HDFS, but this is connected with
> direct accessing HDFS nodes by user.
> Does exist another way export all data files from data server side to
> remote
> HDFS nodes without tar invocation?
>
> 2. I've setup replication factor as 2. How to setup 50 GB size of FS on one
> data node?
>
> Thanks,
> Victor Samoylov
>
> On Wed, Aug 27, 2008 at 3:10 AM, Jeff Payne <[EMAIL PROTECTED]> wrote:
>
> > Victor:
> >
> > I think in your use case the best way to move the data into hadoop would
> > either be to tar it up and move it to the same network the HDFS machines
> > are
> > on, untar it and then run...
> >
> >  hadoop dfs -put /contents-path /dfs-path
> >
> > If you only want a replication factor of 2 (the default is 3), open up
> the
> > hadoop-site.xml file and add this snippet...
> >
> > <property>
> >   <name>dfs.replication</name>
> >   <value>2</value>
> > </property>
> >
> > --
> > Jeffrey Payne
> > Lead Software Engineer
> > Eyealike, Inc.
> > [EMAIL PROTECTED]
> > www.eyealike.com
> > (206) 257-8708
> >
> >
> > "Anything worth doing is worth overdoing."
> > -H. Lifter
> >
> > On Tue, Aug 26, 2008 at 2:54 PM, Victor Samoylov <
> > [EMAIL PROTECTED]
> > > wrote:
> >
> > > Hi,
> > >
> > > I want to use HDFS as DFS to store files.  I have one data server with
> > 50Gb
> > > data and I plan to use 3 new machines with installed HDFS to duplicate
> > this
> > > data.
> > > These 3 machines are: 1 name node, 2 data nodes. The duplication factor
> > for
> > > all files is 2.
> > >
> > > My questions are:
> > > 1. How could I create 50 GB data node on one server? Actually I'm very
> > > insteresting with setting 50 GB size for data node.
> > > 2. What is the best way to export all data files from external server
> > (ssh
> > > access) to new ones with HDFS?
> > >
> > > Thanks,
> > > Victor Samoylov
> > >
> >
>



-- 
Jeffrey Payne
Lead Software Engineer
Eyealike, Inc.
[EMAIL PROTECTED]
www.eyealike.com
(206) 257-8708


"Anything worth doing is worth overdoing."
-H. Lifter


Re: how use only a reducer without a mapper

2008-08-27 Thread Owen O'Malley
On Wed, Aug 27, 2008 at 9:27 AM, Jason Venner <[EMAIL PROTECTED]> wrote:

> The down side of this (which appears to be the only way) is that your
> entire input data set has to pass through the identity mapper and then go
> through shuffle and sort before it gets to the reducer.


If you don't need the sort, then just put all of your processing in to the
mapper and use reduces = 0. The framework will not sort the data and the
output of the map will be sent straight to the OutputFormat. There is no
reduce in that case. This is how you write jobs that don't have any
interaction or data flow between the tasks.
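
A minimal sketch of such a map-only job (untested; MyMapper and the paths are
placeholders, 0.18-era API):

JobConf conf = new JobConf(MyMapper.class);
conf.setJobName("map-only");
conf.setMapperClass(MyMapper.class);       // all of the processing lives here
conf.setNumReduceTasks(0);                 // no sort, no shuffle, no reduce
FileInputFormat.setInputPaths(conf, new Path("in"));
FileOutputFormat.setOutputPath(conf, new Path("out"));
JobClient.runJob(conf);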

-- Owen


Re: how use only a reducer without a mapper

2008-08-27 Thread Jason Venner
The down side of this (which appears to be the only way) is that your 
entire input data set has to pass through the identity mapper and then 
go through shuffle and sort before it gets to the reducer.
If you have a large input data set, this takes real resources - cpu, 
disk, network and wall clock time.


What we have been doing is making map files of our data sets, and 
running the Join code on them, then we have reduce equivalent capability 
in the mapper.
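
Roughly (an untested sketch; the class and path names here are made up), that
map-side join setup looks like this with the org.apache.hadoop.mapred.join
package:

JobConf conf = new JobConf(MyJoinJob.class);
conf.setInputFormat(CompositeInputFormat.class);
// both inputs must be sorted and identically partitioned sequence/map files
conf.set("mapred.join.expr", CompositeInputFormat.compose(
    "inner", SequenceFileInputFormat.class,
    new Path("/data/left"), new Path("/data/right")));
conf.setNumReduceTasks(0);                // the "reduce equivalent" work happens in the map
conf.setMapperClass(MyJoinMapper.class);  // sees one key plus a TupleWritable of joined values
JobClient.runJob(conf);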


Richard Tomsett wrote:

Leandro Alvim wrote:

How can i use only a reduce without map?
  


I don't know if there's a way to run just a reduce task without a map 
stage, but you could do it by having a map stage just using the 
IdentityMapper class (which passes the data through to the reducers 
unchanged), so effectively just doing a reduce.

--
Jason Venner
Attributor - Program the Web 
Attributor is hiring Hadoop Wranglers and coding wizards, contact if 
interested


Re: how use only a reducer without a mapper

2008-08-27 Thread Richard Tomsett

Leandro Alvim wrote:

How can i use only a reduce without map?
  


I don't know if there's a way to run just a reduce task without a map 
stage, but you could do it by having a map stage just using the 
IdentityMapper class (which passes the data through to the reducers 
unchanged), so effectively just doing a reduce.
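
A rough sketch of that (untested; MyReducer and the key/value classes depend on
your job):

JobConf conf = new JobConf(MyReducer.class);
conf.setMapperClass(org.apache.hadoop.mapred.lib.IdentityMapper.class);
conf.setReducerClass(MyReducer.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
JobClient.runJob(conf);

The price, as noted elsewhere in this thread, is that everything still flows
through the shuffle and sort.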


Re: JobTracker Web interface sometimes does not display in IE7

2008-08-27 Thread Owen O'Malley
Also consider making a patch to fix the behavior and submit it.


Re: possibility to start reducer only after mapper completed certain percentage

2008-08-27 Thread Owen O'Malley
On Tue, Aug 26, 2008 at 11:54 PM, Pallavi Palleti <[EMAIL PROTECTED]>wrote:

> Where, I am building a dictionary by collecting the same
> in pieces from mapper jobs. I will be using this dictionary in reduce()
> method. Can some one please help me if I can put a constraint over reducer
> startup time?


There isn't a way to do this and it would break the architecture in a pretty
fundamental way. I'm not sure what you are doing exactly, but if there is
data flow between the maps and reduces, you should be passing it through the
"normal" data path using collect. Your sort comparator would need to sort it
before the normal data.
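
A rough sketch of the key type that idea needs (untested; partitioning so the
dictionary pieces reach the right reduces is left out):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class TaggedKey implements WritableComparable {
  public Text key = new Text();
  public byte tag;                 // 0 = dictionary piece, 1 = normal record

  public void write(DataOutput out) throws IOException {
    key.write(out);
    out.writeByte(tag);
  }
  public void readFields(DataInput in) throws IOException {
    key.readFields(in);
    tag = in.readByte();
  }
  public int compareTo(Object o) {
    TaggedKey other = (TaggedKey) o;
    // dictionary records (tag 0) sort ahead of the normal data
    if (tag != other.tag) return tag - other.tag;
    return key.compareTo(other.key);
  }
}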

-- Owen


RE: Why is scaling HBase much simpler than scaling a relational db?

2008-08-27 Thread Jonathan Gray
Discussion inline.

> Your example with the friends makes perfect sense. Can you imagine a
> scenario where storing the data in a column-oriented instead of a row-
> oriented db (so, if you will, a counterexample) causes such a huge
> performance mismatch, like the friends one in the row/column comparison?

There's quite a few advantages in a typical row-oriented database.  For one, 
normalization is more space efficient, though this is increasingly less 
important, especially in distributed file systems and with drives as cheap as 
they are.

A big difference you'll find between organizations running large relational 
clusters compared to large hbase/hadoop clusters is the type of hardware used.  
Relational databases can (and should) be memory hogs but worst of all they need 
fast disk i/o.  Large 15k rpm SAS RAID-10 arrays are often used costing tens of 
thousands of dollars.  Each server might have 15 drives or more, 8 or more 
cores, and 32 gigs of ram.

In contrast, HBase/Hadoop clusters are most often made with "commodity" 
hardware, a decent processor (quad core xeon) and 4+ gb of memory in a 1U runs 
about $1000.  The i/o matters much less with this architecture and that's where 
most cost can be associated when purchasing relational database hardware.

One tradeoff of course is that you will have far more actual machines with 
HBase/Hadoop but a node going down is not a big deal whereas it can be a much 
bigger emergency if the number of total nodes is low.


From a performance standpoint, there's no contest for relational databases and 
their ability to index on different columns and randomly access records.  If 
you have dense data and you want to query it randomly or with ordering and 
limiting/offsetting (in soft realtime), you're going to be in for some tough 
times with HBase.  This is definitely where relational DBs shine.

In HBase things like this require lots of denormalization and cleverness 
(though many times table scans are the only way), or in my case writing a 
separate logic/application/caching layer on top of HBase.  For us, HBase is a 
store that backs our caches and indexes.  Those are allowed to fail as they can 
be fully recovered from HBase and also implement LRU-like algorithms as our 
total dataset is on the order of terabytes.  Some of this caching can be 
compared to Memcache for SQL, but indexing/joining is unique.

So it really depends on what you want to use it for.  If you're thinking about 
it, you probably have some kind of scale issues.  Either in the size of your 
dataset, your need to process a very large dataset in batch and in parallel, or 
strong need for replication/fault-tolerance.  Though it's also good to just 
explore because many of us think this is the direction things are going :)

And things will get better!  Currently, there is already work being done with 
indexing.  I have personally created join/merge logic that will be contributed 
in the future.  But the larger issue at hand is the relatively poor random read 
performance of HBase compared to that of relational databases.  This is 
inextricably linked to HDFS, which is most often tuned for batch processing 
rather than random access (a vast majority of Hadoop users are using it in this 
way).  However, improvements are definitely being made within both HBase and 
Hadoop.  You can follow the performance statistics here:  
http://wiki.apache.org/hadoop/Hbase/PerformanceEvaluation

Between 0.17 and 0.18, random read performance jumped 60%.  So things are 
definitely getting better, there's many people working hard at both projects.  
I suspect we will also see in-memory tables in the next few months which can 
help if you have only certain tables that need fast random access.

I think I got a bit off topic but hopefully you find something useful out of 
it...


> Can you please provide an example of "good de-normalization" in HBase
> and how it's kept consistent (in your friends example in a relational db,
> there would be a cascading delete)? As I think of the users table: if I
> delete a user with the userid='123', do I then have to walk through all
> of the other users' column-family "friends" to guarantee consistency?! Is
> de-normalization in HBase only used to avoid joins? Our webapp doesn't
> use joins at the moment anyway.

You lose any concept of foreign keys.  You have a primary key, that's it.  No 
secondary keys/indexes, no foreign keys.

It's the responsibility of your application to handle something like deleting a 
friend and cascading to the friendships.  Again, typical small web apps are far 
simpler to write using SQL, you become responsible for some of the things that 
were once handled for you.

When building large scale web applications, this is what you want.  Control.  
Like programming in C versus Java (no flame war intended, I love them both), 
that control comes at the cost of complexity.  But when you need to scale in a 
relational database you so often end up hacking away at it, 

IsolationRunner [was Re: extracting input to a task from a (streaming) job?]

2008-08-27 Thread Yuri Pradkin
I posted this a while back and have been wondering whether I missed something 
and the doc is out of date or this is a bug and I should file a jira.  Is 
there anyone out there who is successfully using IsolationRunner?  Please let 
me know.

Thanks,

  -Yuri

On Friday 08 August 2008 10:09:48 Yuri Pradkin wrote:

> > > I believe you should set "keep.failed.tasks.files" to true -- this way,
> > > given a task id, you can see what input files it has in
> > > ~/taskTracker/${taskid}/work (source:
> > > http://hadoop.apache.org/core/docs/r0.17.0/mapred_tutorial.html#IsolationRunner )

I forgot to add: I set
<property>
  <name>keep.failed.task.files</name>
  <value>true</value>
</property>

Note that the doc calls it keep.failed.tasks.files (tasks plural), which doesn't 
match the code.

>
> IsolationRunner does not work as described in the tutorial.  After the task
> hung, I failed it via the web interface.  Then I went to the node that was
> running this task
>
>   $ cd ...local/taskTracker/jobcache/job_200808071645_0001/work
> (this path is already different from the tutorial's)
>
>   $ hadoop org.apache.hadoop.mapred.IsolationRunner ../job.xml
> Exception in thread "main" java.lang.NullPointerException
> at
> org.apache.hadoop.mapred.IsolationRunner.main(IsolationRunner.java:164)
>
> Looking at IsolationRunner code, I see this:
>
> 164     File workDirName = new File(lDirAlloc.getLocalPathToRead(
> 165                                   TaskTracker.getJobCacheSubdir()
> 166                                   + Path.SEPARATOR + taskId.getJobID()
> 167                                   + Path.SEPARATOR + taskId
> 168                                   + Path.SEPARATOR + "work",
> 169                                   conf).toString());
>
> I.e. it assumes there is supposed to be a taskID subdirectory under the job
> dir, but:
>  $ pwd
>  ...mapred/local/taskTracker/jobcache/job_200808071645_0001
>  $ ls
>  jars  job.xml  work
>
> -- it's not there.  Any suggestions?
>
> Thanks,
>
>   -Yuri




Re: Why is scaling HBase much simpler than scaling a relational db?

2008-08-27 Thread Edward J. Yoon
Hi,

Planet-scale data exploration and data mining operations will almost
always need to include some sequential scans. So, how can we speed
up sequential scans? The BigTable paper shows how:

* Column-oriented storage (it reduces I/O)
* Data compression
* PDP (parallel distributed processing) using Map/Reduce

Also, matrices that are column-major typically perform better with
column-oriented operations, and likewise for row-major matrices. See
the Hama/Heart project (http://incubator.apache.org/hama,
http://wiki.apache.org/incubator/HeartProposal) on Hadoop + Hbase.

Salesman, Edward :)

On Wed, Aug 27, 2008 at 4:57 PM, Mork0075 <[EMAIL PROTECTED]> wrote:
> I'm still really interested in these three questions :)
>
>> Your example with the friends makes perfect sense. Can you imagine a
>> scenario where storing the data in a column-oriented instead of a row-
>> oriented db (so, if you will, a counterexample) causes such a huge
>> performance mismatch, like the friends one in the row/column comparison?
>
>
>> Can you please provide an example of "good de-normalization" in HBase
>> and how it's kept consistent (in your friends example in a relational db,
>> there would be a cascading delete)? As I think of the users table: if I
>> delete a user with the userid='123', do I then have to walk through all
>> of the other users' column-family "friends" to guarantee consistency?! Is
>> de-normalization in HBase only used to avoid joins? Our webapp doesn't
>> use joins at the moment anyway.
>
>> As you describe it, it's a problem of implementation. BigTable is
>> designed to scale; there are routines to shard the data and distribute
>> it to the pool of connected servers. Could MySQL perhaps decide
>> tomorrow to implement something similar, or does the relational model
>> prevent this?
>
>
>>
>> Jonathan Gray schrieb:
>>> A few very big differences...
>>>
>>> - HBase/BigTable don't have "transactions" in the same way that a
>>> relational database does.  While it is possible (and was just recently
>>> implemented for HBase, see HBASE-669) it is not at the core of this
>>> design.  A major bottleneck of distributed multi-master relational
>>> databases is distributed transactions/locks.
>>>
>>> - There's a very big difference between storage of
>>> relational/row-oriented databases and column-oriented databases.  For
>>> example, if I have a table of 'users' and I need to store friendships
>>> between these users... In a relational database my design is something
>>> like:
>>>
>>> Table: users(pkey = userid)
>>> Table: friendships(userid,friendid,...) which contains one (or maybe
>>> two depending on how it's impelemented) row for each friendship.
>>>
>>> In order to lookup a given users friend, SELECT * FROM friendships
>>> WHERE userid = 'myid';
>>>
>>> This query would use an index on the friendships table to retrieve all
>>> the necessary rows.  Depending on the relational database you might
>>> also be fetching each and every row (entirely) off of disk to be
>>> read.  In a sharded relational database, this would require hitting
>>> every node to get whichever friendships were stored on that node.
>>> There's lots of room for optimizations here but any way you slice it,
>>> you're likely pulling non-sequential blocks off disk.  When you add in
>>> the overhead of ACID transactions this can get slow.
>>>
>>> The cost of this relational query continues to increase as a user adds
>>> more friends.  You also begin to have practical limits.  If I have
>>> millions of users, each with many thousands of potential friends, the
>>> size of these indexes grow exponentially and things get nasty
>>> quickly.  Rather than friendships, imagine I'm storing activity logs
>>> of actions taken by users.
>>>
>>> In a column-oriented database these things scale continuously with
>>> minimal difference between 10 users and 10,000,000 users, 10
>>> friendships and 10,000 friendships.
>>>
>>> Rather than a friendships table, you could just have a friendships
>>> column family in the users table.  Each column in that family would
>>> contain the ID of a friend.  The value could store anything else you
>>> would have stored in the friendships table in the relational model.
>>> As column families are stored together/sequentially on a per-row
>>> basis, reading a user with 1 friend versus a user with 10,000 friends
>>> is virtually the same.  The biggest difference is just in the shipping
>>> of this information across the network which is unavoidable.  In this
>>> system a user could have 10,000,000 friends.  In a relational database
>>> the size of the friendship table would grow massively and the indexes
>>> would be out of control.
>>>
>>>
>>> It's certainly possible to make relational databases "scale".  What
>>> that is about is usually massive optimizations, manual sharding, being
>>> very clever about how you query things, and often de-normalizing.
>>> Index bloat and table bloat can thrash a relational db.
>>>
>>> In HBase, de-normalizing is usuall

Re: how use only a reducer without a mapper

2008-08-27 Thread Miles Osborne
Streaming has the ability to accept as input multiple directories, so that
would enable you to merge two directories

(--is this an assignment? ...)

Miles

2008/8/27 Leandro Alvim <[EMAIL PROTECTED]>

> Hi, I need help if it's possible.
>
> My name is Leandro Alvim and i`m a graduated in computer science in Brazil.
> So, i'm using hadoop in my university project and i used your tutorials to
> learn how to install and run a simple test with python and hadoop. Writing
> my application i faced a problem that i don't know how to solve. Here it
> comes:
>
> My linux shell script:
>
> input_path=/pls_in
> input=pls_input
> output_1=pls_out1
> output_2=pls_out2
> output_3=pls_out3
> output_4=pls_out4
>
> #deleting output directories
> $hadoop_path/bin/hadoop dfs -rmr $output_1
> $hadoop_path/bin/hadoop dfs -rmr $output_2
> $hadoop_path/bin/hadoop dfs -rmr $output_3
> $hadoop_path/bin/hadoop dfs -rmr $output_4
>
> *#mapreduce A*
> $hadoop_path/bin/hadoop jar
> $hadoop_path/contrib/streaming/hadoop-0.17.1-streaming.jar -input $input/*
> -output $output_1 -mapper $mapper_path/mapperA.py -reducer
> $reducer_path/reducerA.py -jobconf mapred.reduce.tasks=2 -jobconf
> mapred.map.tasks=2
>
> *#map B*
> $hadoop_path/bin/hadoop jar
> $hadoop_path/contrib/streaming/hadoop-0.17.1-streaming.jar -mapper
> $mapper_path/mapperB.py -input $input/* -output $output_2 -jobconf
> mapred.reduce.tasks=0 -jobconf mapred.map.tasks=2
>
> *#map C*
> $hadoop_path/bin/hadoop jar
> $hadoop_path/contrib/streaming/hadoop-0.17.1-streaming.jar -mapper
> $mapper_path/mapperC.py -input $output_1/* -output $output_3 -jobconf
> mapred.reduce.tasks=0 -jobconf mapred.map.tasks=2
>
> *#reduce B and C using the outputs *
> $hadoop_path/bin/hadoop jar
> $hadoop_path/contrib/streaming/hadoop-0.17.1-streaming.jar
> -reducer $reducer_path/reducerD.py
> -input $output_2/* -input $output_3/* -output $output_4  -jobconf
> mapred.reduce.tasks=2 -jobconf mapred.map.tasks=0
>
> *The problem is here. I can`t reduce using two previous files that was
> mapped in fase (mapB and mapC) to an output. *
>
> How can i use only a reduce without map?
>
> How can i join two directories in one that all files willl be ordened for
> the reducer fase?
>
>
> #listing the outputs
> $hadoop_path/bin/hadoop dfs -cat $output_1/*
> $hadoop_path/bin/hadoop dfs -cat $output_2/*
> $hadoop_path/bin/hadoop dfs -cat $output_3/*
> $hadoop_path/bin/hadoop dfs -cat $output_4/*
>
>
> --
> Atenciosamente,
> Leandro G.M. Alvim
>



-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.


how use only a reducer without a mapper

2008-08-27 Thread Leandro Alvim
Hi, I need help if it's possible.

My name is Leandro Alvim and I'm a computer science graduate in Brazil.
I'm using Hadoop in my university project, and I used your tutorials to
learn how to install it and run a simple test with Python and Hadoop. Writing
my application I faced a problem that I don't know how to solve. Here it
comes:

My linux shell script:

input_path=/pls_in
input=pls_input
output_1=pls_out1
output_2=pls_out2
output_3=pls_out3
output_4=pls_out4

#deleting output directories
$hadoop_path/bin/hadoop dfs -rmr $output_1
$hadoop_path/bin/hadoop dfs -rmr $output_2
$hadoop_path/bin/hadoop dfs -rmr $output_3
$hadoop_path/bin/hadoop dfs -rmr $output_4

*#mapreduce A*
$hadoop_path/bin/hadoop jar
$hadoop_path/contrib/streaming/hadoop-0.17.1-streaming.jar -input $input/*
-output $output_1 -mapper $mapper_path/mapperA.py -reducer
$reducer_path/reducerA.py -jobconf mapred.reduce.tasks=2 -jobconf
mapred.map.tasks=2

*#map B*
$hadoop_path/bin/hadoop jar
$hadoop_path/contrib/streaming/hadoop-0.17.1-streaming.jar -mapper
$mapper_path/mapperB.py -input $input/* -output $output_2 -jobconf
mapred.reduce.tasks=0 -jobconf mapred.map.tasks=2

*#map C*
$hadoop_path/bin/hadoop jar
$hadoop_path/contrib/streaming/hadoop-0.17.1-streaming.jar -mapper
$mapper_path/mapperC.py -input $output_1/* -output $output_3 -jobconf
mapred.reduce.tasks=0 -jobconf mapred.map.tasks=2

*#reduce B and C using the outputs *
$hadoop_path/bin/hadoop jar
$hadoop_path/contrib/streaming/hadoop-0.17.1-streaming.jar
-reducer $reducer_path/reducerD.py
-input $output_2/* -input $output_3/* -output $output_4  -jobconf
mapred.reduce.tasks=2 -jobconf mapred.map.tasks=0

The problem is here. I can't reduce the two outputs that were produced in the
previous map phases (mapB and mapC) into one output.

How can I use only a reduce without a map?

How can I join two directories into one so that all files will be ordered for
the reducer phase?


#listing the outputs
$hadoop_path/bin/hadoop dfs -cat $output_1/*
$hadoop_path/bin/hadoop dfs -cat $output_2/*
$hadoop_path/bin/hadoop dfs -cat $output_3/*
$hadoop_path/bin/hadoop dfs -cat $output_4/*


-- 
Atenciosamente,
Leandro G.M. Alvim


Re: Load balancing in HDFS

2008-08-27 Thread Mork0075
This sounds really interesting. And when increasing the replicas for
certain files, does the available throughput for these files increase too?

Allen Wittenauer schrieb:
> 
> 
> On 8/27/08 12:54 AM, "Mork0075" <[EMAIL PROTECTED]> wrote:
>> i'am planning to use HDFS as a DFS in a web application evenvironment.
>> There are two requirements: fault tolerence, which is ensured by the
>> replicas and load balancing.
> 
> There is a SPOF in the form of the name node.  So depending upon your
> needs, that may or may not be acceptable risk.
> 
> On 8/27/08 1:23 AM, "Mork0075" <[EMAIL PROTECTED]> wrote:
>> Some documents stored in the HDFS could be very popular and
>> therefor accessed more often then others. Then HDFS needs to balance the
>> load - distribute the requests to different nodes. Is i possible?
> 
> Not automatically.  However, it is possible to manually/programmatically
> increase the replication on files.
> 
> This is one of the possible uses for the new audit logging in 0.18... By
> watching the log, it should be possible to determine which files need a
> higher replication factor.
> 
> 



Re: Load balancing in HDFS

2008-08-27 Thread Allen Wittenauer



On 8/27/08 12:54 AM, "Mork0075" <[EMAIL PROTECTED]> wrote:
> i'am planning to use HDFS as a DFS in a web application evenvironment.
> There are two requirements: fault tolerence, which is ensured by the
> replicas and load balancing.

There is a SPOF in the form of the name node.  So depending upon your
needs, that may or may not be acceptable risk.

On 8/27/08 1:23 AM, "Mork0075" <[EMAIL PROTECTED]> wrote:
> Some documents stored in the HDFS could be very popular and
> therefor accessed more often then others. Then HDFS needs to balance the
> load - distribute the requests to different nodes. Is i possible?

Not automatically.  However, it is possible to manually/programmatically
increase the replication on files.

This is one of the possible uses for the new audit logging in 0.18... By
watching the log, it should be possible to determine which files need a
higher replication factor.
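
For example (a rough sketch; the path and factor are made up): from the shell,
"bin/hadoop dfs -setrep -w 10 /popular/file", or programmatically:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BumpReplication {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // more replicas means more datanodes that can serve reads of a hot file
    fs.setReplication(new Path("/popular/file"), (short) 10);
  }
}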



Re: too many fetch-failures

2008-08-27 Thread Edward J. Yoon
>> when i run example wordcount i have problem like this :

Is wordcount a hadoop example? or your code?

On 8/16/08, tran thien <[EMAIL PROTECTED]> wrote:
> hi everyone,
> i am using hadoop 0.17.1.
> There are 2 node : one master(also slave) and one slave.
> when i run example wordcount i have problem like this :
>
> 08/08/16 11:59:39 INFO mapred.JobClient:  map 100% reduce 22%
> 08/08/16 11:59:48 INFO mapred.JobClient:  map 100% reduce 23%
> 08/08/16 12:02:03 INFO mapred.JobClient: Task Id :
> task_200808161130_0001_m_07_0, Status : FAILED
> Too many fetch-failures
>
> I config hadoop-site.xml like this :
>
> <configuration>
>
> <property>
>   <name>fs.default.name</name>
>   <value>hdfs://192.168.1.135:54310</value>
>   <description>The name of the default file system.  A URI whose
>   scheme and authority determine the FileSystem implementation.  The
>   uri's scheme determines the config property (fs.SCHEME.impl) naming
>   the FileSystem implementation class.  The uri's authority is used to
>   determine the host, port, etc. for a filesystem.</description>
> </property>
>
> <property>
>   <name>mapred.job.tracker</name>
>   <value>192.168.1.135:54311</value>
>   <description>The host and port that the MapReduce job tracker runs
>   at.  If "local", then jobs are run in-process as a single map
>   and reduce task.
>   </description>
> </property>
>
> <property>
>   <name>dfs.replication</name>
>   <value>2</value>
>   <description>Default block replication.
>   The actual number of replications can be specified when the file is created.
>   The default is used if replication is not specified in create time.
>   </description>
> </property>
>
> <property>
>   <name>mapred.map.tasks</name>
>   <value>11</value>
>   <description>The default number of map tasks per job.  Typically set
>   to a prime several times greater than number of available hosts.
>   Ignored when mapred.job.tracker is "local".
>   </description>
> </property>
>
> <property>
>   <name>mapred.reduce.tasks</name>
>   <value>7</value>
>   <description>The default number of reduce tasks per job.  Typically set
>   to a prime close to the number of available hosts.  Ignored when
>   mapred.job.tracker is "local".
>   </description>
> </property>
>
> <property>
>   <name>mapred.tasktracker.map.tasks.maximum</name>
>   <value>5</value>
>   <description>The maximum number of map tasks that will be run
>   simultaneously by a task tracker.
>   </description>
> </property>
>
> <property>
>   <name>mapred.tasktracker.reduce.tasks.maximum</name>
>   <value>5</value>
>   <description>The maximum number of reduce tasks that will be run
>   simultaneously by a task tracker.
>   </description>
> </property>
>
> </configuration>
>
> I don't know why? Can you help me to resolve this problem?
>
> Thanks for the help in advance,
>
> Regards,
> thientd
>
>
>
>


-- 
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org


Re: JobTracker Web interface sometimes does not display in IE7

2008-08-27 Thread Edward J. Yoon
I guess this is not a hadoop bug, you can fix it.

See the 
HKEY_LOCAL_MACHINE>SOFTWARE>Microsoft>Windows>CurrentVersion>URL>DefaultPrefix

It should be "http://".

Regards, Edward

On 8/27/08, Andy Fraley <[EMAIL PROTECTED]> wrote:
> Minor issue in case anyone else gets tripped up trying to use IE7 to view
> Hadoop Web GUIs like I have been.  My server configuration is all Fedora 9.
> I have two master machines running NameNode and JobTracker and two slave
> machines running DataNode and TaskTracker.  When I access my JobTracker
> server from IE7 on a Windows XP desktop using <hostname>:50030, it fails
> with an error saying the "<hostname>" protocol type is unrecognized.  If I specify a
> leading http:// as in http://<hostname>:50030 it works fine.  I only have
> this problem accessing the JobTracker Web GUI.  Accessing NameNode, DataNode, or
> TaskTracker Web GUI from IE7 works fine with or without the leading http://.
> And from Windows XP using either Safari or Firefox, I can access all four
> (NameNode, DataNode, JobTracker, TaskTracker) Web GUIs fine with or without
> leading http:// prefix.  So the leading prefix http:// issue on JobTracker
> seems to only affect IE7.
>
>
>
> -Andy Fraley
>
>  [EMAIL PROTECTED]
>
>
>
>


-- 
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org


Re: NameNode formatting issues in 1.16.4 and higher

2008-08-27 Thread Alex Loddengaard
Thanks for your help, Arijit.  After an entire day, I'm up and running!
Yippee!

I had some strangeness going on with LDAP that made Hadoop not able to find
some groups that I'm in.  Solution: got rid of LDAP :).

Thanks again!

Alex

On Tue, Aug 26, 2008 at 5:55 PM, Arijit Mukherjee <
[EMAIL PROTECTED]> wrote:

> Hi
>
> Most likely, it's due to login permissions. Have you set up ssh for
> accessing the nodes? This page might be helpful -
> http://tinyurl.com/6lz6o3 - contains detailed explanation of the steps
> you should follow.
>
> Hope this helps
>
> Cheers
> Arijit
>
> Dr. Arijit Mukherjee
> Principal Member of Technical Staff, Level-II
> Connectiva Systems (I) Pvt. Ltd.
> J-2, Block GP, Sector V, Salt Lake
> Kolkata 700 091, India
> Phone: +91 (0)33 23577531/32 x 107
> http://www.connectivasystems.com
>
>
> -Original Message-
> From: Alex Loddengaard [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, August 26, 2008 3:17 PM
> To: core-user@hadoop.apache.org
> Subject: NameNode formatting issues in 1.16.4 and higher
>
>
> First post to the list; excited to be joining the community!
>
> I've been following the getting started guide (<
> http://wiki.apache.org/hadoop/GettingStartedWithHadoop>) in hopes of
> getting just a single-node Hadoop instance up and running, and I've run
> into an issue that I can't seem to find anywhere.
>
> While attempting to format my NameNode for the first time, I'm getting
> the following exception:
>
> ERROR namenode.NameNode: java.io.IOException:
> > javax.security.auth.login.LoginException: Login failed: id: cannot
> > find name for group ID 5001
> > id: cannot find name for group ID 5221
> >
> > at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:250)
> > at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:275)
> > at org.apache.hadoop.security.UnixUserGroupInformation.login(UnixUserGroupInformation.java:257)
> > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setConfigurationParameters(FSNamesystem.java:405)
> > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:394)
> > at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:759)
> > at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:841)
> > at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:858)
> >
> > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.setConfigurationParameters(FSNamesystem.java:407)
> > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:394)
> > at org.apache.hadoop.hdfs.server.namenode.NameNode.format(NameNode.java:759)
> > at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:841)
> > at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:858)
> >
>
> Firstly, I'm not familiar with LoginException at all, but after Googling
> for a while, most of the issues came about by errors with /etc/groups
> and permissions; I couldn't find a solution, though.  What's most
> curious is that this issue doesn't come up with 0.15.2, but it does come
> up with every later version, including 0.18.0 and trunk.
>
> Anyone have any ideas?  In the meantime I'm going to try to dig in to
> org.apache.hadoop.security.UnixUserGroupInformation along with best
> practices for using LoginException.  I'm also going to try and create a
> hadoop user and group and run the NameNode format process through that
> user.
>
> Thanks ahead of time for pulling me out of the trenches.
>
> Alex
> No virus found in this incoming message.
> Checked by AVG.
> Version: 8.0.100 / Virus Database: 270.6.9/1634 - Release Date:
> 8/25/2008 8:48 PM
>
>
>


Re: Load balancing in HDFS

2008-08-27 Thread Mork0075
Thanks for your reply. But by load balancing I mean the increased
access load. Some documents stored in HDFS could be very popular and
therefore accessed more often than others. Then HDFS needs to balance the
load - distribute the requests to different nodes. Is it possible?

lohit schrieb:
> If you have a fixed set of nodes in cluster and load data onto HDFS, it tries 
> to automatically balance the distribution across nodes by selecting random 
> nodes to store replicas. This has to be done with a client which is outside 
> the datanodes for random distribution. If you add new nodes to your cluster 
> or would like to rebalance your cluster you could use the rebalancer utility
> http://hadoop.apache.org/core/docs/current/hdfs_user_guide.html#Rebalancer
> 
> -Lohit
> 
> 
> - Original Message 
> From: Mork0075 <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Wednesday, August 27, 2008 12:54:48 AM
> Subject: Load balancing in HDFS
> 
> Hello,
> 
> i'am planning to use HDFS as a DFS in a web application evenvironment.
> There are two requirements: fault tolerence, which is ensured by the
> replicas and load balancing. Is load balancing part of HDFS and how is
> it configurable?
> 
> Thanks a lot
> 
> 



How can I debug the hadoop source code within eclipse

2008-08-27 Thread li luo
Hi, All
I have deployed Hadoop on my Linux computer as a single-node environment, and
everything is running well. Now I want to follow the execution of a Hadoop job
step by step in Eclipse. Please tell me how I can do that?
Thanks


Re: Load balancing in HDFS

2008-08-27 Thread lohit
If you have a fixed set of nodes in cluster and load data onto HDFS, it tries 
to automatically balance the distribution across nodes by selecting random 
nodes to store replicas. This has to be done with a client which is outside the 
datanodes for random distribution. If you add new nodes to your cluster or 
would like to rebalance your cluster you could use the rebalancer utility
http://hadoop.apache.org/core/docs/current/hdfs_user_guide.html#Rebalancer

-Lohit


- Original Message 
From: Mork0075 <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Wednesday, August 27, 2008 12:54:48 AM
Subject: Load balancing in HDFS

Hello,

I'm planning to use HDFS as a DFS in a web application environment.
There are two requirements: fault tolerance, which is ensured by the
replicas, and load balancing. Is load balancing part of HDFS, and how is
it configurable?

Thanks a lot



Re: Real use-case

2008-08-27 Thread Victor Samoylov
Jeff,

Thanks for help, I want to clarify several details:

1. I know this way to import files to HDFS, but it requires the user to access
the HDFS nodes directly. Is there another way to export all data files from the
data server side to the remote HDFS nodes without invoking tar?

2. I've set the replication factor to 2. How do I set up a 50 GB FS on one
data node?

Thanks,
Victor Samoylov

On Wed, Aug 27, 2008 at 3:10 AM, Jeff Payne <[EMAIL PROTECTED]> wrote:

> Victor:
>
> I think in your use case the best way to move the data into hadoop would
> either be to tar it up and move it to the same network the HDFS machines
> are
> on, untar it and then run...
>
>  hadoop dfs -put /contents-path /dfs-path
>
> If you only want a replication factor of 2 (the default is 3), open up the
> hadoop-site.xml file and add this snippet...
>
> <property>
>   <name>dfs.replication</name>
>   <value>2</value>
> </property>
>
> --
> Jeffrey Payne
> Lead Software Engineer
> Eyealike, Inc.
> [EMAIL PROTECTED]
> www.eyealike.com
> (206) 257-8708
>
>
> "Anything worth doing is worth overdoing."
> -H. Lifter
>
> On Tue, Aug 26, 2008 at 2:54 PM, Victor Samoylov <
> [EMAIL PROTECTED]
> > wrote:
>
> > Hi,
> >
> > I want to use HDFS as DFS to store files.  I have one data server with
> 50Gb
> > data and I plan to use 3 new machines with installed HDFS to duplicate
> this
> > data.
> > These 3 machines are: 1 name node, 2 data nodes. The duplication factor
> for
> > all files is 2.
> >
> > My questions are:
> > 1. How could I create 50 GB data node on one server? Actually I'm very
> > insteresting with setting 50 GB size for data node.
> > 2. What is the best way to export all data files from external server
> (ssh
> > access) to new ones with HDFS?
> >
> > Thanks,
> > Victor Samoylov
> >
>


Load balancing in HDFS

2008-08-27 Thread Mork0075
Hello,

I'm planning to use HDFS as a DFS in a web application environment.
There are two requirements: fault tolerance, which is ensured by the
replicas, and load balancing. Is load balancing part of HDFS, and how is
it configurable?

Thanks a lot