possibility to start reducer only after mapper completed certain percentage

2008-08-27 Thread Pallavi Palleti
Hi, I have a dependency on the mapper jobs' completion time in my reducer's configure method, where I am building a dictionary by collecting it in pieces from the mapper jobs. I will be using this dictionary in the reduce() method. Can someone please help me if I can put a constraint over reducer

Load balancing in HDFS

2008-08-27 Thread Mork0075
Hello, I'm planning to use HDFS as a DFS in a web application environment. There are two requirements: fault tolerance, which is ensured by the replicas, and load balancing. Is load balancing part of HDFS, and how is it configurable? Thanks a lot

Re: Real use-case

2008-08-27 Thread Victor Samoylov
Jeff, Thanks for the help. I want to clarify several details: 1. I know this way to import files to HDFS, but it involves the user directly accessing HDFS nodes. Is there another way to export all data files from the data server side to remote HDFS nodes without a tar invocation? 2. I've set up

Re: Load balancing in HDFS

2008-08-27 Thread lohit
If you have a fixed set of nodes in the cluster and load data onto HDFS, it tries to balance the distribution automatically across nodes by selecting random nodes to store replicas. This has to be done with a client that is outside the datanodes to get random distribution. If you add new nodes to
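When new, empty nodes are added, HDFS does not rebalance existing blocks on its own; the bundled balancer tool does that. A command-line sketch (script name and flag as shipped with Hadoop 0.16+; check your version's documentation):

```shell
# Move blocks from over-full to under-full datanodes until every node's
# utilization is within 10 percentage points of the cluster average.
bin/start-balancer.sh -threshold 10
```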

How can I debug the hadoop source code within eclipse

2008-08-27 Thread li luo
Hi all, I have deployed Hadoop on my Linux computer as a single-node environment, and everything is running well. Now I want to follow the execution of a Hadoop job step by step in Eclipse. Please tell me how I can do that? Thanks
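A common approach, sketched here as an assumption rather than an official recipe: run the job through Eclipse in local mode, or attach Eclipse's remote debugger to the task JVM by adding standard JDWP options. `mapred.child.java.opts` is a real Hadoop property; the port number is arbitrary:

```xml
<!-- hadoop-site.xml: make each task JVM pause until a debugger attaches
     on port 8000 (standard JVM JDWP options; use only on a test cluster) -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=8000</value>
</property>
```

In Eclipse, a "Remote Java Application" debug configuration pointed at that port then lets you step through the task.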

Re: too many fetch-failures

2008-08-27 Thread Edward J. Yoon
when i run example wordcount i have problem like this: Is wordcount a Hadoop example, or your code? On 8/16/08, tran thien [EMAIL PROTECTED] wrote: hi everyone, I am using hadoop 0.17.1. There are 2 nodes: one master (also a slave) and one slave. When I run the wordcount example I have a problem

Re: Load balancing in HDFS

2008-08-27 Thread Allen Wittenauer
On 8/27/08 12:54 AM, Mork0075 [EMAIL PROTECTED] wrote: I'm planning to use HDFS as a DFS in a web application environment. There are two requirements: fault tolerance, which is ensured by the replicas, and load balancing. There is a SPOF in the form of the name node. So depending

Re: Load balancing in HDFS

2008-08-27 Thread Mork0075
This sounds really interesting. And when increasing the replicas for certain files, does the available throughput for these files increase too? Allen Wittenauer wrote: On 8/27/08 12:54 AM, Mork0075 [EMAIL PROTECTED] wrote: I'm planning to use HDFS as a DFS in a web application environment.

how use only a reducer without a mapper

2008-08-27 Thread Leandro Alvim
Hi, I need help if it's possible. My name is Leandro Alvim and I'm a computer science graduate in Brazil. So, I'm using Hadoop in my university project, and I used your tutorials to learn how to install and run a simple test with Python and Hadoop. Writing my application, I faced a problem that

Re: how use only a reducer without a mapper

2008-08-27 Thread Miles Osborne
Streaming has the ability to accept multiple directories as input, so that would enable you to merge two directories (--is this an assignment? ...) Miles 2008/8/27 Leandro Alvim [EMAIL PROTECTED] Hi, I need help if it's possible. My name is Leandro Alvim and I'm a computer

Re: Why is scaling HBase much simpler then scaling a relational db?

2008-08-27 Thread Edward J. Yoon
Hi, Planet-scale data exploration and data mining operations will almost always need to include some sequential scans. Then, how can we speed up sequential scans? The BigTable paper shows that: * Column-oriented storage (it reduces I/O) * Data compression * PDP (parallel distributed processing)
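A toy Python illustration of the first point above (a sketch of the principle only, not BigTable's actual design): scanning one column of a column-oriented layout touches only that column's values, while a row-oriented scan has to read every field of every row.

```python
def scan_column_rowstore(rows, col):
    """Row store: every full row is read just to extract one field."""
    bytes_read = sum(len(str(v)) for row in rows for v in row)
    values = [row[col] for row in rows]
    return values, bytes_read

def scan_column_colstore(columns, col):
    """Column store: only the requested column's values are read."""
    values = list(columns[col])
    bytes_read = sum(len(str(v)) for v in values)
    return values, bytes_read
```

The wider the rows, the bigger the I/O gap in favor of the column store; compression then shrinks each column further because values in one column tend to be similar.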

IsolationRunner [was Re: extracting input to a task from a (streaming) job?]

2008-08-27 Thread Yuri Pradkin
I posted this a while back and have been wondering whether I missed something and the doc is out of date or this is a bug and I should file a jira. Is there anyone out there who is successfully using IsolationRunner? Please let me know. Thanks, -Yuri On Friday 08 August 2008 10:09:48

RE: Why is scaling HBase much simpler then scaling a relational db?

2008-08-27 Thread Jonathan Gray
Discussion inline. Your example with the friends makes perfect sense. Can you imagine a scenario where storing the data in a column-oriented instead of a row-oriented DB (so, if you will, a counterexample) causes such a huge performance mismatch, like the friends one in the row/column comparison?

Re: possibility to start reducer only after mapper completed certain percentage

2008-08-27 Thread Owen O'Malley
On Tue, Aug 26, 2008 at 11:54 PM, Pallavi Palleti [EMAIL PROTECTED]wrote: Where, I am building a dictionary by collecting the same in pieces from mapper jobs. I will be using this dictionary in reduce() method. Can some one please help me if I can put a constraint over reducer startup time?
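Two points usually come up in answers to this question: reduce() itself never runs until every map has finished (the shuffle must complete first), and the point at which reducer tasks are *launched* is tunable in later Hadoop releases. A hedged sketch of that property (name as in Hadoop 1.x mapred-default.xml; older releases such as 0.17/0.18 may not support it):

```xml
<!-- Launch reducers only after 80% of map tasks have completed,
     instead of the much earlier default, so shuffle slots are not
     tied up while maps are still running. -->
<property>
  <name>mapred.reduce.slowstart.completed.maps</name>
  <value>0.80</value>
</property>
```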

Re: JobTracker Web interface sometimes does not display in IE7

2008-08-27 Thread Owen O'Malley
Also consider making a patch to fix the behavior and submit it.

Re: how use only a reducer without a mapper

2008-08-27 Thread Richard Tomsett
Leandro Alvim wrote: How can i use only a reduce without map? I don't know if there's a way to run just a reduce task without a map stage, but you could do it by having a map stage just using the IdentityMapper class (which passes the data through to the reducers unchanged), so
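In streaming terms (the original poster was using Python), an identity mapper is just a pass-through; a minimal sketch, assuming tab-separated key/value lines arriving on stdin:

```python
import sys

def identity_map(lines):
    """Pass each input line through unchanged, like Hadoop's
    IdentityMapper, so all the real work happens in the reducer."""
    for line in lines:
        yield line.rstrip("\n")

if __name__ == "__main__":
    # In a streaming job this script would be supplied as the -mapper.
    for out in identity_map(sys.stdin):
        print(out)
```

In practice, streaming users often just pass `-mapper cat` for the same effect.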

Re: how use only a reducer without a mapper

2008-08-27 Thread Jason Venner
The down side of this (which appears to be the only way) is that your entire input data set has to pass through the identity mapper and then go through shuffle and sort before it gets to the reducer. If you have a large input data set, this takes real resources - cpu, disk, network and wall

Re: how use only a reducer without a mapper

2008-08-27 Thread Owen O'Malley
On Wed, Aug 27, 2008 at 9:27 AM, Jason Venner [EMAIL PROTECTED] wrote: The down side of this (which appears to be the only way) is that your entire input data set has to pass through the identity mapper and then go through shuffle and sort before it gets to the reducer. If you don't need the

Re: Real use-case

2008-08-27 Thread Jeff Payne
In order to do anything other than a tar transfer (which is a kludge, of course), you'll need to open up the relevant ports between the client and the Hadoop cluster. I may be missing a few here, but I believe these would include port 50010 for the datanodes and whatever port the namenode is listening

Re: questions on sorting big files and sorting order

2008-08-27 Thread Tarandeep Singh
On Tue, Aug 26, 2008 at 7:50 AM, Owen O'Malley [EMAIL PROTECTED] wrote: On Tue, Aug 26, 2008 at 12:39 AM, charles du [EMAIL PROTECTED] wrote: I would like to sort a large number of records in a big file based on a given field (key). The property you are looking for is a total order and
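The "total order" property Owen refers to means each reducer receives a disjoint, ordered key range, so the sorted partition files concatenate into one globally sorted file. A small Python sketch of the idea behind Hadoop's TotalOrderPartitioner (the sampling and cut-point selection here are simplified assumptions, not Hadoop's actual sampler; it assumes at least as many sample keys as partitions):

```python
def make_cutpoints(sample_keys, num_partitions):
    """Pick num_partitions-1 split points from a sorted sample of keys."""
    s = sorted(sample_keys)
    step = len(s) // num_partitions
    return [s[(i + 1) * step] for i in range(num_partitions - 1)]

def partition(key, cutpoints):
    """Assign a key to the first range whose upper cut point exceeds it."""
    for i, cp in enumerate(cutpoints):
        if key < cp:
            return i
    return len(cutpoints)

def total_order_sort(records, num_partitions):
    """Range-partition, sort each partition, then concatenate:
    the result is globally sorted because the ranges are disjoint."""
    cuts = make_cutpoints(records, num_partitions)
    parts = [[] for _ in range(num_partitions)]
    for k in records:
        parts[partition(k, cuts)].append(k)
    out = []
    for p in parts:
        out.extend(sorted(p))
    return out
```

With a hash partitioner, by contrast, each partition is sorted internally but the partitions do not concatenate into a single sorted file.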

Design of the new job tracker

2008-08-27 Thread Yiping Han
Hi, I want to know where the detailed description of the next-generation job tracker, which replaces HOD, can be found? Thanks~ --Yiping Han

Optimizations

2008-08-27 Thread Yih Sun Khoo
Optimizations Right now I have a job whose reducer phase outputs the key-value pairs as records into a database. Is this the best way to be loading the database? What are some alternatives?

MultipleOutputFormat versus MultipleOutputs

2008-08-27 Thread Shirley Cohen
Hi, I would like the reducer to output to different files based upon the value of the key. I understand that both MultipleOutputs and MultipleOutputFormat can do this. Is that correct? However, I don't understand the differences between these two classes. Can someone explain the

Re: Optimizations

2008-08-27 Thread Edward J. Yoon
Recently, most DBMSs have offered a bulk insert mechanism instead of a transaction per record. Check it out. -Edward On Thu, Aug 28, 2008 at 6:27 AM, Yih Sun Khoo [EMAIL PROTECTED] wrote: Optimizations Right now I have a job whose reducer phase outputs the key-value pairs as records into a database. Is this
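A minimal sketch of the batched-insert idea Edward describes, using Python's stdlib sqlite3 (your actual DBMS and driver will differ; the table and column names here are made up for illustration):

```python
import sqlite3

def bulk_load(rows, db_path=":memory:"):
    """Insert reducer output rows in one batched transaction
    instead of committing each record individually."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS counts (word TEXT, n INTEGER)")
    # executemany sends all rows inside a single transaction; with most
    # DBMSs this is far faster than autocommitting every INSERT.
    conn.executemany("INSERT INTO counts VALUES (?, ?)", rows)
    conn.commit()
    return conn
```

Most databases also have an out-of-band bulk loader (e.g. loading from a flat file), which is usually faster still: the reducer writes a plain file and the loader ingests it afterwards.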

Re: JobTracker Web interface sometimes does not display in IE7

2008-08-27 Thread Edward J. Yoon
Hi Owen, there is no solution on the server side. Instead, I think we can add some guidance on how to fix this, or explain this phenomenon. On Thu, Aug 28, 2008 at 1:16 AM, Owen O'Malley [EMAIL PROTECTED] wrote: Also consider making a patch to fix the behavior and submit it. -- Best regards,

RE: Design of the new job tracker

2008-08-27 Thread Vivek Ratan
There are a number of Jiras that modify the scheduling piece of the JobTracker. - 3412 refactors the scheduler code out of the JT to make schedulers more pluggable - 3445 and 3746 are a couple of new schedulers that do more than the default JT scheduler. Both these can be good alternatives to

Could not obtain block: blk_-2634319951074439134_1129 file=/user/root/crawl_debug/segments/20080825053518/content/part-00002/data

2008-08-27 Thread wangxu
Hi all, I am using hadoop-0.18.0-core.jar and nutch-2008-08-18_04-01-55.jar, and running Hadoop on one namenode and 4 slaves. Attached is my hadoop-site.xml; I didn't change the file hadoop-default.xml. When the data in segments is large, this kind of error occurs: java.io.IOException: Could not