RE: Questions regarding configuration parameters...

2008-02-22 Thread Tim Wintle
I have had exactly the same problem with using the command line to cat files - they can take ages, although I don't know why. Network utilisation does not seem to be the bottleneck, though. (Running 0.15.3) Is the slow part of the reduce while you are waiting for the map data to copy over to

Re: Sorting output data on value

2008-02-22 Thread Owen O'Malley
On Feb 21, 2008, at 11:01 PM, Ted Dunning wrote: But this only guarantees that the results will be sorted within each reducer's input. Thus, this won't result in getting the results sorted by the reducer's output value. I thought the question was how to get the values sorted within a call

Re: Namenode fails to re-start after cluster shutdown

2008-02-22 Thread André Martin
Hi Raghu, done: https://issues.apache.org/jira/browse/HADOOP-2873 Subsequent tries did not succeed - so it looks like I need to re-format the cluster :-( Cu on the 'net, Bye - bye, André èrbnA Raghu Angadi wrote: Please file a jira

Re: Sorting output data on value

2008-02-22 Thread Tarandeep Singh
On Fri, Feb 22, 2008 at 5:46 AM, Owen O'Malley [EMAIL PROTECTED] wrote: On Feb 21, 2008, at 11:01 PM, Ted Dunning wrote: But this only guarantees that the results will be sorted within each reducer's input. Thus, this won't result in getting the results sorted by the reducer's

What to use instead of globPaths (deprecated)?

2008-02-22 Thread Josh Snyder
Hi, In the current API documentation, FileSystem.globPaths is marked as deprecated. However, I couldn't figure out what I could use in its place. What is the preferred alternative to globPaths? I'm new to this list and to Hadoop, so I apologize if this is obvious -- but grepping + skimming

Re: What to use instead of globPaths (deprecated)?

2008-02-22 Thread Josh Snyder
Sorry, I'm an idiot. Following the law that says one figures it out immediately on pestering others -- globStatus will do it. Thanks, Josh On 2/22/08, Josh Snyder [EMAIL PROTECTED] wrote: Hi, In the current API documentation, FileSystem.globPaths is marked as deprecated. However, I

Calculations involve large datasets

2008-02-22 Thread Chuck Lan
Hi, I'm currently looking into how to better scale the performance of our calculations involving large sets of financial data. It is currently using a series of Oracle SQL statements to perform the calculations. It seems to me that the MapReduce algorithm may work in this scenario. However, I

RE: Questions regarding configuration parameters...

2008-02-22 Thread C G
Guys: Thanks for the information...I've gotten some pretty good results twiddling some parameters. I've also reminded myself about the pitfalls of oversubscribing resources (like number of reducers). Here's what I learned, written up here to hopefully help somebody later... I set

Re: Problem with LibHDFS

2008-02-22 Thread Arun C Murthy
On Feb 21, 2008, at 3:29 AM, Raghavendra K wrote: Hi, I am able to get Hadoop running and also able to compile the libhdfs. But when I run the hdfs_test program it is giving Segmentation Fault. Unfortunately the documentation for using libhdfs is sparse, our apologies. You'll need

Re: Calculations involve large datasets

2008-02-22 Thread Amar Kamat
See http://incubator.apache.org/pig/. Hope that helps. Not sure how joins could be done in Hadoop. Amar On Fri, 22 Feb 2008, Chuck Lan wrote: Hi, I'm currently looking into how to better scale the performance of our calculations involving large sets of financial data. It is currently using a

Re: Calculations involve large datasets

2008-02-22 Thread Tim Wintle
Have you seen PIG: http://incubator.apache.org/pig/ It generates hadoop code and is more query like, and (as far as I remember) includes union, join, etc. Tim On Fri, 2008-02-22 at 09:13 -0800, Chuck Lan wrote: Hi, I'm currently looking into how to better scale the performance of our

Re: Sorting output data on value

2008-02-22 Thread Doug Cutting
Tarandeep Singh wrote: but isn't the output of reduce step sorted ? No, the input of reduce is sorted by key. The output of reduce is generally produced as the input arrives, so is generally also sorted by key, but reducers can output whatever they like. Doug
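
A minimal sketch of the behaviour described in this reply, written as a plain-Python simulation and not as Hadoop code. The helper names `run_reduce` and `sum_reducer` and the sample data are hypothetical. It shows a reducer that emits one record per key as its input arrives: the output comes out ordered by key, but nothing orders it by value.

```python
from itertools import groupby
from operator import itemgetter

def run_reduce(map_output, reducer):
    """Simulate the reduce phase: sort the input by key, then call the reducer per key."""
    ordered = sorted(map_output, key=itemgetter(0))  # stands in for the framework's sort
    out = []
    for key, group in groupby(ordered, key=itemgetter(0)):
        out.extend(reducer(key, [value for _, value in group]))
    return out

def sum_reducer(key, values):
    """Hypothetical reducer: emit one (key, total) record as soon as the key's values arrive."""
    yield key, sum(values)

# Hypothetical map output, in arbitrary order.
map_output = [("b", 1), ("a", 5), ("c", 2), ("a", 4), ("b", 1)]
result = run_reduce(map_output, sum_reducer)

keys = [k for k, _ in result]
values = [v for _, v in result]
print(keys == sorted(keys))      # output follows key order
print(values == sorted(values))  # but is not ordered by value
```

In this sketch, ordering by value would need a further step, for example a second pass that uses the value as the key; that step is an assumption here and not something stated in the reply.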

Re: How to split the hdfs in different subgroups

2008-02-22 Thread Raghu Angadi
You could probably treat these two groups as different racks. You can read about rack awareness in http://hadoop.apache.org/core/docs/r0.16.0/hdfs_user_guide.html , and follow the links from there for more information regarding how to configure it, etc. Raghu. [EMAIL PROTECTED] wrote: Hi There, I

Re: Hadoop summit / workshop at Yahoo!

2008-02-22 Thread Stefan Groschupf
Puhh, 2 days and it is full? Does Yahoo have no rooms bigger than one for just 100 people? On Feb 20, 2008, at 12:10 PM, Ajay Anand wrote: The registration page for the Hadoop summit is now up: http://developer.yahoo.com/hadoop/summit/ Space is limited, so please sign up early if you are

Re: Namenode fails to re-start after cluster shutdown

2008-02-22 Thread Konstantin Shvachko
André, You can try to rollback. You did use upgrade when you switched to the new trunk, right? --Konstantin Raghu Angadi wrote: André Martin wrote: Hi Raghu, done: https://issues.apache.org/jira/browse/HADOOP-2873 Subsequent tries did not succeed - so it looks like I need to re-format the

Re: Namenode fails to re-start after cluster shutdown

2008-02-22 Thread Steve Sapovits
Raghu Angadi wrote: Please report such problems if you think it was because of HDFS, as opposed to some hardware or disk failures. Will do. I suspect it's something else. I'm testing on a notebook in pseudo-distributed mode (per the quick start guide). My IP changes when I take that box

RE: Namenode fails to re-start after cluster shutdown

2008-02-22 Thread dhruba Borthakur
If your file system metadata is in /tmp, then you are likely to see these kinds of problems. It would be nice if you could move the location of your metadata files away from /tmp. If you still see the problem, can you please send us the logs from the log directory? Thanks a bunch, Dhruba

Re: Calculations involve large datasets

2008-02-22 Thread Ted Dunning
Joins are easy. Just reduce on a key composed of the stuff you want to join on. If the data you are joining is disparate, leave some kind of hint about what kind of record you have. The reducer will be iterating through sets of records that have the same key. This is similar to the results
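
A minimal sketch of the reduce-side join described in this message, simulated in plain Python and not written against the Hadoop API. The function names, the `trade` and `price` source tags, and the sample records are all hypothetical. Each record is emitted under its join key together with a tag that says which source it came from (the "hint"), and the reducer pairs up records that share a key.

```python
from collections import defaultdict

def map_records(source_tag, records, key_field):
    """Emit (join_key, (source_tag, record)); the tag is the hint about the record kind."""
    for record in records:
        yield record[key_field], (source_tag, record)

def shuffle(pairs):
    """Group values by key, standing in for the framework's shuffle and sort."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items(), key=lambda item: item[0])

def reduce_join(key, tagged_records):
    """Combine every 'trade' record with every 'price' record that shares the key."""
    trades = [rec for tag, rec in tagged_records if tag == "trade"]
    prices = [rec for tag, rec in tagged_records if tag == "price"]
    for trade in trades:
        for price in prices:
            yield {**trade, **price}

# Hypothetical input from two disparate sources.
trades = [{"symbol": "AAA", "qty": 10}, {"symbol": "BBB", "qty": 5}]
prices = [{"symbol": "AAA", "price": 2.5}]

pairs = list(map_records("trade", trades, "symbol"))
pairs += list(map_records("price", prices, "symbol"))

joined = [row for key, values in shuffle(pairs) for row in reduce_join(key, values)]
print(joined)
```

This sketch behaves like an inner join: `BBB` has no matching price record, so it produces no output row. Emitting unmatched records as well would turn it into an outer join.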

Re: Problems running a HOD test cluster

2008-02-22 Thread Jason Venner
We have been unable to get torque up and running. The magic value in the server_name file seems to elude us. We have tried localhost, 127.0.0.1, machine name, machine ip, fq machine name. Depending on what we use, we either get Unauthorized request or invalid entry qmgr obj= svr=default: Bad

RE: How to split the hdfs in different subgroups

2008-02-22 Thread xavier.quintuna
I read the docs about rack awareness but my issue is how the client can pick some specific datanodes, which are located in some specific rack, to write the block there. The idea is that the client is able to write the block in two separated groups of datanodes in the same hdfs. For instance:

RE: Hadoop summit / workshop at Yahoo!

2008-02-22 Thread xavier.quintuna
I agree, I'd love to be part of this but the rooms are full. Xavier -Original Message- From: Stefan Groschupf [mailto:[EMAIL PROTECTED] Sent: Friday, February 22, 2008 11:04 AM To: core-user@hadoop.apache.org Subject: Re: Hadoop summit / workshop at Yahoo! Puhh, 2 days and it is full?

RE: Calculations involve large datasets

2008-02-22 Thread Runping Qi
There is a package for joining data from multiple sources: contrib/data-join. It implements the basic joining logic and allows the user to provide application specific logic for filtering/projecting and combining multiple records into one. Runping -Original Message- From: Ted

Problems with NFS share in dfs.name.dir

2008-02-22 Thread Nathan Wang
Hi, We're having problems when trying to deal with the namenode failover, by following the wiki http://wiki.apache.org/hadoop/NameNodeFailover If we point dfs.name.dir to 2 local directories, it works fine. But, if one of the directories is NFS mounted, we're having these problems: 1)

Re: How to split the hdfs in different subgroups

2008-02-22 Thread Raghu Angadi
[EMAIL PROTECTED] wrote: I read the docs about rack awareness but my issue is how the client can pick some specific datanodes, which are located in some specific rack, to write the block there. The idea is that the client is able to write the block in two separated groups of datanodes in the

how to use two reduce functions?

2008-02-22 Thread ma qiang
Hi all, I have a program that needs to use two reduce functions. Who can tell me how? Thank you! Qiang