RE: streaming + binary input/output data?

2008-04-15 Thread John Menzer
Ah, I understand... that might be a problem! Well, in that case I would need to parse each base64-encoded line for the '\n' sequence before making any use of it and before adding my own '\n'. I am quite sure that this could become quite expensive, which in turn would reduce the
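
A purely illustrative sketch (Python) of the point under discussion: if each record is encoded with single-line base64 before being written to the stream, the encoded text can never contain '\n', so no scanning of the payload is needed before appending the record delimiter.

```python
import base64

def encode_record(payload: bytes) -> str:
    # Standard single-line base64 never emits '\n', so the encoded record
    # is safe to ship as one line in a line-oriented stream.
    return base64.b64encode(payload).decode("ascii")

def decode_record(line: str) -> bytes:
    # Strip the delimiter we added, then decode back to the raw bytes.
    return base64.b64decode(line.strip())

record = b"binary\ndata\x00with newlines"
line = encode_record(record)
assert "\n" not in line                       # no delimiter collision in the payload
assert decode_record(line + "\n") == record   # round-trips after adding our own '\n'
```

So the per-line scan the poster worries about is unnecessary as long as the encoder is the single-line variant (the older MIME-style encoders that wrap at 76 columns would reintroduce the problem).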

Re: [HADOOP-users] HowTo filter files for a Map/Reduce task over the same input folder

2008-04-15 Thread Alfonso Olias Sanz
It's addInputPath, which adds a Path object to the list of inputs. So do the filtering first, then add the paths in a loop. But I need an InputFormat anyway because I have my own RecordReader. In the end I have to put the same logic in a different place. From my point of view it is better for me
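
The "filter first, then add the paths in a loop" idea can be sketched like this (Python, illustrative only; the file list and pattern are hypothetical, and in the real job each surviving path would be passed to FileInputFormat.addInputPath):

```python
from fnmatch import fnmatch

def select_inputs(all_files, pattern):
    # Filter first; the caller then adds each surviving path in a loop,
    # the equivalent of one addInputPath call per Path object.
    return [f for f in all_files if fnmatch(f, pattern)]

files = ["/data/part-00000.log", "/data/part-00001.log", "/data/_SUCCESS"]
selected = select_inputs(files, "*.log")
assert selected == ["/data/part-00000.log", "/data/part-00001.log"]
```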

NameNode failed to start:port out of range:-1

2008-04-15 Thread 徐强
2008-04-15 18:38:41,756 INFO org.mortbay.util.Container: Started WebApplicationContext[/,/]
2008-04-15 18:38:41,756 INFO org.mortbay.util.Container: Started HttpContext[/logs,/logs]
2008-04-15 18:38:41,756 INFO org.mortbay.util.Container: Started HttpContext[/static,/static]
2008-04-15

Fwd: Getting a DataNode files list

2008-04-15 Thread Shimi K
Is there a way to get a list of files from a specific DataNode in a programmatic way?

_temporary doesn't exist

2008-04-15 Thread Grant Ingersoll
Hi, I am seeing: 08/04/15 08:21:13 INFO mapred.JobClient: Task Id : task_200804150637_0003_m_00_0, Status : FAILED java.io.IOException: The directory hdfs://localhost:9000/user/gsi/20newsOutput/_temporary doesnt exist at org.apache.hadoop.mapred.TaskTracker

RE: Reduce Output

2008-04-15 Thread Natarajan, Senthil
Thanks Ted, that worked. I have one more question. Now the Reduce output looks something like this: K1 v1 v1 v1 K2 v2 v3 v3 v2 v2. I would like to have it this way: K1 v1(3) K2 v2(3) v3(2). Example: 8.14.0.2_12904 371 371 371 1.7.0.1_50098468 468 468 468 371

RE: _temporary doesn't exist

2008-04-15 Thread Devaraj Das
Hi Grant, could you please copy and paste the exact command you used to run the program? The associated config files, etc. will also help.

jobtracker can be started but NameNode failed to startup.java.lang.IllegalArgumentException: port out of range:-1

2008-04-15 Thread Skater
Hello guys: I followed the tutorial but finally get the following error. Could you help me? Is http://hadoop.apache.org/core/docs/current/quickstart.html#SingleNodeSetup out of date? I tried many times but still get the following problem.

Query

2008-04-15 Thread Prerna Manaktala
I tried to set up Hadoop with Cygwin according to this paper: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873. But I had problems working with DynDNS. I created a new host there, prerna.dyndns.org, and gave its IP address in hadoop-ec2-env.sh as the value of MASTER_HOST. But

Archive

2008-04-15 Thread Chaman Singh Verma
Hello, how can I browse through the archive of Hadoop users? Every time I try, I get the following message: Not Found - The requested URL /mail/core-user/ was not found on this server. - Apache/2.2.8 (Unix) Server at hadoop.apache.org Port 80

Re: Archive

2008-04-15 Thread Adrian Woodhead
Yes, it's been like this for around a week or so; nobody has given an ETA on when it will be fixed, I'm afraid. I was told to use the nabble archive instead: http://www.nabble.com/Hadoop-core-user-f30590.html Chaman Singh Verma wrote: Hello How can I see browse through the Archive of Hadoop

Large Weblink Graph

2008-04-15 Thread Chaman Singh Verma
Hello, does anyone have a large weblink graph? I want to experiment and benchmark MapReduce with some real dataset. Thanks. With regards, Chaman Singh Verma, Poona, India

Re: Large Weblink Graph

2008-04-15 Thread Ted Dunning
Please include the Mahout sub-project when you report what you find. This kind of dataset would be very helpful for that project as well. And you might find something helpful there as well. The goal is to support machine learning on hadoop. On 4/15/08 8:29 AM, Chaman Singh Verma [EMAIL

Re: Reduce Output

2008-04-15 Thread Ted Dunning
Just count the items in your reducer. On 4/15/08 6:18 AM, Natarajan, Senthil [EMAIL PROTECTED] wrote: Thanks Ted that worked. I have one more question. Now I have the Reduce output is something like this. K1 v1 v1 v1 K2 v2 v3 v3 v2 v2 I would like to have it in this way
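
Ted's suggestion can be sketched like this (Python, illustrative only; in a real job this logic would sit inside the reduce method, with each line emitted rather than returned):

```python
from collections import Counter

def reduce_counts(key, values):
    # Count occurrences of each value for this key and emit "value(count)"
    # pairs, in first-seen order, matching the requested output format.
    counts = Counter(values)
    return key + " " + " ".join(f"{v}({n})" for v, n in counts.items())

assert reduce_counts("K1", ["v1", "v1", "v1"]) == "K1 v1(3)"
assert reduce_counts("K2", ["v2", "v3", "v3", "v2", "v2"]) == "K2 v2(3) v3(2)"
```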

Re: Large Weblink Graph

2008-04-15 Thread Paco NATHAN
Another site which has data sets available for study is UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/ On Tue, Apr 15, 2008 at 8:29 AM, Chaman Singh Verma [EMAIL PROTECTED] wrote: Does anyone have large Weblink graph ? I want to experiment and benchmark MapReduce with

Re: Large Weblink Graph

2008-04-15 Thread Chaman Singh Verma
Thanks a lot Andrzej. csv Andrzej Bialecki [EMAIL PROTECTED] wrote: Ted Dunning wrote: Please include the Mahout sub-project when you report what you find. This kind of dataset would be very helpful for that project as well. And you might find something helpful there as well. The goal is

Page Ranking, Hadoop And MPI.

2008-04-15 Thread Chaman Singh Verma
Hello, after googling for many days I couldn't find an answer in any of the published reports on Google's ranking algorithm. Since Google uses GFS for fault-tolerance purposes, what communication libraries might they be using to solve such a large matrix? I presume that standard

How can I use counters in Hadoop

2008-04-15 Thread CloudyEye
Hi, I am a newbie to Hadoop and would be thankful for your help. I've read that I can use the Reporter class to increase counters, this way: reporter.incrCounter(Enum args, long arg1); How can I get the values of those counters? My aim is to count the total inputs to the mappers, then i
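
For a Java job the accumulated values can be read back after the job via RunningJob.getCounters(), and they also show up in the JobTracker web UI. As an illustrative sketch only (Python, in the Hadoop Streaming style, where a task increments a counter by writing a specially formatted line to stderr; the group and counter names here are made up):

```python
import sys

def incr_counter(group, counter, amount=1, stream=sys.stderr):
    # Hadoop Streaming scans the task's stderr for lines of the form
    # "reporter:counter:<group>,<counter>,<amount>" and adds them to the
    # job's counters.
    stream.write(f"reporter:counter:{group},{counter},{amount}\n")

for line in ["a", "b", "c"]:
    incr_counter("MyApp", "MAP_INPUT_RECORDS_SEEN")
```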

Question about reporting progress in mapper tasks. 0.15.3

2008-04-15 Thread Jason Venner
I have a mapper that does extensive computation for each task. In the computation, I increment a counter once per major operation (about once every 5 seconds). I can see this happening from the log messages that appear around the reporter.incrCounter call. Still, my mapper is getting killed

Re: How can I use counters in Hadoop

2008-04-15 Thread stack
https://issues.apache.org/jira/browse/HBASE-559 has an example. Ignore the HBase stuff. What's important is the enum at the head of the MR job class, the calls to Reporter inside the tasks, and the properties file -- both how it's named and that it ends up in the generated job jar. St.Ack

MapReduce: Two Reduce Tasks

2008-04-15 Thread Chaman
Hello, I am developing some applications in which I want to send the output of Map to 3-4 different Reduce tasks. What is the best way to accomplish this? Thanks. With regards, csv

Re: MapReduce: Two Reduce Tasks

2008-04-15 Thread Theodore Van Rooy
I think you just want to set your reduce tasks parameter in Hadoop Streaming to 3 or 4, and make sure that all the other settings won't push it over 3 or 4. Why do you want just 3 or 4? Have you determined that to be the optimal number of reduces? On Tue, Apr 15, 2008 at 11:49 AM, Chaman
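
How map output is routed to however many reduce tasks you configure can be sketched as follows (Python, illustrative; this mirrors the default HashPartitioner idea of hashing the key modulo numReduceTasks, so every occurrence of a key lands in the same reduce task):

```python
def partition(key: str, num_reduces: int) -> int:
    # Mask to a non-negative value, then take the remainder, so the
    # partition index is always in [0, num_reduces).
    return (hash(key) & 0x7FFFFFFF) % num_reduces

num_reduces = 4
buckets = {}
for key in ["apple", "banana", "apple", "cherry"]:
    buckets.setdefault(partition(key, num_reduces), []).append(key)

# The same key always maps to the same partition within a run.
assert partition("apple", num_reduces) == partition("apple", num_reduces)
assert all(0 <= p < num_reduces for p in buckets)
```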

Urgent

2008-04-15 Thread Prerna Manaktala
I tried to set up Hadoop with Cygwin according to this paper: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873. But I had problems working with DynDNS. I created a new host there, prerna.dyndns.org, and gave its IP address in hadoop-ec2-env.sh as the value of MASTER_HOST.

EigenValue Calculations, Hadoop and MPI.

2008-04-15 Thread Chaman Singh Verma
Hello, after googling for many days I couldn't find an answer in any of the published reports on Google's ranking algorithm. Since Google uses GFS for fault-tolerance purposes, what communication libraries might they be using to solve such a large matrix? I presume that standard

Re: Page Ranking, Hadoop And MPI.

2008-04-15 Thread Ted Dunning
Power law algorithms are ideal for this kind of parallelized problem. The basic idea is that hub and authority style algorithms are intimately related to eigenvector or singular value decompositions (depending on whether the links are symmetrical). This also means that there is a close
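
A minimal sketch of the eigenvector idea (Python, toy-sized, not Google's actual implementation; each power-iteration pass corresponds to one MapReduce round: map distributes a page's rank along its out-links, reduce sums the contributions per page):

```python
def pagerank(links, damping=0.85, iters=50):
    # links: page -> list of pages it links to (every page has out-links here).
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        # Teleport term, then add each page's rank split over its out-links.
        new = {p: (1 - damping) / len(pages) for p in pages}
        for page, outs in links.items():
            for target in outs:
                new[target] += damping * rank[page] / len(outs)
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
r = pagerank(links)
assert abs(sum(r.values()) - 1.0) < 1e-6   # total rank is conserved
assert r["c"] > r["b"]                     # c is linked from both a and b
```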

Re: Page Ranking, Hadoop And MPI.

2008-04-15 Thread Chaman Singh Verma
Hello, that was a wonderful explanation of the role and beauty of eigenvalues in ranking. But I am still far from the real answer/hint. How does Google handle such a large matrix and solve it? Do they use the MapReduce framework for this, or standard and reliable Message Passing

Re: Page Ranking, Hadoop And MPI.

2008-04-15 Thread Ted Dunning
On 4/15/08 11:59 AM, Chaman Singh Verma [EMAIL PROTECTED] wrote: How Google handle such a large matrix and solve it ? Do they use MapReduce framework for these process or adopt standard and reliable Message Passing Interface/RPC etc for this task ? They use map-reduce. What about the

Re: multiple datanodes in the same machine

2008-04-15 Thread Ted Dunning
Why do you want to do this perverse thing? How does it help to have more than one datanode per machine? And what in the world is better when you have 10? On 4/15/08 12:53 PM, Cagdas Gerede [EMAIL PROTECTED] wrote: I have a follow-up question, Is there a way to programatically configure

Re: multiple datanodes in the same machine

2008-04-15 Thread cagdas . gerede
Testing when I do not have 10 machines. On 4/15/08, Ted Dunning [EMAIL PROTECTED] wrote: Why do you want to do this perverse thing? How does it help to have more than one datanode per machine? And what in the world is better when you have 10? On 4/15/08 12:53 PM, Cagdas Gerede [EMAIL

Re: Urgent

2008-04-15 Thread Norbert Burger
You need ssh working properly to continue. It sounds like the ssh server isn't listening on port 22. Have you configured it using ssh-host-config? (this is Cygwin-specific) See the 'Windows Users' section on http://wiki.apache.org/hadoop/QuickStart. On Tue, Apr 15, 2008 at 3:28 PM, Prerna

Re: multiple datanodes in the same machine

2008-04-15 Thread Cagdas Gerede
I am working on Distributed File System part. I do not use MR part, and I need to run multiple processes to test some scenarios on the file system. On Tue, Apr 15, 2008 at 1:37 PM, Ted Dunning [EMAIL PROTECTED] wrote: I have had no issues in scaling the number of datanodes. The location of

Re: multiple datanodes in the same machine

2008-04-15 Thread Theodore Van Rooy
Why do you want to do this perverse thing? - agreed. It sounds like even in your testing you'll not really get the full effect of what you're wanting to test. When you have two installations on the same machine it's likely that the network latency and other issues that occur when

Re: multiple datanodes in the same machine

2008-04-15 Thread Ted Dunning
And the two instances will affect each other significantly so that they will tend to serialize. On 4/15/08 3:24 PM, Theodore Van Rooy [EMAIL PROTECTED] wrote: Why do you want to do this perverse thing? -agreed. It sounds like even in your testing that you'll not really get the full

Re: Urgent

2008-04-15 Thread Prerna Manaktala
Hey, I am working with the EC2 environment. I registered and am being billed for EC2 and S3. Right now I have two Cygwin windows open: one as an administrator/server (on which sshd is running), in which I have a separate folder for Hadoop files and am able to run bin/hadoop; one as a normal user/client.

Re: EigenValue Calculations, Hadoop and MPI.

2008-04-15 Thread Edward J. Yoon
Have you seen the book Google's PageRank and Beyond? :) They might be using MapReduce ... I don't think Map/Reduce is an advanced parallel computing model, but I agree with you. Have you seen the Hama proposal? (http://wiki.apache.org/incubator/HamaProposal) I'll present ideas about Hama

Re: Question about reporting progress in mapper tasks. 0.15.3 - solved

2008-04-15 Thread Jason Venner
Well, on deeper reading of the code and the documentation, reporter.progress() is the required call. Jason Venner wrote: I have a mapper that for each task does extensive computation. In the computation, I increment a counter once per major operation (about once every 5 seconds). I can see
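
In a streaming job the equivalent keepalive is a status line written to stderr. A hedged sketch (Python, with a stand-in for the expensive computation; the status text is arbitrary):

```python
import sys

def long_computation(chunks, stream=sys.stderr):
    results = []
    for i, chunk in enumerate(chunks):
        results.append(chunk * 2)  # stand-in for the expensive per-chunk step
        # A "reporter:status:..." line on stderr updates the task status and,
        # like reporter.progress() in the Java API, signals liveness so the
        # TaskTracker does not kill a slow but still-working mapper.
        stream.write(f"reporter:status:processed {i + 1} chunks\n")
    return results
```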

adding nodes to an EC2 cluster

2008-04-15 Thread Stephen J. Barr
Hello, Does anyone have any experience adding nodes to a cluster running on EC2? If so, is there some documentation on how to do this? Thanks, -stephen

Re: adding nodes to an EC2 cluster

2008-04-15 Thread Chris K Wensel
Stephen, check out the patch in HADOOP-2410 to the contrib/ec2 scripts: https://issues.apache.org/jira/browse/HADOOP-2410 (just grab the ec2.tgz attachment). These scripts allow you to dynamically grow your cluster, plus some extra goodies. You will need to use them to build your own AMI; they

Re: adding nodes to an EC2 cluster

2008-04-15 Thread Stephen J. Barr
Thank you. I will check that out. I haven't built an AMI before. Hopefully it isn't too complicated, as it is easy to use the pre-built AMIs. -stephen Chris K Wensel wrote: Stephen Check out the patch in Hadoop-2410 to the contrib/ec2 scripts

Re: Reading Configuration File

2008-04-15 Thread Shimi K
Just put it in the classpath On Tue, Apr 15, 2008 at 11:50 PM, Natarajan, Senthil [EMAIL PROTECTED] wrote: Hi, How to read configuration file in Hadoop. I tried by copying the file in HDFS and also placing within the jar file. I tried like this in Map constructor Configuration conf = new

Re: Reading Configuration File

2008-04-15 Thread Amar Kamat
Natarajan, Senthil wrote: Hi, How to read configuration file in Hadoop. I tried by copying the file in HDFS and also placing within the jar file. Do you intend to read the job's config file or a separate file? In the case of accessing the job-specific config, override the configure(JobConf)

Re: adding nodes to an EC2 cluster

2008-04-15 Thread Chris K Wensel
I'm unsure of your particular problem, but the scripts/patch I referenced previously remove any dependency on DynDNS. The recipe would be something like: make an S3 bucket and update hadoop-ec2-env.sh; make an image: hadoop-ec2 create-image; make a 2-node (3-machine) cluster: hadoop-ec2