Re: Seattle / PNW Hadoop + Lucene User Group?

2009-04-21 Thread Tushar Jain
I'm interested in joining too. Let me know when and where. We can always head down to a bar or coffee shop in Seattle too. Tushar On Mon, Apr 20, 2009 at 6:31 PM, Lauren Cooney laurencoo...@gmail.com wrote: If you guys are interested in space over in Redmond, I can see if MSFT can host. Let

How to run many jobs at the same time?

2009-04-21 Thread nguyenhuynh.mr
Hi all! I have some jobs: job1, job2, job3,... . Each job works with a group. To control the jobs I have JobControllers; each JobController controls the jobs of a specified group. Example: - 2 groups: g1 and g2 - 2 JobControllers: jController1, jController2 + jController1

Re: How to run many jobs at the same time?

2009-04-21 Thread Tom White
You need to start each JobControl in its own thread so they can run concurrently. Something like: Thread t = new Thread(jobControl); t.start(); Then poll the jobControl.allFinished() method. Tom On Tue, Apr 21, 2009 at 10:02 AM, nguyenhuynh.mr nguyenhuynh...@gmail.com wrote: Hi all!
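A minimal sketch of this approach, assuming the 0.19 org.apache.hadoop.mapred.jobcontrol API; the group names are placeholders and the jobs are assumed to have already been added to each controller, which is not shown in Tom's message:

    import org.apache.hadoop.mapred.jobcontrol.JobControl;

    public class RunTwoControllers {
      public static void main(String[] args) throws InterruptedException {
        // Hypothetical controllers; in practice the jobs of group g1 would be
        // added to jc1 and the jobs of group g2 to jc2 before starting.
        JobControl jc1 = new JobControl("g1");
        JobControl jc2 = new JobControl("g2");

        new Thread(jc1).start();
        new Thread(jc2).start();

        // Poll until both controllers report that all of their jobs are done.
        while (!jc1.allFinished() || !jc2.allFinished()) {
          Thread.sleep(1000);
        }
        jc1.stop();
        jc2.stop();
      }
    }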

Re: RPM spec file for 0.19.1

2009-04-21 Thread Steve Loughran
Ian Soboroff wrote: Steve Loughran ste...@apache.org writes: I think from your perspective it makes sense as it stops anyone getting itchy fingers and doing their own RPMs. Um, what's wrong with that? It's really hard to do good RPM spec files. If cloudera are willing to pay Matt to do

Re: How many people is using Hadoop Streaming ?

2009-04-21 Thread Steve Loughran
Tim Wintle wrote: On Fri, 2009-04-03 at 09:42 -0700, Ricky Ho wrote: 1) I can pick the language that offers a different programming paradigm (e.g. I may choose a functional language, or logic programming if they suit the problem better). In fact, I can even choose Erlang at the map() and

Re: Ec2 instability

2009-04-21 Thread Tim Hawkins
I would be interested in understanding what problems you are having, we are using 19.0 in production on EC2, running nutch and a set of custom apps in a mixed workload on a farm of 5 instances. On 17 Apr 2009, at 18:05, Ted Coyle wrote: Rakhi, I'd suggest going to 0.19.1. hbase and

Re: fyi: A Comparison of Approaches to Large-Scale Data Analysis: MapReduce vs. DBMS Benchmarks

2009-04-21 Thread Steve Loughran
Andrew Newman wrote: They are comparing an indexed system with one that isn't. Why is Hadoop faster at loading than the others? Surely no one would be surprised that it would be slower - I'm surprised at how well Hadoop does. Who wants to write a paper for next year, grep vs reverse index?

Re: max value for a dataset

2009-04-21 Thread Edward Capriolo
On Mon, Apr 20, 2009 at 7:24 PM, Brian Bockelman bbock...@cse.unl.edu wrote: Hey Jason, Wouldn't this be avoided if you used a combiner to also perform the max() operation?  A minimal amount of data would be written over the network. I can't remember if the map output gets written to disk

Re: Error reading task output

2009-04-21 Thread Steve Loughran
Cam Macdonell wrote: Well, for future googlers, I'll answer my own post. Watch out for the hostname at the end of localhost lines on slaves. One of my slaves was registering itself as localhost.localdomain with the jobtracker. Is there a way that Hadoop could be made to not be so

Re: getting DiskErrorException during map

2009-04-21 Thread Steve Loughran
Jim Twensky wrote: Yes, here is how it looks: <property> <name>hadoop.tmp.dir</name> <value>/scratch/local/jim/hadoop-${user.name}</value> </property> so I don't know why it still writes to /tmp. As a temporary workaround, I created a symbolic link from /tmp/hadoop-jim to

Re: Error reading task output

2009-04-21 Thread Steve Loughran
Aaron Kimball wrote: Cam, This isn't Hadoop-specific, it's how Linux treats its network configuration. If you look at /etc/host.conf, you'll probably see a line that says order hosts, bind -- this is telling Linux's DNS resolution library to first read your /etc/hosts file, then check an

Re: Multiple outputs and getmerge?

2009-04-21 Thread Todd Lipcon
On Mon, Apr 20, 2009 at 1:14 PM, Stuart White stuart.whi...@gmail.comwrote: Is this the best/only way to deal with this? It would be better if hadoop offered the option of writing different outputs to different output directories, or if getmerge offered the ability to specify a file prefix

Hadoop and Matlab

2009-04-21 Thread Sameer Tilak
Hi there, We're working on an image analysis project. The image processing code is written in Matlab. If I invoke that code from a shell script and then use that shell script within Hadoop streaming, will that work? Has anyone done something along these lines? Many thanks, --ST.

RE: Hadoop and Matlab

2009-04-21 Thread Patterson, Josh
Sameer, I'd also be interested in that as well; We are constructing a hadoop cluster for energy data (PMU) for the NERC and we will be potentially running jobs for a number of groups and researchers. I know some researchers will know nothing of map reduce, yet are very keen on MatLab, so we're

RE: Multiple outputs and getmerge?

2009-04-21 Thread Koji Noguchi
Stuart, I once used MultipleOutputFormat and created (mapred.work.output.dir)/type1/part-_ (mapred.work.output.dir)/type2/part-_ ... And JobTracker took care of the renaming to (mapred.output.dir)/type{1,2}/part-__ Would that work for you? Koji -Original

Re: Hadoop and Matlab

2009-04-21 Thread Peter Skomoroch
If you can compile the matlab code to an executable with the matlab compiler and send it to the nodes with the distributed cache that should work... You probably want to avoid licensing fees for running copies of matlab itself on the cluster. Sent from my iPhone On Apr 21, 2009, at 1:55

Re: Multiple outputs and getmerge?

2009-04-21 Thread Stuart White
On Tue, Apr 21, 2009 at 12:06 PM, Todd Lipcon t...@cloudera.com wrote: Would dfs -cat do what you need? e.g: ./bin/hdfs dfs -cat /path/to/output/ExceptionDocuments-m-\* > /tmp/exceptions-merged Yes, that would work. Thanks for the suggestion.

Re: Multiple outputs and getmerge?

2009-04-21 Thread Stuart White
On Tue, Apr 21, 2009 at 1:00 PM, Koji Noguchi knogu...@yahoo-inc.com wrote: I once used MultipleOutputFormat and created   (mapred.work.output.dir)/type1/part-_   (mapred.work.output.dir)/type2/part-_    ... And JobTracker took care of the renaming to  

Issues with retrieving HDFS directory contents in Java

2009-04-21 Thread Praveen Patnala
Resending the query with a different subject - Was : FileSystem.listStatus() doesn't return list of files in hdfs directory I have a single-node hadoop cluster. The hadoop version - [patn...@ac4-dev-ims-211]~/dev/hadoop/hadoop-0.19.1% hadoop version Hadoop 0.19.1 Subversion
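For reference, a minimal sketch of listing an HDFS directory with the 0.19 FileSystem API; the path is a placeholder, and listStatus() returning null or an empty array would indicate a missing or empty directory:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListDir {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up hadoop-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        FileStatus[] entries = fs.listStatus(new Path("/user/praveen/input"));  // placeholder path
        if (entries == null) {
          System.out.println("path does not exist");
        } else {
          for (FileStatus s : entries) {
            System.out.println(s.getPath() + "\t" + s.getLen());
          }
        }
      }
    }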

No route to host prevents from storing files to HDFS

2009-04-21 Thread Stas Oskin
Hi. I have quite a strange issue, where one of the datanodes that I have, rejects any blocks with error messages. I looked in the datanode logs, and found the following error: 2009-04-21 16:59:19,092 ERROR org.apache.hadoop.dfs.DataNode: DatanodeRegistration(192.168.253.20:50010,

Re: Copying files from HDFS to remote database

2009-04-21 Thread Dhruba Borthakur
You can use any of these: 1. bin/hadoop dfs -get hdfsfile remote filename 2. Thrift API : http://wiki.apache.org/hadoop/HDFS-APIs 3. use fuse-mount to mount hdfs as a regular file system on the remote machine: http://wiki.apache.org/hadoop/MountableHDFS thanks, dhruba On Mon, Apr 20, 2009 at
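Option 1 also has a programmatic form; a rough sketch with the FileSystem API, assuming the remote machine has the Hadoop client jars and can reach the namenode (both paths are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsGet {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Copy an HDFS file to the local filesystem of the machine running this code.
        fs.copyToLocalFile(new Path("/user/data/part-00000"),  // placeholder HDFS path
                           new Path("/tmp/part-00000"));       // placeholder local path
      }
    }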

mapred.tasktracker.map.tasks.maximum

2009-04-21 Thread javateck javateck
I set my mapred.tasktracker.map.tasks.maximum to 10, but when I run a task, it's only using 2 out of 10, any way to know why it's only using 2? thanks

Re: No route to host prevents from storing files to HDFS

2009-04-21 Thread Stas Oskin
Hi again. Other tools, like the balancer or web browsing from the namenode, don't work either. This is because other nodes complain about not reaching the offending node as well. I even tried netcat'ing the IP/port from another node - and it successfully connected. Any advice on this No route to

RE: Multiple outputs and getmerge?

2009-04-21 Thread Koji Noguchi
Something along the lines of ... class MyOutputFormat extends MultipleTextOutputFormat<Text, Text> { protected String generateFileNameForKeyValue(Text key, Text v, String name) { Path outpath = new Path(key.toString(), name); return
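A fuller sketch of the same idea, assuming the 0.19 org.apache.hadoop.mapred.lib.MultipleTextOutputFormat API; the key-to-directory mapping is illustrative rather than Koji's exact code:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    public class MyOutputFormat extends MultipleTextOutputFormat<Text, Text> {
      protected String generateFileNameForKeyValue(Text key, Text value, String name) {
        // Place each record in a subdirectory named after its key while keeping
        // the default part file name, e.g. type1/part-00000.
        return new Path(key.toString(), name).toString();
      }
    }

The job would then register it with conf.setOutputFormat(MyOutputFormat.class).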

RE: mapred.tasktracker.map.tasks.maximum

2009-04-21 Thread Koji Noguchi
It's probably a silly question, but you do have more than 2 mappers on your second job? If yes, I have no idea what's happening. Koji -Original Message- From: javateck javateck [mailto:javat...@gmail.com] Sent: Tuesday, April 21, 2009 1:38 PM To: core-user@hadoop.apache.org Subject:

Re: mapred.tasktracker.map.tasks.maximum

2009-04-21 Thread javateck javateck
Hi Koji, Thanks for helping. I don't know why hadoop is just using 2 out of 10 map task slots. Sure, I just cut and pasted from the job tracker web UI; clearly I set the max tasks to 10 (which I can verify from hadoop-site.xml and from the individual job configuration also), and I did have the first

Re: mapred.tasktracker.map.tasks.maximum

2009-04-21 Thread Miles Osborne
is your input data compressed? if so then you will get one mapper per file Miles 2009/4/21 javateck javateck javat...@gmail.com: Hi Koji, Thanks for helping. I don't know why hadoop is just using 2 out of 10 map tasks slots. Sure, I just cut and paste the job tracker web UI, clearly I

Re: mapred.tasktracker.map.tasks.maximum

2009-04-21 Thread javateck javateck
no, it's a plain text file, \t-delimited. And I'm expecting one mapper per file, because I have 175 files, and I got 189 map tasks from what I can see from the web UI. My issue is: since I have 189 map tasks waiting, why is hadoop using just 2 of my 10 map slots, and I assume that all map

Re: mapred.tasktracker.map.tasks.maximum

2009-04-21 Thread javateck javateck
I want to clarify something: for the max task slots, are these the places to check: 1. hadoop-site.xml 2. the specific job's job.conf, which can be retrieved through the job, for example, logs/job_200904212336_0002_conf.xml Any other place to limit the map task counts? In my case, it's

Re: No route to host prevents from storing files to HDFS

2009-04-21 Thread Philip Zeyliger
Very naively looking at the code, the exception you see is happening in the write path, on the way to sending a copy of your data to a second data node. One data node is pipelining the data to another, and that connection is failing. The fact that DatanodeRegistration is mentioned in the

Re: Hadoop and Matlab

2009-04-21 Thread Edward J. Yoon
Hi, What is the input data? According to my understanding, you have a lot of images and want to process all images using your matlab script. Then, you should write some code yourself. I did a similar thing for plotting graphs with gnuplot. However, if you want to do large-scale linear algebra

anyone knows why setting mapred.tasktracker.map.tasks.maximum not working?

2009-04-21 Thread javateck javateck
anyone knows why setting *mapred.tasktracker.map.tasks.maximum* not working? I set it to 10, but still see only 2 map tasks running when running one job

Re: mapred.tasktracker.map.tasks.maximum

2009-04-21 Thread Miles Osborne
they are the places to check. a job can itself over-ride the number of mappers and reducers. for example, using streaming, i often state the number of mappers and reducers i want to use: -jobconf mapred.reduce.tasks=30 this would tell hadoop to use 30 reducers, for example. if you don't have
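The Java-API counterpart of the streaming flag, as a hedged sketch with the old JobConf API; the driver class, job name, and counts are placeholder examples, and the rest of the job setup is elided:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ReducerCountExample {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ReducerCountExample.class);
        conf.setJobName("reducer-count-example");
        conf.setNumReduceTasks(30);  // equivalent of -jobconf mapred.reduce.tasks=30
        // ... mapper, reducer, input and output paths would be configured here ...
        JobClient.runJob(conf);
      }
    }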

Re: Hadoop and Matlab

2009-04-21 Thread Sameer Tilak
Hi Edward, Yes, we're building this for handling hundreds of thousands of images (at least). We're thinking the processing of individual images (or a set of images together) will be done in Matlab itself. However, we can use the Hadoop framework to process the data in parallel fashion. One Matlab instance

Re: How to run many jobs at the same time?

2009-04-21 Thread nguyenhuynh.mr
Tom White wrote: You need to start each JobControl in its own thread so they can run concurrently. Something like: Thread t = new Thread(jobControl); t.start(); Then poll the jobControl.allFinished() method. Tom On Tue, Apr 21, 2009 at 10:02 AM, nguyenhuynh.mr

Re: Hadoop and Matlab

2009-04-21 Thread Edward J. Yoon
Hi, Where are the images stored? How do you retrieve the images? If you have metadata for the images, the map task can receive a 'filename' of an image as a key, and file properties (host, file path, .., etc) as its value. Then, I guess you can handle the matlab process using a runtime object on hadoop
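A rough sketch of that runtime approach, assuming the old mapred API, a compiled image-processing executable already present on each node, and one image path per input record; the executable name and output handling are hypothetical:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class MatlabMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      public void map(LongWritable offset, Text imagePath,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        // "./process_image" stands in for the compiled Matlab binary shipped to
        // the nodes (for example via the DistributedCache).
        Process p = Runtime.getRuntime().exec(
            new String[] { "./process_image", imagePath.toString() });
        try {
          int rc = p.waitFor();
          out.collect(imagePath, new Text("exit=" + rc));
        } catch (InterruptedException e) {
          throw new IOException(e.toString());
        }
      }
    }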

Re: max value for a dataset

2009-04-21 Thread jason hadoop
There will be a short summary of the hadoop aggregation tools in ch08, it got missed in the first pass through, and is being added back in this week. There are a number of howto's in the book particularly in ch08 and ch09. I hope you enjoy them. On Tue, Apr 21, 2009 at 8:24 AM, Edward Capriolo

Re: max value for a dataset

2009-04-21 Thread jason hadoop
There is no reason to use a combiner in this case, as there is only a single output record from the map. Combiners buy you data reduction when you have output values in your map that share keys, and your application allows you to do something with the values that results in smaller/fewer records
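A sketch of that pattern with the old mapred API: each map task tracks its own maximum and emits a single record from close(), leaving nothing for a combiner to merge. The key and the assumption of one long value per input line are illustrative:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class MaxMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      private long max = Long.MIN_VALUE;
      private OutputCollector<Text, LongWritable> out;

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, LongWritable> output, Reporter reporter)
          throws IOException {
        out = output;  // keep a handle so close() can emit the final value
        long value = Long.parseLong(line.toString().trim());
        if (value > max) {
          max = value;
        }
      }

      public void close() throws IOException {
        if (out != null) {
          // One record per map task; a single reducer then takes the global max.
          out.collect(new Text("max"), new LongWritable(max));
        }
      }
    }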

Re: anyone knows why setting mapred.tasktracker.map.tasks.maximum not working?

2009-04-21 Thread jason hadoop
There must be only 2 input splits being produced for your job. Either you have 2 unsplittable files, or the input file(s) you have are not large enough compared to the block size to be split. Table 6-1 in chapter 06 gives a breakdown of all of the configuration parameters that affect split size in
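A hedged illustration of the kind of knobs involved, using the old JobConf API; the values are arbitrary examples and the precise split arithmetic is FileInputFormat's, not reproduced here:

    import org.apache.hadoop.mapred.JobConf;

    public class SplitTuning {
      // Illustrates the job-side parameters that influence how many map tasks
      // (input splits) a job gets; values here are arbitrary examples.
      public static JobConf configure(Class<?> driver) {
        JobConf conf = new JobConf(driver);
        // mapred.tasktracker.map.tasks.maximum only caps how many map tasks run
        // at once on each tasktracker (a hadoop-site.xml setting on the nodes);
        // it does not create tasks. The split computation does:
        conf.setNumMapTasks(20);                   // a hint; more splits for a given input size
        conf.setLong("mapred.min.split.size", 1);  // lower bound on split size in bytes
        return conf;
      }
    }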

Re: getting DiskErrorException during map

2009-04-21 Thread jason hadoop
For reasons that I have never bothered to investigate I have never had a cluster work when the hadoop.tmp.dir was not identical on all of the nodes. My solution has always been to just make a symbolic link so that hadoop.tmp.dir was identical and on the machine in question really ended up in the

Re: getting DiskErrorException during map

2009-04-21 Thread Brian Bockelman
Hey Jason, We've never had the hadoop.tmp.dir identical on all our nodes. Brian On Apr 22, 2009, at 10:54 AM, jason hadoop wrote: For reasons that I have never bothered to investigate I have never had a cluster work when the hadoop.tmp.dir was not identical on all of the nodes. My

How to access data node without a passphrase?

2009-04-21 Thread Yabo-Arber Xu
Hi there, I set up a small cluster for testing. When I start my cluster on my master node, I have to type the password for starting each datanode and tasktracker. That's pretty annoying and may be hard to handle when the cluster grows. Any graceful way to handle this? Best, Arber

Re: How to access data node without a passphrase?

2009-04-21 Thread Amit Saha
On Wed, Apr 22, 2009 at 9:26 AM, Yabo-Arber Xu arber.resea...@gmail.com wrote: Hi there, I setup a small cluster for testing. When I start my cluster on my master node, I have to type the password for starting each datanode and tasktracker. That's pretty annoying and may be hard to handle

Re: How to access data node without a passphrase?

2009-04-21 Thread Alex Loddengaard
I would recommend installing the Hadoop RPMs and avoiding the start-all scripts altogether. The RPMs ship with init scripts, allowing you to start and stop daemons with /sbin/service (or with a configuration management tool, which I assume you'll be using as your cluster grows). Here's more info

RE: How to access data node without a passphrase?

2009-04-21 Thread Puri, Aseem
Arber, A. You have to first set up authorization keys 1. Execute the following command to generate keys: ssh-keygen 2. When prompted for filenames and passphrases press ENTER to accept default values. 3. After the command has finished generating keys, enter the following command to change into

Re: How to access data node without a passphrase?

2009-04-21 Thread Yabo-Arber Xu
Thanks for all your help, especially Aseem's detailed instructions. It works now! Alex: I did not use RPMs, but several of my existing nodes are installed with Ubuntu. Is there any difference in running Hadoop on Ubuntu? I am thinking of choosing one before I start scaling up the cluster, but not