Re: file/directory sizes

2008-02-21 Thread Ted Dunning
It is definitely better to combine files into larger ones, if only to make sure that you use sequential reads as much as possible. On 2/21/08 9:48 PM, "Steve Sapovits" <[EMAIL PROTECTED]> wrote: > Amar Kamat wrote: > >> File sizes and number of files (assuming that's what you want to tweak) >>

Re: Sorting output data on value

2008-02-21 Thread Ted Dunning
But this only guarantees that the results will be sorted within each reducer's input. Thus, this won't result in getting the results sorted by the reducer's output value. On 2/21/08 8:40 PM, "Owen O'Malley" <[EMAIL PROTECTED]> wrote: > > On Feb 21, 2008, at 5:47 PM, Ted Dunning wrote: > >> It

RE: Questions regarding configuration parameters...

2008-02-21 Thread C G
My performance problems fall into 2 categories: 1. Extremely slow reduce phases - our map phases march along at impressive speed, but during reduce phases most nodes go idle...the active machines mostly clunk along at 10-30% CPU. Compare this to the map phase where I get all grid nodes c

Re: file/directory sizes

2008-02-21 Thread Steve Sapovits
Amar Kamat wrote: File sizes and number of files (assuming that's what you want to tweak) is not much of a concern for map-reduce. What ultimately matters is the dfs-block-size and split-size. The basic unit of replication in DFS is the block while the basic processing unit for map-reduce is th

Re: file/directory sizes

2008-02-21 Thread Amar Kamat
File sizes and number of files (assuming that's what you want to tweak) is not much of a concern for map-reduce. What ultimately matters is the dfs-block-size and split-size. The basic unit of replication in DFS is the block while the basic processing unit for map-reduce is the split. Other para

Problems running a HOD test cluster

2008-02-21 Thread Luca
Hello everyone, I've been trying to run HOD on a sample cluster with three nodes that already have Torque installed and (hopefully?) properly working. I also prepared a configuration file for hod, that I'm gonna paste at the end of this email. A few questions: - is Java6 ok for HOD? - I have

Re: Sorting output data on value

2008-02-21 Thread Owen O'Malley
On Feb 21, 2008, at 5:47 PM, Ted Dunning wrote: It may be sorted within the output for a single reducer and, indeed, you can even guarantee that it is sorted but *only* by the reduce key. The order that values appear will not be deterministic. Actually, there is a better answer for this.

file/directory sizes

2008-02-21 Thread Steve Sapovits
I'm looking for any information on "best" type Hadoop configurations, in terms of numbers of files, numbers of files per directory, and file sizes (e.g., are lots of small files more of a problem than fewer larger ones, etc.). Any pointers to documentation or experience feedback appreciated.

Re: Problem with LibHDFS

2008-02-21 Thread Raghavendra K
I tried even that and the output is Program received signal SIGSEGV, Segmentation fault. 0x0001 in ?? () (gdb) bt #0 0x0001 in ?? () (gdb) It's the same thing... don't know what to do. On Thu, Feb 21, 2008 at 10:11 PM, Jaideep Dhok <[EMAIL PROTECTED]> wrote: > Type 'bt' on the gdb p

Re: Hadoop summit / workshop at Yahoo!

2008-02-21 Thread Ian Holsman
Derek Anderson wrote: seconded. (or thirded? :) the commute from dallas is rough. Dallas is around the corner. I've got people in Bangalore, Ireland and Australia who this would be useful for ;-) Tim Wintle wrote: I would certainly appreciate being able to watch them online too, and th

Re: Sorting output data on value

2008-02-21 Thread Ted Dunning
It may be sorted within the output for a single reducer and, indeed, you can even guarantee that it is sorted but *only* by the reduce key. The order that values appear will not be deterministic. To sort by value, you need to run another MR job with the count from the first step as the key and
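Ted's recipe — a second job whose map emits the first job's count as the key — can be sketched outside Hadoop in a few lines of plain Python, with ordinary functions standing in for the two jobs. None of these names are Hadoop APIs; this only illustrates the key/value swap at the heart of the pattern:

```python
# Sketch of the two-pass "sort by value" pattern: pass 1 produces
# (word, count) pairs; pass 2 swaps each pair so the count becomes
# the sort key. Plain Python stands in for Hadoop here.

def first_job(records):
    """Word count: emit (word, count) pairs, as the first reducer would."""
    counts = {}
    for word in records:
        counts[word] = counts.get(word, 0) + 1
    return list(counts.items())

def second_job(pairs):
    """Swap (word, count) -> (count, word) and sort descending by count,
    mirroring a second MR job whose map emits the count as the key."""
    return sorted(((count, word) for word, count in pairs), reverse=True)

output = second_job(first_job(["xyz"] * 3 + ["abc"] + ["xyz"] * 2))
# output is sorted by count, largest first: [(5, 'xyz'), (1, 'abc')]
```

Within real Hadoop, the descending order would come from a custom comparator on the key; the `reverse=True` here is the local stand-in for that.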

Re: Sorting output data on value

2008-02-21 Thread Tarandeep Singh
On Thu, Feb 21, 2008 at 5:34 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > > Use another job step to get the sort done. > but isn't the output of reduce step sorted ? Also can I specify that sort be done in reverse order ? > > > On 2/21/08 5:11 PM, "Tarandeep Singh" <[EMAIL PROTECTED]> wrote: >

Re: Sorting output data on value

2008-02-21 Thread Ted Dunning
Use another job step to get the sort done. On 2/21/08 5:11 PM, "Tarandeep Singh" <[EMAIL PROTECTED]> wrote: > On Thu, Feb 21, 2008 at 3:46 PM, Tarandeep Singh <[EMAIL PROTECTED]> wrote: >> hi, >> >> Can I sort the output of reducer based on the value instead of key. >> Also can I specify that

Re: Sorting output data on value

2008-02-21 Thread Tarandeep Singh
On Thu, Feb 21, 2008 at 3:46 PM, Tarandeep Singh <[EMAIL PROTECTED]> wrote: > hi, > > Can I sort the output of reducer based on the value instead of key. > Also can I specify that the output should be sorted in decreasing order ? > > Mapper output - > > > Reducer gets- > > > and outputs

Re: Python access to HDFS

2008-02-21 Thread Steve Sapovits
Roddy Lindsay wrote: I do it the old fashioned way: (w, r) = os.popen2("%s/bin/hadoop dfs -cat %s" % (hadoop_home.rstrip('/'), filename)) I considered this but ultimately it probably won't scale for our data volume. I'll probably continue building on the SWIG base since that's working pret

RE: Python access to HDFS

2008-02-21 Thread dhruba Borthakur
Hi Pete, If you are referring to the ability to re-open a file and append to it, then this feature is not in 0.16. Please see: http://issues.apache.org/jira/browse/HADOOP-1700 Thanks, dhruba -Original Message- From: Pete Wyckoff [mailto:[EMAIL PROTECTED] Sent: Thursday, February 21, 200

RE: Python access to HDFS

2008-02-21 Thread Roddy Lindsay
I do it the old fashioned way: (w, r) = os.popen2("%s/bin/hadoop dfs -cat %s" % (hadoop_home.rstrip('/'), filename)) -Original Message- From: Pete Wyckoff [mailto:[EMAIL PROTECTED] Sent: Thu 2/21/2008 4:08 PM To: core-user@hadoop.apache.org Subject: Re: Python access to HDFS We're p
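Roddy's `os.popen2` call works, but `os.popen2` was deprecated in favour of the `subprocess` module. A minimal sketch of the same shell-out, with the command prefix parameterized (the default hadoop CLI path mirrors Roddy's code; `cmd_prefix` is an illustrative knob, not a Hadoop convention):

```python
# Reading an HDFS file by shelling out to the hadoop CLI, as Roddy does,
# but via subprocess instead of the deprecated os.popen2. cmd_prefix is
# parameterized so the helper can be exercised with any cat-like command.
import subprocess

def dfs_cat(filename, hadoop_home=None, cmd_prefix=None):
    """Return the bytes of `filename` by running `<prefix> <filename>`.

    By default runs `$HADOOP_HOME/bin/hadoop dfs -cat <filename>`;
    pass cmd_prefix to substitute another command.
    """
    if cmd_prefix is None:
        cmd_prefix = ["%s/bin/hadoop" % hadoop_home.rstrip("/"), "dfs", "-cat"]
    result = subprocess.run(cmd_prefix + [filename],
                            capture_output=True, check=True)
    return result.stdout
```

As Steve notes, forking a JVM per read is unlikely to scale to high data volumes; this is the quick-and-dirty option next to SWIG/libhdfs bindings.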

Re: Python access to HDFS

2008-02-21 Thread Pete Wyckoff
We're profiling and tuning read performance for fuse dfs and have writes implemented, but I haven't been able to test it even as I haven't tried 0.16 yet - It requires the ability to create the file, close it and then re-open it to start writing - which can't be done till 16. --pete On 2/21/

Re: Python access to HDFS

2008-02-21 Thread Steve Sapovits
Jeff Hammerbacher wrote: maybe the dfs could expose a thrift interface in future releases? ThruDB exposes Lucene via Thrift but not the underlying HDFS. I just need HDFS access in Python for now. you could also use the FUSE module to mount the dfs and just write to it like any other filesy

Sorting output data on value

2008-02-21 Thread Tarandeep Singh
hi, Can I sort the output of reducer based on the value instead of key. Also can I specify that the output should be sorted in decreasing order ? Mapper output - Reducer gets- and outputs - e.g abc 10 xyz 100 I want the output to be sorted based on the value and that too in decrea

Re: Python access to HDFS

2008-02-21 Thread Jeff Hammerbacher
maybe the dfs could expose a thrift interface in future releases? you could also use the FUSE module to mount the dfs and just write to it like any other filesystem... On Thu, Feb 21, 2008 at 1:23 PM, Steve Sapovits <[EMAIL PROTECTED]> wrote: > > Are there any existing HDFS access packages out t

Working with external libraries?

2008-02-21 Thread Chang Hu
Hi, I have an image processing library in C++ and want to run it as a MapReduce job via JNI. While I have some idea about how to include an external JAR into MapReduce, I am not sure how that works with external C++ libraries. It could be easier to use HadoopStreaming, but I am not sure how to do

Re: changes to compression interfaces in 0.15?

2008-02-21 Thread Ted Dunning
The principles are pretty simple: If the semantics change significantly, then the name should change. Conversely, if the name doesn't change the semantics shouldn't change. That is, unless the changes fix seriously broken old semantics or extend old semantics in a way that old calls don't change

How to split the hdfs in different subgroups

2008-02-21 Thread xavier.quintuna
Hi There, I have an HDFS cluster and I want to split it into two groups. Each group has a set of datanodes. I want my client (hdfshell) to be able to write to only one group. One group is in one rack and my other group is in the other rack. Replication between racks is allowed but the client

RE: Questions regarding configuration parameters...

2008-02-21 Thread Joydeep Sen Sarma
> The default values are 2 so you might only see 2 cores used by Hadoop per > node/host. that's 2 each for map and reduce. so theoretically - one could fully utilize a 4 core box with this setting. in practice - a little bit of oversubscription (3 each on a 4 core) seems to be working out well f

Re: Questions regarding configuration parameters...

2008-02-21 Thread Andy Li
Try the 2 parameters to utilize all the cores per node/host. mapred.tasktracker.map.tasks.maximum 7 The maximum number of map tasks that will be run simultaneously by a task tracker. mapred.tasktracker.reduce.tasks.maximum 7 The maximum number of reduce tasks that will be run
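A sketch of how these two properties would look in the per-cluster config file (hadoop-site.xml in this era); the value 7 is the figure from Andy's mail, and per Joydeep's follow-up the map and reduce slots are counted separately, so tune both to your core count:

```xml
<!-- hadoop-site.xml: per-tasktracker task slots (values from Andy's mail) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>7</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>7</value>
</property>
```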

Re: define backwards compatibility

2008-02-21 Thread Doug Cutting
Joydeep Sen Sarma wrote: i find the confusion over what backwards compatibility means scary - and i am really hoping that the outcome of this thread is a clear definition from the committers/hadoop-board of what to reasonably expect (or not!) going forward. The goal is clear: code that compil

Python access to HDFS

2008-02-21 Thread Steve Sapovits
Are there any existing HDFS access packages out there for Python? I've had some success using SWIG and the C HDFS code, as documented here: http://www.stat.purdue.edu/~sguha/code.html (halfway down the page) but it's slow adding support for some of the more complex functions. If there's a

Question on clusters split by WAN segments

2008-02-21 Thread Jason Venner
Does anyone run clusters split over WAN segments, and if so, do they have any tips for minimizing issues? -- Jason Venner Attributor - Publish with Confidence Attributor is hiring Hadoop Wranglers, contact if interested

define backwards compatibility (was: changes to compression interfaces in 0.15?)

2008-02-21 Thread Joydeep Sen Sarma
Arun - if you can't pull the api - then you must redirect the api to the new call that preserves its semantics. in this case - had we re-implemented SequenceFile.setCompressionType in 0.15 to call SequenceFileOutputFormat.setOutputCompressionType() - then it would have been a backwards compatibl

Re: changes to compression interfaces in 0.15?

2008-02-21 Thread Pete Wyckoff
If the API semantics are changing under you, you have to change your code whether or not the API is pulled or deprecated. Pulling it makes it more obvious that the user has to change his/her code. -- pete On 2/21/08 12:41 PM, "Arun C Murthy" <[EMAIL PROTECTED]> wrote: > > On Feb 21, 2008, at

Re: changes to compression interfaces in 0.15?

2008-02-21 Thread Arun C Murthy
On Feb 21, 2008, at 12:20 PM, Joydeep Sen Sarma wrote: To maintain backward compat, we cannot remove old apis - the standard procedure is to deprecate them for the next release and remove them in subsequent releases. you've got to be kidding. we didn't maintain backwards compatibility. my ap

Re: Hadoop summit / workshop at Yahoo!

2008-02-21 Thread Derek Anderson
seconded. (or thirded? :) the commute from dallas is rough. Tim Wintle wrote: I would certainly appreciate being able to watch them online too, and they would help spread the word about hadoop - think of all the people who watch Google's Techtalks (am I allowed to say the "G" word around here

RE: Hadoop summit / workshop at Yahoo!

2008-02-21 Thread Ajay Anand
We do plan to make the video available online after the event. Ajay -Original Message- From: Tim Wintle [mailto:[EMAIL PROTECTED] Sent: Thursday, February 21, 2008 12:22 PM To: core-user@hadoop.apache.org Subject: Re: Hadoop summit / workshop at Yahoo! I would certainly appreciate being

Re: Hadoop summit / workshop at Yahoo!

2008-02-21 Thread Tim Wintle
I would certainly appreciate being able to watch them online too, and they would help spread the word about hadoop - think of all the people who watch Google's Techtalks (am I allowed to say the "G" word around here?). On Thu, 2008-02-21 at 08:34 +0100, Lukas Vlcek wrote: > Online webcast/record

RE: changes to compression interfaces in 0.15?

2008-02-21 Thread Joydeep Sen Sarma
> To maintain backward compat, we cannot remove old apis - the standard > procedure is to deprecate them for the next release and remove them > in subsequent releases. you've got to be kidding. we didn't maintain backwards compatibility. my app broke. Simple and straightforward. and the old in

Re: Question on metrics via ganglia solved

2008-02-21 Thread Jason Venner
Instead of localhost, in the servers block, we now put the machine that has gmetad running.
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=GMETAD_HOST:8649
Jason Venner wrote: Well, with the metrics file changed to perform file based logging, metrics do a

Re: how to set the result of the first mapreduce program as the input of the second mapreduce program?

2008-02-21 Thread Arun C Murthy
On Feb 20, 2008, at 10:21 PM, ma qiang wrote: Hi all: Here I have two mapreduce programs. I need to use the result of the first mapreduce program to compute other values which are generated in the second mapreduce program, and this intermediate result does not need to be saved, so I want to run the s

Re: Add your project or company to the powered by page?

2008-02-21 Thread Allen Wittenauer
On 2/21/08 11:34 AM, "Jeff Hammerbacher" <[EMAIL PROTECTED]> wrote: > yeah, i've heard those facebook groups can be a great way to get the word > out... > > anyways, just got approval yesterday for a 320 node cluster. each node has > 8 cores and 4 TB of raw storage so this guy is gonna be pretty

Re: Add your project or company to the powered by page?

2008-02-21 Thread Jeff Hammerbacher
yeah, i've heard those facebook groups can be a great way to get the word out... anyways, just got approval yesterday for a 320 node cluster. each node has 8 cores and 4 TB of raw storage so this guy is gonna be pretty powerful. can we claim largest cluster outside of yahoo? On Thu, Feb 21, 2008

Re: Add your project or company to the powered by page?

2008-02-21 Thread Paco NATHAN
More on the subject of outreach, not specific uses at companies, but... A couple things might help get the word out: - Add a community group in LinkedIn (shows up on profile searches) http://www.linkedin.com/static?key=groups_faq - Add a link on the wiki to the Facebook group about

Re: changes to compression interfaces in 0.15?

2008-02-21 Thread Arun C Murthy
Joydeep, On Feb 20, 2008, at 5:06 PM, Joydeep Sen Sarma wrote: Hi developers, In migrating to 0.15 - i am noticing that the compression interfaces have changed: - compression type for sequencefile outputs used to be set by: SequenceFile.setCompressionType() - now it seem

Re: how to set the result of the first mapreduce program as the input of the second mapreduce program?

2008-02-21 Thread Amar Kamat
Output of every mapreduce job in Hadoop gets stored in the DFS, i.e. made visible. You can run back-to-back jobs (i.e. job chaining) but the output won't be temporary. Look at Grep.java as Hairong suggested for more details on job chaining. As of now there is no support for job chaining in Hadoop.
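The driver-side chaining pattern that Amar and Hairong describe — job 2 reads job 1's output directory, and the driver cleans up the intermediate afterwards — can be sketched with local files standing in for DFS paths and plain functions standing in for `JobClient.runJob` with configured JobConfs (the function names here are illustrative, not Hadoop APIs):

```python
# Driver-side job chaining, as in Grep.java: run job 1 into an
# intermediate directory, point job 2 at that directory, then remove
# the intermediate output once job 2 has consumed it.
import os
import shutil
import tempfile

def run_job_one(input_lines, output_dir):
    # Stand-in for the first MR job: write its "part" output file.
    with open(os.path.join(output_dir, "part-00000"), "w") as f:
        for line in input_lines:
            f.write(line.upper() + "\n")

def run_job_two(input_dir):
    # Stand-in for the second MR job: read the first job's output.
    with open(os.path.join(input_dir, "part-00000")) as f:
        return [line.strip() for line in f]

def chain(input_lines):
    tmp = tempfile.mkdtemp(prefix="intermediate-")
    try:
        run_job_one(input_lines, tmp)
        return run_job_two(tmp)   # job 2's input is job 1's output dir
    finally:
        shutil.rmtree(tmp)        # intermediate result is not kept

result = chain(["foo", "bar"])
# result == ["FOO", "BAR"]
```

As Amar says, on a real cluster the intermediate output still lands in DFS and is visible between the jobs; the cleanup step is the driver's responsibility, not Hadoop's.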

Re: Question on metrics via ganglia

2008-02-21 Thread Jason Venner
Well, with the metrics file changed to perform file based logging, metrics do appear. On digging into the GangliaContext source, it looks like it is using udp for reporting, and we modified the gmond.conf to receive via udp as well as tcp. netstat -a -p shows gmond monitoring 8649 for both tcp a

Re: Add your project or company to the powered by page?

2008-02-21 Thread Doug Cutting
Dennis Kubes wrote: * [http://alpha.search.wikia.com Search Wikia] * A project to help develop open source social search tools. We run a 125 node hadoop cluster. Done. Doug

Re: Add your project or company to the powered by page?

2008-02-21 Thread Dennis Kubes
* [http://alpha.search.wikia.com Search Wikia] * A project to help develop open source social search tools. We run a 125 node hadoop cluster. Derek Gottfrid wrote: The New York Times / nytimes.com -large scale image conversions -http://open.blogs.nytimes.com/2007/11/01/self-service-prorate

Re: Questions about namenode and JobTracker configuration.

2008-02-21 Thread Amar Kamat
Zhang, jian wrote: Hi, All I have a small question about configuration. In Hadoop Documentation page, it says " Typically you choose one machine in the cluster to act as the NameNode and one machine as to act as the JobTracker, exclusively. The rest of the machines act as both a DataN

Re: Hadoop summit / workshop at Yahoo!

2008-02-21 Thread John Heidemann
On Wed, 20 Feb 2008 12:10:09 PST, "Ajay Anand" wrote: >The registration page for the Hadoop summit is now up: >http://developer.yahoo.com/hadoop/summit/ >... >Agenda: Ajay, when we talked about the summit on the phone, you were considering having a poster session. I don't see that listed. Shoul

Re: Namenode fails to re-start after cluster shutdown

2008-02-21 Thread Robert Chansler
Thanks for helping the gentle user! On 21 02 08 10:38, "Raghu Angadi" <[EMAIL PROTECTED]> wrote: > > Please file a jira (let me know if need help with that). Did subsequent > tries to restart succeed? > > Thanks, > Raghu. > > André Martin wrote: >> Hi everyone, >> I downloaded the nightly bui

Re: Namenode fails to re-start after cluster shutdown

2008-02-21 Thread Raghu Angadi
Please file a jira (let me know if need help with that). Did subsequent tries to restart succeed? Thanks, Raghu. André Martin wrote: Hi everyone, I downloaded the nightly build (see below) yesterday and after the cluster worked fine for about 10 hours I got the following error message from

Question on metrics via ganglia

2008-02-21 Thread Jason Venner
We have modified my metrics file, distributed it and restarted our cluster. We have gmond running on the nodes, and a machine on the vlan with gmetad running. We have statistics for the machines in the web ui, and our statistics reported by the gmetric program are present. We don't see any hadoo

Re: how to set the result of the first mapreduce program as the input of the second mapreduce program?

2008-02-21 Thread Hairong Kuang
Take a look at Grep.java under src/examples/org/apache/hadoop/examples. It first runs a grep job and then a sort job. Hairong On 2/20/08 10:21 PM, "ma qiang" <[EMAIL PROTECTED]> wrote: > Hi all: > Here I have two mapreduce program.I need to use the result of the > first mapreduce program t

Re: how to set the result of the first mapreduce program as the input of the second mapreduce program?

2008-02-21 Thread Paco NATHAN
Hi Qiang, Here is what I understand: Pass 1 - generate "intermediate dataset" as output from its reduce phase Pass 2 - take "intermediate dataset" as input - produce some result (an aggregate?) - no need to persist the "intermediate dataset" Would it be possible to collapse this in

Re: Add your project or company to the powered by page?

2008-02-21 Thread Eric Baldeschwieler
done On Feb 21, 2008, at 2:18 AM, André Martin wrote: Hi Eric, here you go: SEDNS Group - http://wwwse.inf.tu-dresden.de/SEDNS We are gathering world wide DNS data in order to discover content distribution networks and configuration issues utilizing Hadoop DFS and MapRed. Cu on the 'net,

Re: java error

2008-02-21 Thread Ted Dunning
In general, you will find it much harder to deploy any hadoop system under cygwin than under linux. On 2/20/08 8:13 PM, "Jaya Ghosh" <[EMAIL PROTECTED]> wrote: > Hello, > > > > As per my earlier mails I could not deploy Nutch on Linux . Now am > attempting the same using cygwin as per the t

java error

2008-02-21 Thread Jaya Ghosh
Hello, As per my earlier mails I could not deploy Nutch on Linux. Now I am attempting the same using cygwin as per the tutorial by Peter Wang. Can someone from the list help me resolve the attached error? At least on Linux I could run the crawl. java.lang.NoClassDefFoundError: org/apache/hado

Re: Problem with LibHDFS

2008-02-21 Thread Jaideep Dhok
Type 'bt' on the gdb prompt after you get the segfault. It will direct you to the line where segfault occurred. - Jaideep On Thu, Feb 21, 2008 at 9:52 PM, Raghavendra K <[EMAIL PROTECTED]> wrote: > When I try with gdb, i receive the following output > (gdb) r > Starting program: > /garl/garl-alpha

Re: Problem with LibHDFS

2008-02-21 Thread Raghavendra K
When I try with gdb, I receive the following output (gdb) r Starting program: /garl/garl-alpha1/home1/raghu/Desktop/hadoop-0.15.3/src/c++/libhdfs/hdfs_test Program received signal SIGSEGV, Segmentation fault. 0x0001 in ?? () Don't know what to make of it. On Thu, Feb 21, 2008 at 5:06 PM,

Re: Add your project or company to the powered by page?

2008-02-21 Thread Derek Gottfrid
The New York Times / nytimes.com -large scale image conversions -http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun/ On Thu, Feb 21, 2008 at 1:26 AM, Eric Baldeschwieler <[EMAIL PROTECTED]> wrote: > Hi Folks, > > Let's get the word out that Hadoop is being used and

Namenode fails to re-start after cluster shutdown

2008-02-21 Thread André Martin
Hi everyone, I downloaded the nightly build (see below) yesterday and after the cluster worked fine for about 10 hours I got the following error message from the DFS client even though all data nodes were up: 08/02/21 14:04:35 INFO fs.DFSClient: Could not obtain block blk_-4008950704646490788 from a

Re: Problem with LibHDFS

2008-02-21 Thread Miles Osborne
Since you are compiling a C(++) program, why not add the -g switch and run it within gdb: that will tell people which line it crashes at (etc etc) Miles On 21/02/2008, Raghavendra K <[EMAIL PROTECTED]> wrote: > > Hi, > I am able to get Hadoop running and also able to compile the libhdfs. > But

Problem with LibHDFS

2008-02-21 Thread Raghavendra K
Hi, I am able to get Hadoop running and also able to compile the libhdfs. But when I run the hdfs_test program it is giving Segmentation Fault. Just a small program like this #include "hdfs.h" int main() { return(0); } and compiled using the command gcc -ggdb -m32 -I/garl/garl-alpha1/home1/raghu/

Re: Add your project or company to the powered by page?

2008-02-21 Thread André Martin
Hi Eric, here you go: SEDNS Group - http://wwwse.inf.tu-dresden.de/SEDNS We are gathering world wide DNS data in order to discover content distribution networks and configuration issues utilizing Hadoop DFS and MapRed. Cu on the 'net, Bye - bye,

Re: Add your project or company to the powered by page?

2008-02-21 Thread Eric Baldeschwieler
done On Feb 20, 2008, at 11:33 PM, Miles Osborne wrote: Please could you add this text: At ICCS http://www.iccs.informatics.ed.ac.uk/ We are using Hadoop and Nutch to crawl Blog posts and later process them. Hadoop is also beginning to be used in our teaching and general research act

Questions about namenode and JobTracker configuration.

2008-02-21 Thread Zhang, jian
Hi, All I have a small question about configuration. In Hadoop Documentation page, it says " Typically you choose one machine in the cluster to act as the NameNode and one machine as to act as the JobTracker, exclusively. The rest of the machines act as both a DataNode and TaskTracker and