Re: Hadoop on EC2 - public AMIs in hadoop-images

2009-09-07 Thread Todd Lipcon
release (CDH2) we're now also running jdiff between the stock Apache release and our own so as to verify the above guarantee. -Todd 2009/9/7 Todd Lipcon t...@cloudera.com Hi, The EC2 scripts will boot Cloudera's distribution for Hadoop. Currently they boot our distribution

Re: tmpfiles and tmpjars do not get unarchived

2009-08-24 Thread Todd Lipcon
Hi Zheng, The DistributedCache.addArchiveToClasspath call is the one that makes it get unarchived into the temp directory. By contrast, addFileToClasspath doesn't. I don't remember the old-style command line flag to trigger this call... perhaps -archives or something? Worth noting that -libjars
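For reference, a minimal sketch of the two calls Todd contrasts, assuming the 0.20-era org.apache.hadoop.filecache.DistributedCache API (spelled addArchiveToClassPath / addFileToClassPath there); the HDFS paths are hypothetical:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;

    public class CacheClasspathSetup {
        public static void configure(Configuration conf) throws IOException {
            // An archive added this way is unpacked into the task's local
            // temp directory before being placed on the classpath.
            DistributedCache.addArchiveToClassPath(
                new Path("/user/zheng/lib/mylib.zip"), conf);

            // A plain file is added to the classpath as-is; it is never
            // unarchived.
            DistributedCache.addFileToClassPath(
                new Path("/user/zheng/lib/single.jar"), conf);
        }
    }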

Re: Cluster Disk Usage

2009-08-21 Thread Todd Lipcon
Hi Arvind, Check the source code in DFSAdmin which handles dfsadmin -report. It uses the same API that the namenode web UI does - I think it's called getClusterStatus or something if my memory serves me correctly. Here's example output on my pseudodistributed cluster: Datanodes available: 1 (1
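For reference, a minimal sketch that pulls the same per-datanode numbers programmatically, assuming the 0.20-era DistributedFileSystem#getDataNodeStats() API that DFSAdmin's -report handler uses; the output formatting is illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    public class ClusterUsageReport {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            DistributedFileSystem dfs =
                (DistributedFileSystem) FileSystem.get(conf);

            DatanodeInfo[] nodes = dfs.getDataNodeStats();
            System.out.println("Datanodes available: " + nodes.length);
            for (DatanodeInfo node : nodes) {
                // Capacity, used, and remaining are reported in bytes.
                System.out.println(node.getHostName()
                    + " capacity=" + node.getCapacity()
                    + " used=" + node.getDfsUsed()
                    + " remaining=" + node.getRemaining());
            }
        }
    }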

Re: fuse-dfs then samba mount

2009-08-13 Thread Todd Lipcon
On Thu, Aug 13, 2009 at 12:04 AM, Manhee Jo j...@nttdocomo.com wrote: Hi all, I've succeeded in sharing hdfs files from windows xp through fuse-dfs then samba mount. When I tried to copy (read and write) 1GB text file from fuse-dfs over samba, it took around 50 secs. Then, I tried dfs get

Re: HADOOP-4539 question

2009-08-13 Thread Todd Lipcon
On Thu, Aug 13, 2009 at 10:37 AM, Konstantin Shvachko s...@yahoo-inc.com wrote: Steve, There are other groups that claimed they work on an HA solution. We had discussions about it not so long ago on this list. Is it possible that your colleagues present their design? As you point out the issue gets

Re: Intermediary Data on Fair Scheduler

2009-08-13 Thread Todd Lipcon
Hi Mithila, I assume you're referring to fair scheduler preemption. In the preemption scenario, tasks are completely killed, not paused. It's not like a preemptive scheduler in your OS where things are context switched. This is why the preemption is not enabled by default and has tuning

Re: Intermediary Data on Fair Scheduler

2009-08-13 Thread Todd Lipcon
. Intermediate data from the big job will be on the local disk like it always is - this isn't anything special about the fair scheduler. Map outputs remain in mapred.local.dir until the job is complete. -Todd On Thu, Aug 13, 2009 at 10:52 AM, Todd Lipcon t...@cloudera.com wrote: Hi Mithila, I

Re: What OS?

2009-08-13 Thread Todd Lipcon
On Thu, Aug 13, 2009 at 8:58 PM, Bogdan M. Maryniuk bogdan.maryn...@gmail.com wrote: Also make sure you tuned TCP/IP stack, which is by default too conservative. Any pointers on this? Would be interesting to see before/after tuning benchmarks as well. Assuming this is a runtime tunable

Re: NN + secondary got full, even though data nodes had plenty of space

2009-08-12 Thread Todd Lipcon
Hi Mayuran, Do you do all of your uploads of data into your Hadoop cluster from node001 and node002? If so, keep in mind that one of your replicas will always be written on localhost in the case that it is part of the cluster. You should consider running the rebalancer to even up your space

Re: HADOOP-4539 question

2009-08-12 Thread Todd Lipcon
things. Thanks, --Konstantin Todd Lipcon wrote: On Wed, Aug 12, 2009 at 3:42 AM, Stas Oskin stas.os...@gmail.com wrote: Hi. You can also use a utility like Linux-HA (aka heartbeat) to handle IP address failover. It will even send gratuitous ARPs to make sure to get the new mac

Re: HADOOP-4539 question

2009-08-11 Thread Todd Lipcon
Hey Stas, You can also use a utility like Linux-HA (aka heartbeat) to handle IP address failover. It will even send gratuitous ARPs to make sure to get the new mac address registered after a failover. Check out this blog for info about a setup like this:

Re: Extra 4 bytes at beginning of serialized file

2009-08-11 Thread Todd Lipcon
BytesWritable serializes itself by first outputting the array length, and then outputting the array itself. The 4 bytes at the top of the file are the length of the value itself. Hope that helps -Todd On Tue, Aug 11, 2009 at 6:33 PM, Kris Jirapinyo kjirapi...@biz360.com wrote: Hi all, I was
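A small self-contained sketch demonstrating the 4-byte length prefix Todd describes; the payload is hypothetical:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import org.apache.hadoop.io.BytesWritable;

    public class LengthPrefixDemo {
        public static void main(String[] args) throws Exception {
            byte[] payload = "hello".getBytes();
            BytesWritable value = new BytesWritable(payload);

            // Serialize the writable exactly as it would be written to a file.
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            value.write(new DataOutputStream(bytes));

            // The first int in the stream is the payload length...
            DataInputStream in = new DataInputStream(
                new ByteArrayInputStream(bytes.toByteArray()));
            System.out.println("length prefix = " + in.readInt()); // prints 5
            // ...followed by the payload bytes themselves.
        }
    }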

Re: network interface trunking

2009-08-05 Thread Todd Lipcon
Hi Ryan, Yes, you can do this -- the technique is called interface bonding and isn't too hard to set up in Linux as long as your switch supports it. However, it is pretty rare that it provides an appreciable performance benefit on typical hardware and workloads -- probably not worth the doubled switch

Re: Map performance with custom binary format

2009-07-30 Thread Todd Lipcon
On Thu, Jul 30, 2009 at 11:39 AM, Scott Carey sc...@richrelevance.com wrote: Use the deadline scheduler: # echo 'deadline' > /sys/block/sda/queue/scheduler (for each device) Have you found the deadline scheduler to be significantly better than the default cfq? I've used deadline for RDBMS

Re: To retrieve data on dead node

2009-07-29 Thread Todd Lipcon
On Wed, Jul 29, 2009 at 8:51 AM, bhushan_mahale bhushan_mah...@persistent.co.in wrote: Hi, What are the possible ways to retrieve the data if a node goes down in a Hadoop cluster? Assuming a replication factor of 3, and 3 nodes go down in a 10-node cluster, how do we retrieve the data?

Re: A few questions about Hadoop and hard-drive failure handling.

2009-07-23 Thread Todd Lipcon
On Thu, Jul 23, 2009 at 11:56 AM, Ryan Smith ryan.justin.sm...@gmail.comwrote: I was wondering if someone could give me some answers or maybe some pointers where to look in the code. All these questions are in the same vein of hard drive failure. Question 1: If a master (system

Re: .tar.gz codec class implementation

2009-07-21 Thread Todd Lipcon
Hi Andraz, First, thanks for the contribution. Could you create a JIRA ticket and upload the code there? Due to ASF restrictions, all contributions must be attached to a JIRA so you can officially grant permission to include the code. The JIRA will also allow others to review and comment on the

Re: DiskErrorException and Error reading task output

2009-07-18 Thread Todd Lipcon
Hi Akhil, Your mapred.local.dir is pointing to a directory which either does not have permissions for the user running the daemon, or has been removed. Check that configuration variable and make sure it's pointing to a directory that's writable by the hadoop user. -Todd On Sat, Jul 18, 2009 at
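For reference, a minimal sketch of checking the configured directories by hand, assuming the org.apache.hadoop.util.DiskChecker utility (the class behind DiskErrorException); the directory paths are hypothetical:

    import java.io.File;
    import org.apache.hadoop.util.DiskChecker;
    import org.apache.hadoop.util.DiskChecker.DiskErrorException;

    public class LocalDirCheck {
        public static void main(String[] args) {
            // Hypothetical value of mapred.local.dir; list each configured dir.
            String[] localDirs = {"/data/1/mapred/local", "/data/2/mapred/local"};
            for (String dir : localDirs) {
                try {
                    // Throws DiskErrorException if the directory is missing,
                    // not a directory, or not readable/writable.
                    DiskChecker.checkDir(new File(dir));
                    System.out.println(dir + ": OK");
                } catch (DiskErrorException e) {
                    System.out.println(dir + ": " + e.getMessage());
                }
            }
        }
    }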

Re: question on map merge-process

2009-07-17 Thread Todd Lipcon
. -Todd On Fri, Jul 17, 2009 at 12:54 AM, Todd Lipcon t...@cloudera.com wrote: Hi, Your understanding of the merge sort process seems correct, but I'm not quite sure what your question is. The merge process here is on the output side of the map task, so input splits don't factor

Re: Why /tmp directory?

2009-07-17 Thread Todd Lipcon
Hi Akhil, That's the default configuration, but it's not meant for actual use in a cluster. You should be manually setting dfs.data.dir, dfs.name.dir, and mapred.local.dir on your cluster to point to the disks you want Hadoop to use. The use of /tmp as a default is because it's a convenient
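Purely for illustration, a Java sketch showing the three property names Todd mentions with hypothetical mount points; in practice these are set in the cluster's hadoop-site.xml / hdfs-site.xml rather than in code:

    import org.apache.hadoop.conf.Configuration;

    public class StorageDirs {
        public static Configuration withRealDisks() {
            Configuration conf = new Configuration();
            // Hypothetical mount points; each property takes a comma-separated
            // list of directories, one per physical disk.
            conf.set("dfs.name.dir", "/data/1/dfs/nn,/data/2/dfs/nn");
            conf.set("dfs.data.dir", "/data/1/dfs/dn,/data/2/dfs/dn");
            conf.set("mapred.local.dir",
                     "/data/1/mapred/local,/data/2/mapred/local");
            return conf;
        }
    }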

Re: Data-local map tasks lower than Launched map tasks even with full replication

2009-07-17 Thread Todd Lipcon
Hi Seunghwa, It's important to note that changing the dfs.replication config variable does not change the current files in HDFS. You have to use fs -setrep on those files to change their replication count. The replication count is set when a file is created and not modified thereafter unless
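For reference, a minimal sketch of the programmatic equivalent of fs -setrep, assuming the FileSystem#setReplication API; the input path and replication factor are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RaiseReplication {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // Raise the replication of already-existing input files; new writes
            // pick up dfs.replication, but existing files keep their old count
            // until changed explicitly.
            for (FileStatus stat : fs.listStatus(new Path("input"))) {
                fs.setReplication(stat.getPath(), (short) 4);
            }
        }
    }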

Re: Data-local map tasks lower than Launched map tasks even with full replication

2009-07-17 Thread Todd Lipcon
On Fri, Jul 17, 2009 at 4:16 PM, Seunghwa Kang s.k...@gatech.edu wrote: I checked with bin/hadoop fs -stat %n %r input/* part-0 4 part-1 4 part-2 4 part-3 4 part-4 4 part-5 4 part-6 4 part-7 4 and see replication factor is 4. Also, I set replication

Re: Compiling libhdfs in 0.20.0 release

2009-07-14 Thread Todd Lipcon
Hi Ryan, To fix this you can simply chmod 755 that configure script referenced in the error. There is a JIRA for this that I think got committed that adds another chmod task to build.xml, but it may not be in 0.20.0. Thanks -Todd On Tue, Jul 14, 2009 at 11:36 AM, Ryan Smith

Re: Compiling libhdfs in 0.20.0 release

2009-07-14 Thread Todd Lipcon
/home/rsmith/hadoop-0.20.0/build.xml:1405: exec returned: 2 Total time: 5 seconds -- On Tue, Jul 14, 2009 at 3:32 PM, Todd Lipcon t...@cloudera.com wrote: Hi Ryan, I've never seen that issue. It sounds to me like your C

Re: Compiling libhdfs in 0.20.0 release

2009-07-14 Thread Todd Lipcon
PM, Todd Lipcon t...@cloudera.com wrote: Hi Ryan, Sounds like HADOOP-5611: https://issues.apache.org/jira/browse/HADOOP-5611 -Todd On Tue, Jul 14, 2009 at 12:49 PM, Ryan Smith ryan.justin.sm...@gmail.com wrote: Hello, My problem was I didn't have g++ installed. :) So

Re: Job Startup Time

2009-07-13 Thread Todd Lipcon
Hi Mu, Small job overhead is something that has been worked on a bit in recent versions, but here's the gist of it (as best as I know, though I don't work much in this area of the code): - The JobTracker doesn't assign tasks forcefully to TaskTrackers. Instead, the TaskTrackers send heartbeats

Re: Native libraries for multiple architectures?

2009-07-10 Thread Todd Lipcon
Hi Stuart, Hadoop itself doesn't have any nice way of dealing with this that I know of. I think your best bet is to do something like: String dataModel = System.getProperty("sun.arch.data.model"); if ("32".equals(dataModel)) { System.loadLibrary("mylib_32bit"); } else if ("64".equals(dataModel)) {
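Completed into a self-contained sketch of that suggestion; the library names are hypothetical, and the "sun.arch.data.model" property is specific to Sun/Oracle JVMs:

    public class NativeLoader {
        static {
            // Pick the native library variant matching the JVM's data model.
            // "sun.arch.data.model" is "32" or "64" on Sun/Oracle JVMs.
            String dataModel = System.getProperty("sun.arch.data.model");
            if ("32".equals(dataModel)) {
                System.loadLibrary("mylib_32bit");   // loads libmylib_32bit.so
            } else if ("64".equals(dataModel)) {
                System.loadLibrary("mylib_64bit");   // loads libmylib_64bit.so
            } else {
                throw new UnsatisfiedLinkError(
                    "Unknown data model: " + dataModel);
            }
        }
    }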

Re: Sort-Merge Join using Map Reduce.

2009-07-09 Thread Todd Lipcon
Hi Pankil, Basically there are two steps here - the first is to sort the two files. This can be done using a MapReduce job where the mapper extracts the join column as the key. If you make sure you have the same number of reducers (and partition by the equijoin column) for both sorts, then you'll end
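A minimal sketch of that first step using the old org.apache.hadoop.mapred API: a mapper that pulls the join column out as the key so both inputs can be sorted and partitioned identically. The tab-separated, join-column-first record layout is a hypothetical example:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Emits the equijoin column as the key so that, with the same number of
    // reducers and the default hash partitioner for both inputs, matching
    // rows land in corresponding reducer outputs that can be merge-joined.
    public class JoinKeyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        // Hypothetical layout: tab-separated rows, join column first.
        String[] fields = line.toString().split("\t", 2);
        String rest = fields.length > 1 ? fields[1] : "";
        out.collect(new Text(fields[0]), new Text(rest));
      }
    }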

Re: Sort-Merge Join using Map Reduce.

2009-07-09 Thread Todd Lipcon
, I got the concept but I have no idea about side input in mapper class. Can you guide me more on that? Pankil On Thu, Jul 9, 2009 at 1:39 PM, Todd Lipcon t...@cloudera.com wrote: Hi Pankil, Basically there are two steps here - the first is to sort the two files. This can be done

Re: HDFS and long-running processes

2009-07-03 Thread Todd Lipcon
Hi David, I'm unaware of any issue that would cause memory leaks when a file is open for read for a long time. There are some issues currently with write pipeline recovery when a file is open for writing for a long time and the datanodes to which it's writing fail. So, I would not recommend
