Re: migrate cluster to different datacenter

2012-08-07 Thread Patrick Angeles
It would help to know your data ingest and processing patterns (and any applicable SLAs). In most cases, you'd only need to move the raw ingested data, then you can derive the rest in the other cluster. Assuming that you have some sort of date-based partitioning on the ingest, then it's easy to de

Re: Hadoop Datacenter Setup

2012-01-30 Thread Patrick Angeles
Hey Aaron, I'm still skeptical when it comes to flash drives, especially as pertains to Hadoop. The write-cycle limit makes them impractical for dfs.data.dir and mapred.local.dir, and as you pointed out, you can't use them for logs either. If you put HADOOP_LOG_DIR in /mnt/d0, you wil

Re: Are hadoop fs commands serial or parallel

2011-05-18 Thread Patrick Angeles
kinda clunky but you could do this via shell: for FILE in $LIST_OF_FILES ; do hadoop fs -copyFromLocal $FILE $DEST_PATH & done If doing this via the Java API, then yes, you will have to use multiple threads. On Wed, May 18, 2011 at 1:04 AM, Mapred Learn wrote: > Thanks harsh ! > That means b
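The shell loop above forks one copyFromLocal per file and relies on the shell's `&` for parallelism. A minimal sketch of the multi-threaded approach mentioned for the Java API, in Python, with a plain local copy standing in for the Hadoop FileSystem call (the copy function and directory names are illustrative, not from the thread):

```python
import os
import shutil
from concurrent.futures import ThreadPoolExecutor

def copy_one(src, dest_dir):
    """Stand-in for a single 'hadoop fs -copyFromLocal' call."""
    dst = os.path.join(dest_dir, os.path.basename(src))
    shutil.copy(src, dst)
    return dst

def parallel_copy(files, dest_dir, workers=4):
    """Copy files concurrently, one task per file, like the shell loop."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda f: copy_one(f, dest_dir), files))
```

The same pattern applies with `FileSystem.copyFromLocalFile` inside each worker thread.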

Re: Hadoop Framework questions.

2011-02-01 Thread Patrick Angeles
ta? > 4 Any idea about the loss of efficiency due to the infrastructure? > 5. The 256 node cluster seems to provide an efficiency of 1.3 while 512 > nodes it decreases to around 1.2. Would this trend continue? > Yes, because of the network capacity. > > Raj > > Raj > >

Re: Hadoop Framework questions.

2011-02-01 Thread Patrick Angeles
Hi, Raj. Interesting analysis... These numbers appear to be off. For example, 405s for mappers + 751s for reducers = 1156s for all tasks. If you have 2000 map and reduce tasks, this means each task is spending roughly 500ms to do actual work. That is a very low number and seems impossible. - P

Re: Has anyone tried rolling upgrade in hadoop?

2010-11-10 Thread Patrick Angeles
This is currently not possible with Hadoop, as the communications protocols between clients and servers have to be the same version. On Wed, Nov 10, 2010 at 6:25 AM, Gokulakannan M wrote: > Hi all, > > > >It will be a good sharing if some tips are given in performing > rolling upgrad

Re: Hadoop partitions Problem

2010-11-08 Thread Patrick Angeles
There could be a number of reasons. It could be directory permissions problems with the partitions (user 'hadoop' cannot rwx). It could be typos in the dfs.data.dir config. The directories are checked on datanode startup only. Regards, - Patrick On Mon, Nov 8, 2010 at 9:42 AM, Sudhir Vallamkond

Re: Large amount of corruption after balancer

2010-10-27 Thread Patrick Angeles
Nick, The corruption may have been caused by running out of disk space. At that point, even after rebalancing, you will still have corruption. Under normal circumstances, balancing by itself should not result in corruption. Regards, - Patrick On Wed, Oct 27, 2010 at 9:40 AM, Jones, Nick wrote:

Re: Small File Management

2010-10-26 Thread Patrick Angeles
HBase might fit the bill. On Tue, Oct 26, 2010 at 12:28 PM, Ananth Sarathy wrote: > I was wondering if there were any projects out there doing a small file > management layer on top of Hadoop? I know that HDFS is primarily for > map/reduce but I think companies are going to start using hdfs clus

Re: CDH3 beta 3

2010-10-26 Thread Patrick Angeles
This is not CDH3 specific... it's related to the Kerberos security patch, so these upgrade issues will pop up in the Y! distribution, and eventually in 0.22 as well. These aren't bugs in the code per se; it's just that the upgrade process going from pre- to post-security is somewhat tricky, and c

Re: Problem with DistributedCache after upgrading to CDH3b2

2010-10-06 Thread Patrick Angeles
Kim, Jamie, This might be a particular issue with the Cloudera distro, specifically with the AsyncDiskService related patches that were applied to 0.20.2+320 (aka CDH3b2). I've created an issue here: https://issues.cloudera.org/browse/DISTRO-39 I encourage you (and anyone else reading this) to

Re: Hadoop 0.20.2: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_201008131730_0001/attempt_201008131730_0001_m_000000_2/output/file.out.index in any of

2010-08-16 Thread Patrick Angeles
I'd also recommend setting mapred.local.dir and dfs.data.dir to something that is not under /tmp. Aside from the risk of your HDFS data getting wiped, these settings should ideally be comma-separated lists of paths, one for each physical disk in your server, so you can aggregate disk I/O. 2010/8/15 Kevin . > > Hi, H
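A sketch of what that might look like for a four-disk box (the /data/N mount points are hypothetical examples, not from the message):

```xml
<!-- hdfs-site.xml: one directory per physical disk -->
<property>
  <name>dfs.data.dir</name>
  <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn</value>
</property>

<!-- mapred-site.xml -->
<property>
  <name>mapred.local.dir</name>
  <value>/data/1/mapred/local,/data/2/mapred/local,/data/3/mapred/local,/data/4/mapred/local</value>
</property>
```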

Re: MultipleOutputFormat

2010-08-16 Thread Patrick Angeles
In this case, don't bother with MultipleOutputs. Specify 2 reducers, and a custom partitioner that sends 'even' records to partition 0, and 'odd' records to partition 1. You will have two output files named 'part-0' and 'part-1' corresponding to even and odd. On Mon, Aug 16, 2010 at 2:
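The partitioner logic is just a modulus on the record key. A toy sketch in Python (the real thing would be a Hadoop `Partitioner` subclass in Java; the helper names here are illustrative):

```python
def get_partition(key, num_partitions=2):
    """Route even integer keys to partition 0, odd keys to partition 1."""
    return key % num_partitions

def split_by_parity(records):
    """Group (key, value) records the way the two reducers would see them."""
    parts = {0: [], 1: []}
    for key, value in records:
        parts[get_partition(key)].append((key, value))
    return parts
```

Each reducer then writes its own part file, so even and odd records end up in separate outputs with no extra framework machinery.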

Re: java.lang.OutOfMemoryError: Java heap space

2010-07-12 Thread Patrick Angeles
Shuja, Those settings (mapred.child.java.opts and mapred.child.ulimit) are only used for child JVMs that get forked by the TaskTracker. You are using Hadoop streaming, which means the TaskTracker is forking a JVM for streaming, which is then forking a shell process that runs your groovy code (in an

Re: decommissioning nodes help

2010-07-08 Thread Patrick Angeles
Arun, Did you specify dfs.hosts.exclude before the NameNode started? If not, you will have to restart the NameNode. Otherwise, just kill the DataNode. On Thu, Jul 8, 2010 at 10:01 PM, Arun Ramakrishnan < aramakrish...@languageweaver.com> wrote: > When I run fsck everything seems fine. Nothing is

Re: HDFS without Consideration for Map and Reduce

2010-07-06 Thread Patrick Angeles
If all you want is dumb storage for small-ish files, you can always just use NAS or SAN. For the MP3 example, you might want to consider HBase... you can store associated meta-data in column families. On Tue, Jul 6, 2010 at 3:33 PM, Ananth Sarathy wrote: > So I am aware of the problem with small

Re: How to create a temporary HDFS file system using Java?

2010-06-09 Thread Patrick Angeles
Hey Stan, There's really no way to programmatically spin up an HDFS cluster. What's your actual goal? Regards, - Patrick p.s., Thanks for all the great comix! ;-) On Wed, Jun 9, 2010 at 4:48 AM, stan lee wrote: > Hi Experts, > > Although HDFS file system has exposed some APIs which can be us

Re: Any possible to set hdfs block size to a value smaller than 64MB?

2010-05-18 Thread Patrick Angeles
s) the only > metric that matters, it seems to me like something very interesting to check > out... > I have hierarchy over me and they will be happy to understand my choices > with real numbers to base their understanding on. > Thanks. > > > On Tue, May 18, 2010 at 5:00 PM, P

Re: Any possible to set hdfs block size to a value smaller than 64MB?

2010-05-18 Thread Patrick Angeles
Should be evident in the total job running time... that's the only metric that really matters :) On Tue, May 18, 2010 at 10:39 AM, Pierre ANCELOT wrote: > Thank you, > Any way I can measure the startup overhead in terms of time? > > > On Tue, May 18, 2010 at 4:27 PM, Pat

Re: Any possible to set hdfs block size to a value smaller than 64MB?

2010-05-18 Thread Patrick Angeles
Pierre, Adding to what Brian has said (some things are not explicitly mentioned in the HDFS design doc)... - If you have small files that take up less than 64MB, you do not actually use the entire 64MB block on disk. - You *do* use up RAM on the NameNode, as each block represents meta-data that needs to b
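A back-of-the-envelope sketch of that NameNode memory cost, using the commonly cited rough figure of ~150 bytes of heap per namespace object (file or block); the 150-byte figure is an assumption for illustration, not from the original message:

```python
def estimate_namenode_heap_bytes(num_files, avg_blocks_per_file=1.0,
                                 bytes_per_object=150):
    """Rough NameNode heap: one object per file plus one per block."""
    objects = num_files * (1 + avg_blocks_per_file)
    return int(objects * bytes_per_object)
```

By this estimate, ten million single-block files would need on the order of 3 GB of NameNode heap just for metadata, which is why lots of small files (or a tiny block size) hurt.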

Re: Best Way Repartitioning a 100-150TB Table

2010-05-10 Thread Patrick Angeles
Matias, Hive partitions map to subdirectories in HDFS. You can do a 'mv' if you're lucky enough to have each partition in a distinct HDFS file that could be moved to the right partition subdirectory. Otherwise, you can run a MapReduce job to collate your data into separate files per partition. You

Re: Sensage to Hadoop conversion?

2010-04-29 Thread Patrick Angeles
From what I understand about Sensage, they collect enterprise data to facilitate compliance driven audits. Of course this can be done, and done very well in Hadoop. But, at the moment there are no specific off-the-shelf compliance products based on Hadoop that you can just drop into your environme

Re: Algorithm used "Shuffle and Sort" step

2010-04-28 Thread Patrick Angeles
Dan, Shuffle and Sort is a combination of multiple 'algorithms'. - Map output goes to a circular, in-memory buffer - When this starts filling up, it gets 'spilled' to disk - Spilling involves writing each K/V pair to a partition specific file (where partition is the algorithm Jim describes below)
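A toy sketch of the spill behavior described above, with Python lists standing in for the in-memory buffer and the spill files (the 0.8 threshold mirrors the io.sort.spill.percent default; the buffer is counted in records rather than bytes for simplicity):

```python
def map_side_collect(records, num_partitions, buffer_capacity,
                     spill_percent=0.8):
    """Buffer map output; when the buffer passes the threshold, sort
    the buffered records by (partition, key) and emit one 'spill'."""
    threshold = max(1, int(buffer_capacity * spill_percent))
    spills, buffer = [], []
    for key, value in records:
        partition = key % num_partitions  # stand-in for the Partitioner
        buffer.append((partition, key, value))
        if len(buffer) >= threshold:
            spills.append(sorted(buffer))  # each spill is a sorted run
            buffer = []
    if buffer:  # final flush when the mapper finishes
        spills.append(sorted(buffer))
    return spills
```

The real framework then merges these sorted runs into one partitioned, sorted map output file that the reducers fetch.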

Re: Hadoop overhead

2010-04-08 Thread Patrick Angeles
Packaging the job and config and sending it to the JobTracker and various nodes also adds a few seconds overhead. On Thu, Apr 8, 2010 at 10:37 AM, Jeff Zhang wrote: > By default, for each task hadoop will create a new jvm process which will > be > the major cost in my opinion. You can customize

Re: losing network interfaces during long running map-reduce jobs

2010-04-02 Thread Patrick Angeles
Hi David, Strange indeed. I assume nothing in your configs changed. Anything funny in the logs? You should also rule out the switch itself as being faulty. It's possible that CDH2 has a patch that's not in 0.20.1 that's causing this problem, but we haven't heard this exact problem from any of our

Re: Manually splitting files in blocks

2010-03-26 Thread Patrick Angeles
My understanding (please correct me, list) is that hadoop will always split > your files based on the block size setting. The InputSplit and > RecordReaders > are used by jobs to retrieve chunks of files for processing - that is, > there > are two separate splits happening here: one "physical" split

Re: Manually splitting files in blocks

2010-03-24 Thread Patrick Angeles
Yuri, Probably the easiest thing is to actually create distinct files and configure the block size per file such that HDFS doesn't split it into smaller blocks for you. - P On Wed, Mar 24, 2010 at 11:23 AM, Yuri K. wrote: > > Dear Hadoopers, > > i'm trying to find out how and where hadoop spli

Re: Efficiently Stream into Sequence Files?

2010-03-15 Thread Patrick Angeles
Scott, The code you have below should work, provided that the 'outputPath' points to an HDFS file. The trick is to get FTP/SCP access to the remote files using a Java client and receive the contents into a byte buffer. You can then set that byte buffer into your BytesWritable and call writer.append

Re: I want to group "similar" keys in the reducer.

2010-03-15 Thread Patrick Angeles
You can use a custom Partitioner to send keys to a specific reducer. Note that your reducer will still process one key at a time. On Mon, Mar 15, 2010 at 1:26 PM, Raymond Jennings III wrote: > Is it possible to override a method in the reducer so that similar keys > will be grouped together? Fo
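One sketch of such a partitioner: hash a normalized form of the key (here a hypothetical 3-character prefix; the function name and prefix length are illustrative) so that "similar" keys land on the same reducer, which will still call reduce() once per distinct key:

```python
def similar_key_partition(key, num_reducers, prefix_len=3):
    """Keys sharing the same prefix hash to the same reducer."""
    return hash(key[:prefix_len]) % num_reducers
```

Grouping similar keys into one reduce() call would additionally need a grouping comparator, but routing them to the same reducer is the partitioner's job.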

Re: New or old Map/Reduce API ?

2010-02-23 Thread Patrick Angeles
Hello Kai, To answer your questions: - Most of the missing stuff from the new API are convenience classes -- InputFormats, OutputFormats, etc. One very handy class that is missing from the new API is MultipleOutputs which allows you to write multiple files in a single pass. - You cannot mix class

Re: hosting recommandation for a small hadoop cluster?

2010-02-16 Thread Patrick Angeles
Thomas, Owning your machines and renting a 1/2 cabinet in a colo facility is the cheapest way to go in the long run. That said, you could also try www.softlayer.com. You'll get a good idea of the pricing up front as they allow you to configure everything on the website. You can also get the Cloud

Re: io.sort.mb configuration?

2009-12-22 Thread Patrick Angeles
You can also set that param per-job. Maybe you called some code that did that behind the scenes? On Tue, Dec 22, 2009 at 11:10 AM, Mark Vigeant wrote: > Hey Everyone- > > I've been playing around with Hadoop and Hbase for a while and I noticed > that when running a program to upload data into an

Re: how does hadoop work?

2009-12-21 Thread Patrick Angeles
DS, What you say is true, but there are finer points: 1. Data transfer can begin while the mapper is working through the data. You would still bottleneck on the network if: (a) you have enough nodes and spindles such that the aggregate disk transfer speed is greater than the network c
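Point (a) above is a simple throughput comparison; a quick sketch (the example figures in the test, e.g. ~100 MB/s per spindle and ~125 MB/s for a GigE NIC, are assumed round numbers, not from the thread):

```python
def scan_bottleneck(nodes, disks_per_node, disk_mb_s, nic_mb_s):
    """Which resource saturates first for a full-scan job:
    aggregate disk bandwidth vs. aggregate network capacity."""
    aggregate_disk = nodes * disks_per_node * disk_mb_s
    aggregate_net = nodes * nic_mb_s
    return "network" if aggregate_disk > aggregate_net else "disk"
```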

Re: Please help me to understand this error messages (-;

2009-11-30 Thread Patrick Angeles
The '_' character is not legal for hostnames. On Mon, Nov 30, 2009 at 4:25 PM, pavel kolodin wrote: > > Namenode won't start with this messages: > > hadoop-0.20.1/logs/hadoop-hadoop-namenode-hadoop_master.log: > > http://pastebin.com/m359b9e24 > > Thank you. >

Re: Identifying lines in map()

2009-11-29 Thread Patrick Angeles
Interesting... you have more tokens per line than total lines? LineRecordReader conveys the byte offset of the line (not the line number) as the key in the mapper. That offset is unique within a file, so you could combine it with the input file name (or task ID) for a globally unique identifier. However, based

Re: Processing 10MB files in Hadoop

2009-11-27 Thread Patrick Angeles
What does the data look like? You mention 30k records, is that for 10MB or for 600MB, or do you have a constant 30k records with vastly varying file sizes? If the data is 10MB and you have 30k records, and it takes ~2 mins to process each record, I'd suggest using map to distribute the data acros
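To see why distributing records via map matters at ~2 minutes per record, a quick back-of-the-envelope (the 100-slot cluster in the test is an assumed size, not from the thread):

```python
def serial_vs_parallel_hours(num_records, secs_per_record, map_slots):
    """Rough wall-clock hours: all records on one machine vs.
    spread evenly over the cluster's map slots."""
    serial = num_records * secs_per_record / 3600.0
    parallel = serial / map_slots
    return serial, parallel
```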

Re: part-00000.deflate as output

2009-11-27 Thread Patrick Angeles
You can always do hadoop fs -text <file> This will 'cat' the file for you, and decompress it if necessary. On Thu, Nov 26, 2009 at 7:59 PM, Mark Kerzner wrote: > It worked! > > But why is it "for testing?" I only have one job, so I need by related as > text, can I use this fix all the time? > > Than

Re: Secondary NameNodes or NFS exports?

2009-10-22 Thread Patrick Angeles
From what I understand, it's rather tricky to set up multiple secondary namenodes. In either case, running multiple 2ndary NNs doesn't get you much. See this thread: http://www.mail-archive.com/core-u...@hadoop.apache.org/msg06280.html On Wed, Oct 21, 2009 at 10:44 AM, Stas Oskin wrote: > To cl

Re: Hardware Setup

2009-10-15 Thread Patrick Angeles
On Thu, Oct 15, 2009 at 12:32 PM, Edward Capriolo wrote: > > >>No need for dedicated SATA drives with > >>RAID for your OS. Most of that is accessed during boot time so it won't > >>contend that much with HDFS. > > You may want to RAID your OS. If you lose a datanode with a large > volume of data

Re: Hardware Setup

2009-10-15 Thread Patrick Angeles
After the discount, an equivalently configured Dell comes about 10-20% over the Silicon Mechanics price. It's close enough that unless you're spending 100k it won't make that much of a difference. Talk to a rep, call them out on the ridiculous drive pricing, buy at the end of their fiscal quarter.

Re: Hardware performance from HADOOP cluster

2009-10-15 Thread Patrick Angeles
Hi Tim, I assume those are single proc machines? I got 649 secs on 70GB of data for our 7-node cluster (~11 mins), but we have dual quad Nehalems (2.26Ghz). On Thu, Oct 15, 2009 at 11:34 AM, tim robertson wrote: > Hi Usmam, > > So on my 10 node cluster (9 DN) with 4 maps and 4 reduces (I plan on

normal hadoop errors?

2009-10-15 Thread Patrick Angeles
I got the following error while running the example sort program (hadoop 0.20) on a brand new Hadoop cluster (using the Cloudera distro). The job seems to have recovered. However I'm wondering if this is normal or should I be checking for something. attempt_200910051513_0009_r_05_0: 09/10/15

Re: Advice on new Datacenter Hadoop Cluster?

2009-10-01 Thread Patrick Angeles
0, 2009 at 9:06 PM, Ted Dunning > wrote: > > > 2TB drives are just now dropping to parity with 1TB on a $/GB basis. > > > > If you want space rather than speed, this is a good option. If you want > > speed rather than space, more spindles and smaller disks are better. >

Re: Advice on new Datacenter Hadoop Cluster?

2009-09-30 Thread Patrick Angeles
We went with 2 x Nehalems, 4 x 1TB drives and 24GB RAM. The ram might be overkill... but it's DDR3 so you get either 12 or 24GB. Each box has 16 virtual cores so 12GB might not have been enough. These boxes are around $4k each, but can easily outperform any $1K box dollar per dollar (and performanc