Re: HDFS Safemode and EC2 EBS?

2009-06-25 Thread Tom White
Hi Chris, You should really start all the slave nodes to be sure that you don't lose data. If you start fewer than #nodes - #replication + 1 nodes then you are virtually guaranteed to lose blocks (with 10 nodes and replication 3, that means starting at least 10 - 3 + 1 = 8). Starting 6 nodes out of 10 will cause the filesystem to remain in safe mode, as you've seen. BTW I'm

Re: Rebalancing Hadoop Cluster running 15.3

2009-06-25 Thread Tom White
You can change the value of hadoop.root.logger in conf/log4j.properties to change the log level globally. See also the section "Custom Logging levels" in the same file to set levels on a per-component basis. You can also use hadoop daemonlog to set log levels on a temporary basis (they are reset o
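
A quick sketch of the approaches described above (the class name and host:port are illustrative):

    # conf/log4j.properties -- global default
    hadoop.root.logger=DEBUG,console

    # temporary change on a running daemon (reverts on restart)
    bin/hadoop daemonlog -setlevel jobtracker-host:50030 \
        org.apache.hadoop.mapred.JobTracker DEBUG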

Re: Rebalancing Hadoop Cluster running 15.3

2009-06-25 Thread Tom White
Hi Usman, Before the rebalancer was introduced one trick people used was to increase the replication on all the files in the system, wait for re-replication to complete, then decrease the replication to the original level. You can do this using hadoop fs -setrep. Cheers, Tom On Thu, Jun 25, 2009
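
For example, on a cluster with replication 3 (path and factor are illustrative):

    hadoop fs -setrep -R -w 4 /user/data   # raise replication and wait for it to finish
    hadoop fs -setrep -R 3 /user/data      # drop back; excess replicas are removed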

Re: Unable to run Jar file in Hadoop.

2009-06-25 Thread Tom White
Hi Krishna, You get this error when the jar file cannot be found. It looks like /user/hadoop/hadoop-0.18.0-examples.jar is an HDFS path, when in fact it should be a local path. Cheers, Tom On Thu, Jun 25, 2009 at 9:43 AM, krishna prasanna wrote: > Oh! thanks Shravan > > Krishna. > > > >
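
That is, give hadoop jar a path on the local filesystem, e.g. (local path is illustrative):

    hadoop jar /home/hadoop/hadoop-0.18.0-examples.jar wordcount input output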

Re: Problem with setting up the cluster

2009-06-25 Thread Tom White
Have a look at the datanode log files on the datanode machines and see what the error is in there. Cheers, Tom On Thu, Jun 25, 2009 at 6:21 AM, .ke. sivakumar wrote: > Hi all, I'm a student and I have been tryin to set up the hadoop cluster for > a while > but have been unsuccessful till now. > >

Re: Is it possible? I want to group data blocks.

2009-06-24 Thread Tom White
You might be interested in https://issues.apache.org/jira/browse/HDFS-385, where there is discussion about how to add pluggable block placement to HDFS. Cheers, Tom On Tue, Jun 23, 2009 at 5:50 PM, Alex Loddengaard wrote: > Hi Hyunsik, > > Unfortunately you can't control the servers that blocks g

Re: Looking for correct way to implements WritableComparable in Hadoop-0.17

2009-06-24 Thread Tom White
Hi Kun, The book's code is for 0.20.0. In Hadoop 0.17.x WritableComparable was not generic, so you need a declaration like: public class IntPair implements WritableComparable { } And the compareTo() method should look like this: public int compareTo(Object o) { IntPair ip = (IntPair) o;
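
A fuller sketch of the 0.17-style (non-generic) class; the field names and comparison logic are illustrative:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.WritableComparable;

    public class IntPair implements WritableComparable {
      private int first;
      private int second;

      public void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
      }

      public void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
      }

      public int compareTo(Object o) {
        IntPair ip = (IntPair) o;
        if (first != ip.first) {
          return first < ip.first ? -1 : 1;  // avoid int-subtraction overflow
        }
        return second < ip.second ? -1 : (second == ip.second ? 0 : 1);
      }
    }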

Re: EC2, Hadoop, copy file from CLUSTER_MASTER to CLUSTER, failing

2009-06-24 Thread Tom White
Hi Saptarshi, The group permissions open the firewall ports to enable access, but there are no shared keys on the cluster by default. See https://issues.apache.org/jira/browse/HADOOP-4131 for a patch to the scripts that shares keys to allow SSH access between machines in the cluster. Cheers, Tom

Re: Running Hadoop/Hbase in a OSGi container

2009-06-11 Thread Tom White
Hi Ninad, I don't know if anyone has looked at this for Hadoop Core or HBase (although there is this Jira: https://issues.apache.org/jira/browse/HADOOP-4604), but there's some work for making ZooKeeper's jar OSGi compliant at https://issues.apache.org/jira/browse/ZOOKEEPER-425. Cheers, Tom On Th

Re: Command-line jobConf options in 0.18.3

2009-06-04 Thread Tom White
Actually, the space is needed: with it, the option is interpreted as a Hadoop configuration option by ToolRunner. Without the space it sets a Java system property, which Hadoop will not automatically pick up. Ian, try putting the options after the classname and see if that helps. Otherwise, it would be useful to see a snippet o
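
The distinction, assuming the job's main class goes through ToolRunner (jar and class names are illustrative):

    # with the space: sets the Hadoop property mapred.reduce.tasks
    hadoop jar myjob.jar MyJob -D mapred.reduce.tasks=5 input output

    # without the space: sets a JVM system property only; Hadoop ignores it
    hadoop jar myjob.jar MyJob -Dmapred.reduce.tasks=5 input output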

Re: SequenceFile and streaming

2009-05-28 Thread Tom White
Hi Walter, On Thu, May 28, 2009 at 6:52 AM, walter steffe wrote: > Hello >  I am a new user and I would like to use hadoop streaming with > SequenceFile on both the input and output sides. > > -The first difficulty arises from the lack of a simple tool to generate > a SequenceFile starting from a set
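
No such tool ships with Hadoop at this point, but a small local program can do it. A minimal sketch (paths and the Text/BytesWritable key/value choice are assumptions), packing each local file into one record keyed by its name:

    import java.io.DataInputStream;
    import java.io.File;
    import java.io.FileInputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class FilesToSequenceFile {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path("input.seq"), Text.class, BytesWritable.class);
        for (File f : new File("/local/files").listFiles()) {
          byte[] data = new byte[(int) f.length()];
          DataInputStream in = new DataInputStream(new FileInputStream(f));
          in.readFully(data);
          in.close();
          writer.append(new Text(f.getName()), new BytesWritable(data));
        }
        writer.close();
      }
    }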

Re: InputFormat for fixed-width records?

2009-05-28 Thread Tom White
Hi Stuart, There isn't an InputFormat that comes with Hadoop to do this. Rather than pre-processing the file, it would be better to implement your own InputFormat. Subclass FileInputFormat and provide an implementation of getRecordReader() that returns your implementation of RecordReader to read f
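
A sketch of the shape this takes in the 0.19-era API. The record length, and the assumption that files contain no partial trailing record, are hypothetical:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapred.*;

    public class FixedWidthInputFormat
        extends FileInputFormat<LongWritable, BytesWritable> {
      static final int RECORD_LEN = 100; // assumed fixed record width in bytes

      public RecordReader<LongWritable, BytesWritable> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new FixedWidthRecordReader((FileSplit) split, job);
      }

      static class FixedWidthRecordReader
          implements RecordReader<LongWritable, BytesWritable> {
        private final FSDataInputStream in;
        private final long start; // first record boundary at or after the split start
        private final long end;   // first byte past this split
        private long pos;

        FixedWidthRecordReader(FileSplit split, JobConf job) throws IOException {
          FileSystem fs = split.getPath().getFileSystem(job);
          in = fs.open(split.getPath());
          start = ((split.getStart() + RECORD_LEN - 1) / RECORD_LEN) * RECORD_LEN;
          end = split.getStart() + split.getLength();
          pos = start;
        }

        public boolean next(LongWritable key, BytesWritable value) throws IOException {
          if (pos >= end) {
            return false; // a record starting past the split end belongs to the next split
          }
          byte[] record = new byte[RECORD_LEN];
          in.readFully(pos, record); // positioned read; may cross the split boundary
          key.set(pos);
          value.set(record, 0, RECORD_LEN);
          pos += RECORD_LEN;
          return true;
        }

        public LongWritable createKey() { return new LongWritable(); }
        public BytesWritable createValue() { return new BytesWritable(); }
        public long getPos() { return pos; }
        public float getProgress() {
          return end == start ? 1.0f : (pos - start) / (float) (end - start);
        }
        public void close() throws IOException { in.close(); }
      }
    }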

Re: avoid custom crawler getting blocked

2009-05-27 Thread Tom White
Have you had a look at Nutch (http://lucene.apache.org/nutch/)? It has solved this kind of problem. Cheers, Tom On Wed, May 27, 2009 at 9:58 AM, John Clarke wrote: > My current project is to gather stats from a lot of different documents. > We're are not indexing just getting quite specific stat

Re: When directly writing to HDFS, the data is moved only on file close

2009-05-26 Thread Tom White
This feature is not available yet, and is still under active discussion. (The current version of HDFS will make the previous block available to readers.) Michael Stack gave a good summary on the HBase dev list: http://mail-archives.apache.org/mod_mbox/hadoop-hbase-dev/200905.mbox/%3c7c962aed090523

Re: RandomAccessFile with HDFS

2009-05-25 Thread Tom White
RandomAccessFile isn't supported directly, but you can seek when reading from files in HDFS (see FSDataInputStream's seek() method). Writing at an arbitrary offset in an HDFS file is not supported however. Cheers, Tom On Sun, May 24, 2009 at 1:33 PM, Stas Oskin wrote: > Hi. > > Any idea if Rando
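
For example (path and offsets are illustrative):

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream in = fs.open(new Path("/data/file.bin"));
    in.seek(1024L);            // position at an arbitrary byte offset
    byte[] buf = new byte[512];
    in.readFully(buf);         // read from that offset onwards
    in.close();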

Re: Circumventing Hadoop's data placement policy

2009-05-23 Thread Tom White
You can't use it yet, but https://issues.apache.org/jira/browse/HADOOP-3799 (Design a pluggable interface to place replicas of blocks in HDFS) would enable you to write your own policy so blocks are never placed locally. Might be worth following its development to check it can meet your need? Chee

Re: Tutorial on building an AMI

2009-05-22 Thread Tom White
Hi Saptarshi, You can use the guide at http://wiki.apache.org/hadoop/AmazonEC2 to run Hadoop 0.19 or later on EC2. It includes instructions for building your own customized AMI. Cheers, Tom On Fri, May 22, 2009 at 7:11 PM, Saptarshi Guha wrote: > Hello, > Is there a tutorial available to build

Re: multiple results for each input line

2009-05-21 Thread Tom White
red me in > the right direction! > Thanks > John > > 2009/5/20 Tom White > >> Hi John, >> >> You could do this with a map only-job (using NLineInputFormat, and >> setting the number of reducers to 0), and write the output key as >> docnameN,stat1,stat2,st

Re: Shutdown in progress exception

2009-05-21 Thread Tom White
On Wed, May 20, 2009 at 10:22 PM, Stas Oskin wrote: >> >> You should only use this if you plan on manually closing FileSystems >> yourself from within your own shutdown hook. It's somewhat of an advanced >> feature, and I wouldn't recommend using this patch unless you fully >> understand the ramif

Re: Number of maps and reduces not obeying my configuration

2009-05-21 Thread Tom White
On Thu, May 21, 2009 at 5:18 AM, Foss User wrote: > On Wed, May 20, 2009 at 3:18 PM, Tom White wrote: >> The number of maps to use is calculated on the client, since splits >> are computed on the client, so changing the value of mapred.map.tasks >> only on the jobtracker wil

Re: Shutdown in progress exception

2009-05-20 Thread Tom White
Looks like you are trying to copy a file to HDFS in a shutdown hook. Since you can't control the order in which shutdown hooks run, this won't work. There is a patch to allow Hadoop's FileSystem shutdown hook to be disabled so it doesn't close filesystems on exit. See https://issues.apache.org/jir

Re: Linking against Hive in Hadoop development tree

2009-05-20 Thread Tom White
On Fri, May 15, 2009 at 11:06 PM, Owen O'Malley wrote: > > On May 15, 2009, at 2:05 PM, Aaron Kimball wrote: > >> In either case, there's a dependency there. > > You need to split it so that there are no cycles in the dependency tree. In > the short term it looks like: > > avro: > core: avro > hd

Re: multiple results for each input line

2009-05-20 Thread Tom White
Hi John, You could do this with a map-only job (using NLineInputFormat, and setting the number of reducers to 0), and write the output key as docnameN,stat1,stat2,stat3,stat12 and a null value. This assumes that you calculate all 12 statistics in one map. Each output file would have a single l
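
The job setup would look roughly like this (class name is a placeholder):

    JobConf conf = new JobConf(MyJob.class);
    conf.setInputFormat(NLineInputFormat.class);
    conf.setInt("mapred.line.input.format.linespermap", 1); // N input lines per mapper
    conf.setNumReduceTasks(0); // map-only: map output is written straight to HDFS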

Re: Number of maps and reduces not obeying my configuration

2009-05-20 Thread Tom White
The number of maps to use is calculated on the client, since splits are computed on the client, so changing the value of mapred.map.tasks only on the jobtracker will not have any effect. Note that the number of map tasks that you set is only a suggestion, and depends on the number of splits actual
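
In other words, pass the hint from the submitting client, e.g. (names are illustrative):

    # must be set where the job is submitted, since that's where splits are computed
    hadoop jar myjob.jar MyJob -D mapred.map.tasks=20 input output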

Re: Access to local filesystem working folder in map task

2009-05-19 Thread Tom White
Hi Chris, The task-attempt local working folder is actually just the current working directory of your map or reduce task. You should be able to pass your legacy command line exe and other files using the -files option (assuming you are using the Java interface to write your job, and you are imple
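
Assuming the job class implements Tool and is run via ToolRunner, the invocation might look like this (file names are illustrative):

    hadoop jar myjob.jar MyJob -files legacy.exe,lookup.dat input output

The listed files are copied into each task's working directory, so the mapper can invoke ./legacy.exe directly.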

Re: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-18 Thread Tom White
On Mon, May 18, 2009 at 11:44 AM, Steve Loughran wrote: > Grace wrote: >> >> To follow up this question, I have also asked help on Jrockit forum. They >> kindly offered some useful and detailed suggestions according to the JRA >> results. After updating the option list, the performance did become

Re: public IP for datanode on EC2

2009-05-14 Thread Tom White
within the cluster (and resolve to public ip addresses from outside). > > The only data transfer that I would incur while submitting jobs from outside > is the cost of copying the jar files and any other files meant for the > distributed cache). That would be extremely small. > > >

Re: public IP for datanode on EC2

2009-05-14 Thread Tom White
rk > just fine. I looked at the job.xml files of jobs submitted locally and > remotely and don't see any relevant differences. > > Totally foxed now. > > Joydeep > > -Original Message- > From: Joydeep Sen Sarma [mailto:jssa...@facebook.com] > Sent: Wednesd

Re: HDFS to S3 copy problems

2009-05-12 Thread Tom White
hese two > distributed reads vs a distributed read and a local write then local read. > > What do you think? > > Cheers, > Ian Nowland > Amazon.com > > -Original Message- > From: Tom White [mailto:t...@cloudera.com] > Sent: Friday, May 08, 2009 1:36 AM > To: co

Re: Mixing s3, s3n and hdfs

2009-05-08 Thread Tom White
Hi Kevin, The s3n filesystem treats each file as a single block, however you may be able to split files by setting the number of mappers appropriately (or setting mapred.max.split.size in the new MapReduce API in 0.20.0). S3 supports range requests, and the s3n implementation uses them, so it woul
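
For instance, in the job driver (values are illustrative; setNumMapTasks is only a hint):

    conf.setNumMapTasks(10);                          // pre-0.20 API: hint to split more finely
    conf.setLong("mapred.max.split.size", 134217728); // 0.20 new-API property (128 MB here)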

Re: HDFS to S3 copy problems

2009-05-08 Thread Tom White
Perhaps we should revisit the implementation of NativeS3FileSystem so that it doesn't always buffer the file on the client. We could have an option to make it write directly to S3. Thoughts? Regarding the problem with HADOOP-3733, you can work around it by setting fs.s3.awsAccessKeyId and fs.s3.aw
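
The workaround looks like this (key values are placeholders):

    Configuration conf = new Configuration();
    conf.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY");
    conf.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_KEY");

Putting the credentials in the configuration rather than embedding them in the s3:// URI sidesteps the problem with special characters in secret keys.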

Re: All keys went to single reducer in WordCount program

2009-05-08 Thread Tom White
> mapred.reduce.tasks = 1 You've only got one reduce task, as Jason correctly surmised. Try setting it using -D mapred.reduce.tasks=2 when you run your job, or by calling JobConf#setNumReduceTasks(). Tom On Fri, May 8, 2009 at 7:46 AM, Foss User wrote: > On Thu, May 7, 2009 at 9:45 PM, jason

Re: About Hadoop optimizations

2009-05-07 Thread Tom White
On Thu, May 7, 2009 at 6:05 AM, Foss User wrote: > Thanks for your response again. I could not understand a few things in > your reply. So, I want to clarify them. Please find my questions > inline. > > On Thu, May 7, 2009 at 2:28 AM, Todd Lipcon wrote: >> On Wed, May 6, 2009 at 1:46 PM, Foss Use

Re: multi-line records and file splits

2009-05-06 Thread Tom White
Hi Rajarshi, FileInputFormat (SDFInputFormat's superclass) will break files into splits, typically on HDFS block boundaries (if the defaults are left unchanged). This is not a problem for your code however, since it will read every record that starts within a split (even if it crosses a split boun

Re: Using multiple FileSystems in hadoop input

2009-05-06 Thread Tom White
Hi Ivan, I haven't tried this combination, but I think it should work. If it doesn't it should be treated as a bug. Tom On Wed, May 6, 2009 at 11:46 AM, Ivan Balashov wrote: > Greetings to all, > > Could anyone suggest if Paths from different FileSystems can be used as > input of Hadoop job? >
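
For example (hostnames and paths are illustrative):

    JobConf conf = new JobConf(MyJob.class);
    FileInputFormat.addInputPath(conf, new Path("hdfs://namenode/user/ivan/logs"));
    FileInputFormat.addInputPath(conf, new Path("s3n://mybucket/more-logs"));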

Re: move tasks to another machine on the fly

2009-05-06 Thread Tom White
Hi David, The MapReduce framework will attempt to rerun failed tasks automatically. However, if a task is running out of memory on one machine, it's likely to run out of memory on another, isn't it? Have a look at the mapred.child.java.opts configuration property for the amount of memory that each

Re: large files vs many files

2009-05-06 Thread Tom White
Hi Sasha, As you say, HDFS appends are not yet working reliably enough to be suitable for production use. On the other hand, having lots of little files is bad for the namenode, and inefficient for MapReduce (see http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/), so it's best to av

Re: Specifying System Properties in the had

2009-04-30 Thread Tom White
Another way to do this would be to set a property in the Hadoop config itself. In the job launcher you would have something like: JobConf conf = ... conf.set("foo", "test"); Then you can read the property in your map or reduce task. Tom On Thu, Apr 30, 2009 at 3:25 PM, Aaron Kimball w
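
A sketch of both sides (class names are placeholders; imports from java.io and org.apache.hadoop.{io,mapred} omitted):

    // driver:
    JobConf conf = new JobConf(MyJob.class);
    conf.set("foo", "test");

    // mapper:
    public class MyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      private String foo;

      public void configure(JobConf job) {
        foo = job.get("foo"); // "test" in the example above
      }

      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        output.collect(new Text(foo), value);
      }
    }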

Re: Patching and bulding produces no libcordio or libhdfs

2009-04-28 Thread Tom White
Have a look at the instructions on http://wiki.apache.org/hadoop/HowToRelease under the "Building" section. It tells you which environment settings and Ant targets you need to set. Tom On Tue, Apr 28, 2009 at 9:09 AM, Sid123 wrote: > > HI I have applied a small patch for version 0.20 to my old 0

Re: How to run many jobs at the same time?

2009-04-22 Thread Tom White
, nguyenhuynh.mr wrote: > Tom White wrote: > >> You need to start each JobControl in its own thread so they can run >> concurrently. Something like: >> >>     Thread t = new Thread(jobControl); >>     t.start(); >> >> Then poll the jobControl.allFinished()

Re: How to run many jobs at the same time?

2009-04-21 Thread Tom White
You need to start each JobControl in its own thread so they can run concurrently. Something like: Thread t = new Thread(jobControl); t.start(); Then poll the jobControl.allFinished() method. Tom On Tue, Apr 21, 2009 at 10:02 AM, nguyenhuynh.mr wrote: > Hi all! > > > I have some jobs: j
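
Fleshed out slightly (group name and polling interval are arbitrary; exception handling omitted):

    import org.apache.hadoop.mapred.jobcontrol.Job;
    import org.apache.hadoop.mapred.jobcontrol.JobControl;

    JobControl jobControl = new JobControl("my-jobs");
    jobControl.addJob(new Job(jobConf1)); // one jobcontrol.Job per JobConf
    jobControl.addJob(new Job(jobConf2));

    Thread t = new Thread(jobControl);    // JobControl implements Runnable
    t.start();
    while (!jobControl.allFinished()) {
      Thread.sleep(1000);
    }
    jobControl.stop();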

Re: Interesting Hadoop/FUSE-DFS access patterns

2009-04-16 Thread Tom White
Not sure if it will affect your findings, but when you read from an FSDataInputStream you should check how many bytes were actually read by inspecting the return value, and re-read if it was fewer than you wanted. See Hadoop's IOUtils readFully() method. Tom On Mon, Apr 13, 2009 at 4:22 PM, Brian Bockelma
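
The contrast, in code (buffer size is arbitrary; IOUtils is org.apache.hadoop.io.IOUtils):

    byte[] buf = new byte[4096];
    int n = in.read(buf);                      // may return fewer bytes than requested
    IOUtils.readFully(in, buf, 0, buf.length); // loops until the buffer is full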

Re: Example of deploying jars through DistributedCache?

2009-04-08 Thread Tom White
Does it work if you use addArchiveToClassPath()? Also, it may be more convenient to use GenericOptionsParser's -libjars option. Tom On Mon, Mar 2, 2009 at 7:42 AM, Aaron Kimball wrote: > Hi all, > > I'm stumped as to how to use the distributed cache's classpath feature. I > have a library of Ja
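
The two alternatives (jar paths are illustrative; the first expects the jar to already be in HDFS):

    DistributedCache.addArchiveToClassPath(new Path("/lib/mylib.jar"), conf);

    hadoop jar myjob.jar MyJob -libjars mylib.jar input output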

Re: RecordReader design heuristic

2009-03-18 Thread Tom White
other format > that works better with MR. If anyone has any ideas on what file formats > works best for storing and processing large amounts of time series > points with MR, I'm all ears. We're moving towards a new philosophy wrt > big data so it's a good time for us to exami

Re: RecordReader design heuristic

2009-03-18 Thread Tom White
Hi Josh, The other aspect to think about when writing your own record reader is input splits. As Jeff mentioned you really want mappers to be processing about one HDFS block's worth of data. If your inputs are significantly smaller, the overhead of creating mappers will be high and your jobs will

Re: Problem with com.sun.pinkdots.LogHandler

2009-03-17 Thread Tom White
Hi Paul, Looking at the stack trace, the exception is being thrown from your map method. Can you put some debugging in there to diagnose it? Detecting and logging the size of the array and the index you are trying to access should help. You can write to standard error and look in the task logs. An

Re: Support for zipped input files

2009-03-10 Thread Tom White
Hi Ken, Unfortunately, Hadoop doesn't yet support MapReduce on zipped files (see https://issues.apache.org/jira/browse/HADOOP-1824), so you'll need to write a program to unzip them and write them into HDFS first. Cheers, Tom On Tue, Mar 10, 2009 at 4:11 AM, jason hadoop wrote: > Hadoop has supp

Re: Hadoop AMI for EC2

2009-03-05 Thread Tom White
Hi Richa, Yes there is. Please see http://wiki.apache.org/hadoop/AmazonEC2. Tom On Thu, Mar 5, 2009 at 4:13 PM, Richa Khandelwal wrote: > Hi All, > Is there an existing Hadoop AMI for EC2 which has Hadoop setup on it? > > Thanks, > Richa Khandelwal > > > University Of California, > Santa Cruz.

Re: contrib EC2 with hadoop 0.17

2009-03-05 Thread Tom White
I haven't used Eucalyptus, but you could start by trying out the Hadoop EC2 scripts (http://wiki.apache.org/hadoop/AmazonEC2) with your Eucalyptus installation. Cheers, Tom On Tue, Mar 3, 2009 at 2:51 PM, falcon164 wrote: > > I am new to hadoop. I want to run hadoop on eucalyptus. Please let me

Re: MapReduce jobs with expensive initialization

2009-03-02 Thread Tom White
On any particular tasktracker slot, task JVMs are shared only between tasks of the same job. When the job is complete the task JVM will go away. So there is certainly no sharing between jobs. I believe the static singleton approach outlined by Scott will work since the map classes are in a single
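
A sketch of that approach; ExpensiveResource and its lookup() method are hypothetical stand-ins for whatever is costly to initialize:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class MyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      // Initialized once per task JVM; shared across tasks of the same job
      // when JVM reuse is enabled, never across jobs.
      private static ExpensiveResource resource;

      private static synchronized ExpensiveResource getResource() {
        if (resource == null) {
          resource = new ExpensiveResource(); // the expensive one-time setup
        }
        return resource;
      }

      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // hypothetical use of the shared resource
        output.collect(new Text(getResource().lookup(value.toString())), value);
      }
    }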

Re: OutOfMemory error processing large amounts of gz files

2009-02-25 Thread Tom White
Do you experience the problem with and without native compression? Set hadoop.native.lib to false to disable native compression. Cheers, Tom On Tue, Feb 24, 2009 at 9:40 PM, Gordon Mohr wrote: > If you're doing a lot of gzip compression/decompression, you *might* be > hitting this 6+-year-old Su

Re: How do you remove a machine from the cluster? Slaves file not working...

2009-02-17 Thread Tom White
The decommission process is for data nodes - which you are not running. Have a look at the mapred.hosts.exclude property for how to exclude tasktrackers. Tom On Tue, Feb 17, 2009 at 5:31 PM, S D wrote: > Thanks for your response. For clarification, I'm using S3 Native instead of > HDFS. Hence, I

Re: Reporter for Hadoop Streaming?

2009-02-11 Thread Tom White
You can retrieve them from the command line using bin/hadoop job -counter <job-id> <group-name> <counter-name>. Tom On Wed, Feb 11, 2009 at 12:20 AM, scruffy323 wrote: > > Do you know how to access those counters programmatically after the job has > run? > > > S D-5 wrote: >> >> This does it. Thanks! >> >> On Thu, Feb 5, 2009 at
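
For example (the job id is a placeholder):

    hadoop job -counter job_200902110001_0001 \
        org.apache.hadoop.mapred.Task$Counter REDUCE_INPUT_RECORDS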

Re: can't read the SequenceFile correctly

2009-02-06 Thread Tom White
Hi Mark, Not all the bytes stored in a BytesWritable object are necessarily valid. Use BytesWritable#getLength() to determine how much of the buffer returned by BytesWritable#getBytes() to use. Tom On Fri, Feb 6, 2009 at 5:41 AM, Mark Kerzner wrote: > Hi, > > I have written binary files to a Se
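
For example (using java.util.Arrays):

    // value is the BytesWritable read from the SequenceFile
    byte[] valid = Arrays.copyOf(value.getBytes(), value.getLength());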

Re: Problem with Counters

2009-02-05 Thread Tom White
>> to be >>>> >> 0 always. >>>> >> >>>> >>RunningJob running = JobClient.runJob(conf); >>>> >> >>>> >> Counters ct = new Counters(); >>>> >> ct = runni

Re: Problem with Counters

2009-02-05 Thread Tom White
Hi Sharath, The code you posted looks right to me. Counters#getCounter() will return the counter's value. What error are you getting? Tom On Thu, Feb 5, 2009 at 10:09 AM, some speed wrote: > Hi, > > Can someone help me with the usage of counters please? I am incrementing a > counter in Reduce m
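
A working pattern for reference (the counter enum is illustrative); note that the Counters object is obtained from the RunningJob rather than constructed with new:

    enum MyCounter { PROCESSED }

    // in the reducer:
    reporter.incrCounter(MyCounter.PROCESSED, 1);

    // in the driver, after the job completes:
    RunningJob running = JobClient.runJob(conf);
    Counters ct = running.getCounters();
    long processed = ct.getCounter(MyCounter.PROCESSED);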

Re: hadoop to ftp files into hdfs

2009-02-03 Thread Tom White
NLineInputFormat is ideal for this purpose. Each split will be N lines of input (where N is configurable), so each mapper can retrieve N files for insertion into HDFS. You can set the number of reducers to zero. Tom On Tue, Feb 3, 2009 at 4:23 AM, jason hadoop wrote: > If you have a large number

Re: SequenceFiles, checkpoints, block size (Was: How to flush SequenceFile.Writer?)

2009-02-03 Thread Tom White
Hi Brian, Writes to HDFS are not guaranteed to be flushed until the file is closed. In practice, as each (64MB) block is finished it is flushed and will be visible to other readers, which is what you were seeing. The addition of appends in HDFS changes this and adds a sync() method to FSDataOutpu

Re: best way to copy all files from a file system to hdfs

2009-02-02 Thread Tom White
y, can multiple MapReduce workers read the same SequenceFile > simultaneously? > > On Mon, Feb 2, 2009 at 9:42 AM, Tom White wrote: > >> Is there any reason why it has to be a single SequenceFile? You could >> write a local program to write several block compressed Sequenc

Re: A record version mismatch occured. Expecting v6, found v32

2009-02-02 Thread Tom White
The SequenceFile format is described here: http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/SequenceFile.html. The format of the keys and values depends on the serialization classes used. For example, BytesWritable writes out the length of its byte array followed by the actual by

Re: best way to copy all files from a file system to hdfs

2009-02-02 Thread Tom White
Is there any reason why it has to be a single SequenceFile? You could write a local program to write several block compressed SequenceFiles in parallel (to HDFS), each containing a portion of the files on your PC. Tom On Mon, Feb 2, 2009 at 3:24 PM, Mark Kerzner wrote: > Truly, I do not see any
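
Each writer would be created along these lines (path and key/value types are illustrative):

    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf,
        new Path("/data/files-0001.seq"), Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);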

Re: MapFile.Reader and seek

2009-02-02 Thread Tom White
You can use the get() method to seek and retrieve the value. It will return null if the key is not in the map. Something like: Text value = (Text) indexReader.get(from, new Text()); while (value != null && ...) Tom On Thu, Jan 29, 2009 at 10:45 PM, schnitzi wrote: > > Greetings all... I have a
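
Expanded a little (path and keys are illustrative; this assumes the start key is actually present, since get() returns null for a missing key, and that get() leaves the reader positioned so next() carries on from the following entry):

    MapFile.Reader reader = new MapFile.Reader(fs, "/data/index", conf);
    Text key = new Text("from-key");
    Text value = new Text();
    if (reader.get(key, value) != null) {   // positions the reader at 'from-key'
      do {
        // process key and value here, stopping once key passes the end of the range
      } while (reader.next(key, value));
    }
    reader.close();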

Re: tools for scrubbing HDFS data nodes?

2009-01-29 Thread Tom White
Each datanode has a web page at http://datanode:50075/blockScannerReport where you can see details about the scans. Tom On Thu, Jan 29, 2009 at 7:29 AM, Raghu Angadi wrote: > Owen O'Malley wrote: >> >> On Jan 28, 2009, at 6:16 PM, Sriram Rao wrote: >> >>> By "scrub" I mean, have a tool that read

Re: Distributed cache testing in local mode

2009-01-23 Thread Tom White
It would be nice to make this more uniform. There's an outstanding Jira on this if anyone is interested in looking at it: https://issues.apache.org/jira/browse/HADOOP-2914 Tom On Fri, Jan 23, 2009 at 12:14 AM, Aaron Kimball wrote: > Hi Bhupesh, > > I've noticed the same problem -- LocalJobRunner

Re: Set the Order of the Keys in Reduce

2009-01-22 Thread Tom White
; I suppose this would accomplish the same thing? > > > > -Original Message- > From: Tom White [mailto:t...@cloudera.com] > Sent: Thursday, January 22, 2009 10:41 AM > To: core-user@hadoop.apache.org > Subject: Re: Set the Order of the Keys in Reduce > > Hi Brian, > > The

Re: Archive?

2009-01-22 Thread Tom White
Hi Mark, The archives are listed on http://wiki.apache.org/hadoop/MailingListArchives Tom On Thu, Jan 22, 2009 at 3:41 PM, Mark Kerzner wrote: > Hi, > is there an archive to the messages? I am a newcomer, granted, but google > groups has all the discussion capabilities, and it has a searchable

Re: Set the Order of the Keys in Reduce

2009-01-22 Thread Tom White
Hi Brian, The CAT_A and CAT_B keys will be processed by different reducer instances, so they run independently and may run in any order. What's the output that you're trying to get? Cheers, Tom On Thu, Jan 22, 2009 at 3:25 PM, Brian MacKay wrote: > Hello, > > > > Any tips would be greatly appre

Re: Why does Hadoop need ssh access to master and slaves?

2009-01-21 Thread Tom White
Hi Matthias, It is not necessary to have SSH set up to run Hadoop, but it does make things easier. SSH is used by the scripts in the bin directory which start and stop daemons across the cluster (the slave nodes are defined in the slaves file), see the start-all.sh script as a starting point. Thes

Re: @hadoop on twitter

2009-01-16 Thread Tom White
Thanks flip. I've signed up for the hadoop account - be great to get some help with getting it going. Tom On Wed, Jan 14, 2009 at 6:33 AM, Philip (flip) Kromer wrote: > Hey all, > There is no @hadoop on twitter, but there should be. > http://twitter.com/datamapper and http://twitter.com/rails b

Re: Re: getting null from CompressionCodecFactory.getCodec(Path file)

2009-01-14 Thread Tom White
LZO was removed due to license incompatibility: https://issues.apache.org/jira/browse/HADOOP-4874 Tom On Wed, Jan 14, 2009 at 11:18 AM, Gert Pfeifer wrote: > I got it. For some reason getDefaultExtension() returns ".lzo_deflate". > > Is that a bug? Shouldn't it be .lzo? > > In the head revision

Re: Problem with Hadoop and concatenated gzip files

2009-01-12 Thread Tom White
I've opened https://issues.apache.org/jira/browse/HADOOP-5014 for this. Do you get this behaviour when you use the native libraries? Tom On Sat, Jan 10, 2009 at 12:26 AM, Oscar Gothberg wrote: > Hi, > > I'm having trouble with Hadoop (tested with 0.17 and 0.19) not fully > processing certain g

Re: Concatenating PDF files

2009-01-05 Thread Tom White
Hi Richard, Are you running out of memory after many PDFs have been processed by one mapper, or during the first? The former would suggest that memory isn't being released; the latter that the task VM doesn't have enough memory to start with. Are you setting the memory available to map tasks by s

Re: Predefined counters

2008-12-22 Thread Tom White
Hi Jim, Try something like: Counters counters = job.getCounters(); counters.findCounter("org.apache.hadoop.mapred.Task$Counter", "REDUCE_INPUT_RECORDS").getCounter() The pre-defined counters are unfortunately not public and are not in one place in the source code, so you'll need to hunt to find
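
Put together, reading one of the predefined counters after the job finishes (sketch; the group and counter names are the ones quoted above):

    RunningJob job = JobClient.runJob(conf);
    Counters counters = job.getCounters();
    long reduceInputs = counters.findCounter(
        "org.apache.hadoop.mapred.Task$Counter", "REDUCE_INPUT_RECORDS").getCounter();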

Re: EC2 Usage?

2008-12-18 Thread Tom White
Hi Ryan, The ec2-describe-instances command in the API tool reports the launch time for each instance, so you could work out the machine hours of your cluster using that information. Tom On Thu, Dec 18, 2008 at 4:59 PM, Ryan LeCompte wrote: > Hello all, > > Somewhat of an off-topic related qu

Re: contrib/ec2 USER_DATA not used

2008-12-18 Thread Tom White
Hi Stefan, The USER_DATA line is a hangover from the way that these parameters used to be passed to the node. This line can safely be removed, since the scripts now pass the data in the USER_DATA_FILE as you rightly point out. Tom On Thu, Dec 18, 2008 at 10:09 AM, Stefan Groschupf wrote: > Hi,

Re: API Documentation question - WritableComparable

2008-12-16 Thread Tom White
I've opened https://issues.apache.org/jira/browse/HADOOP-4881 and attached a patch to fix this. Tom On Fri, Dec 12, 2008 at 2:18 AM, Tarandeep Singh wrote: > The example is just to illustrate how one should implement one's own > WritableComparable class and in the compreTo method, it is just sho

Re: When I system.out.println() in a map or reduce, where does it go?

2008-12-11 Thread Tom White
You can also see the logs from the web UI (http://<jobtracker-host>:50030 by default), by clicking through to the map or reduce task that you are interested in and looking at the page for task attempts. Tom On Wed, Dec 10, 2008 at 10:41 PM, Tarandeep Singh <[EMAIL PROTECTED]> wrote: > you can see the output in ha

Re: JobConf-to-XML

2008-12-03 Thread Tom White
There's a writeXml() method (or just write() in earlier releases) on Configuration which should do what you need. Also see Configuration's main() method. Tom On Wed, Dec 3, 2008 at 8:39 AM, Johannens Zillmann <[EMAIL PROTECTED]> wrote: > Hi everybody, > > does anybody know if there exists a tool
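
For example, to dump a job's configuration as XML (sketch):

    JobConf conf = new JobConf(MyJob.class);
    conf.writeXml(System.out);   // write(System.out) in earlier releases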

Re: Auto-shutdown for EC2 clusters

2008-11-26 Thread Tom White
I've just created a basic script to do something similar for running a benchmark on EC2. See https://issues.apache.org/jira/browse/HADOOP-4382. As it stands the code for detecting when Hadoop is ready to accept jobs is simplistic, to say the least, so any ideas for improvement would be great. Than

Google Terasort Benchmark

2008-11-22 Thread Tom White
From the Google Blog, http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html "We are excited to announce we were able to sort 1TB (stored on the Google File System as 10 billion 100-byte records in uncompressed text files) on 1,000 computers in 68 seconds. By comparison, the previ

Re: Hadoop Book

2008-09-16 Thread Tom White
> waiting for it!!! > > 2008/9/5, Owen O'Malley <[EMAIL PROTECTED]>: >> >> >> On Sep 4, 2008, at 6:36 AM, 叶双明 wrote: >> >> what book? >>> >> >> To summarize, Tom White is writing a book about Hadoop. He will post a >> message to the list when a draft is ready. >> >> -- Owen >

Re: Parameterized deserializers?

2008-09-12 Thread Tom White
If you make your Serialization implement Configurable it will be given a Configuration object that it can pass to the Deserializer on construction. Also, this thread may be related: http://www.nabble.com/Serialization-with-additional-schema-info-td19260579.html Tom On Sat, Sep 13, 2008 at 12:38

Re: Hadoop & EC2

2008-09-04 Thread Tom White
On Thu, Sep 4, 2008 at 1:46 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote: > I'm noticing that using bin/hadoop fs -put ... s3://... is uploading > multi-gigabyte files in ~64MB chunks. That's because S3Filesystem stores files as 64MB blocks on S3. > Then, when this is copied from > S3 into HDFS u

Re: EC2 AMI for Hadoop 0.18.0

2008-09-03 Thread Tom White
I've just created public AMIs for 0.18.0. Note that they are in the hadoop-images bucket. Tom On Fri, Aug 29, 2008 at 9:22 PM, Karl Anderson <[EMAIL PROTECTED]> wrote: > > On 29-Aug-08, at 6:49 AM, Stuart Sierra wrote: > >> Anybody have one? Any success building it with create-hadoop-image? >> T

Re: Hadoop & EC2

2008-09-03 Thread Tom White
ut it looks like a natural fit. > > Thanks! > > Ryan > > > On Wed, Sep 3, 2008 at 9:54 AM, Tom White <[EMAIL PROTECTED]> wrote: >> There's a case study with some numbers in it from a presentation I >> gave on Hadoop and AWS in London last month, whic

Re: Hadoop Book

2008-09-03 Thread Tom White
Lukáš, Feris, I'll be sure to post a message to the list when the book's available as a Rough Cut. Tom 2008/8/28 Feris Thia <[EMAIL PROTECTED]>: > Agree... > > I will be glad to be early notified about the release :) > > Regards, > > Feris > > 2008/8/29 Lukáš Vlček <[EMAIL PROTECTED]> > >> Tom, >

Re: Hadoop & EC2

2008-09-03 Thread Tom White
There's a case study with some numbers in it from a presentation I gave on Hadoop and AWS in London last month, which you may find interesting: http://skillsmatter.com/custom/presentations/ec2-talk.pdf. tim robertson <[EMAIL PROTECTED]> wrote: > For these small > datasets, you might find it useful

Re: Error while uploading large file to S3 via Hadoop 0.18

2008-09-03 Thread Tom White
For the s3:// filesystem, files are split into 64MB blocks which are sent to S3 individually. Rather than increase the jets3t.properties retry buffer and retry count, it is better to change the Hadoop properties fs.s3.maxRetries and fs.s3.sleepTimeSeconds, since the Hadoop-level retry mechanism ret
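
For example, in the job configuration (values are illustrative):

    conf.setInt("fs.s3.maxRetries", 10);
    conf.setInt("fs.s3.sleepTimeSeconds", 30);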

Re: Reading and writing Thrift data from MapReduce

2008-09-03 Thread Tom White
Hi Juho, I think you should be able to use the Thrift serialization stuff that I've been working on in https://issues.apache.org/jira/browse/HADOOP-3787 - at least as a basis. Since you are not using sequence files, you will need to write an InputFormat (probably one that extends FileInputFormat)

Re: Hadoop Book

2008-08-28 Thread Tom White
tter.com/custom/presentations/ec2-talk.pdf) > that Tom White is working on Hadoop book now. > > Lukas > > 2008/8/26 Feris Thia <[EMAIL PROTECTED]> > >> Hi Lukas, >> >> I've check on Youtube.. and yes, there are many explanations on Hadoop. >> >

Re: Namenode Exceptions with S3

2008-07-17 Thread Tom White
On Thu, Jul 17, 2008 at 6:16 PM, Doug Cutting <[EMAIL PROTECTED]> wrote: > Can't one work around this by using a different configuration on the client > than on the namenodes and datanodes? The client should be able to set > fs.default.name to an s3: uri, while the namenode and datanode must have

Re: Namenode Exceptions with S3

2008-07-11 Thread Tom White
On Fri, Jul 11, 2008 at 9:09 PM, slitz <[EMAIL PROTECTED]> wrote: > a) Use S3 only, without HDFS and configuring fs.default.name as s3://bucket > -> PROBLEM: we are getting ERROR org.apache.hadoop.dfs.NameNode: > java.lang.RuntimeException: Not a host:port pair: X What command are you using t

Re: Namenode Exceptions with S3

2008-07-11 Thread Tom White
On Thu, Jul 10, 2008 at 10:06 PM, Lincoln Ritter <[EMAIL PROTECTED]> wrote: > Thank you, Tom. > > Forgive me for being dense, but I don't understand your reply: > Sorry! I'll try to explain it better (see below). > > Do you mean that it is possible to use the Hadoop daemons with S3 but > the defa

Re: Namenode Exceptions with S3

2008-07-10 Thread Tom White
> I get (where the all-caps portions are the actual values...): > > 2008-07-01 19:05:17,540 ERROR org.apache.hadoop.dfs.NameNode: > java.lang.NumberFormatException: For input string: > "[EMAIL PROTECTED]" >at > java.lang.NumberFormatException.forInputString(NumberFormatException.java:48) >

Re: Hadoop on EC2 + S3 - best practice?

2008-07-01 Thread Tom White
Hi Tim, The steps you outline look about right. Because your file is >5GB you will need to use the S3 block file system, which has an s3 URL. (See http://wiki.apache.org/hadoop/AmazonS3) You shouldn't have to build your own AMI unless you have dependencies that can't be submitted as a part of the M

Tasktrackers job cache directories not always cleaned up

2008-07-01 Thread Tom White
The task subdirectories are being deleted, but the job directory and its work subdirectory are not. This is causing a problem since disk space is filling up over time, and restarting the cluster after a long time is very slow as the tasktrackers clear out the jobcache directories. This doesn't hap

Re: hadoop on Solaris

2008-06-17 Thread Tom White
I've successfully run Hadoop on Solaris 5.10 (on Intel). The path included /usr/ucb so whoami was picked up correctly. Satoshi, you say you added /usr/ucb to your path too, so I'm puzzled why you get a LoginException saying "whoami: not found" - did you export your changes to the path? I've also manag

Re: distcp/ls fails on Hadoop-0.17.0 on ec2.

2008-05-31 Thread Tom White
Hi Einar, How did you put the data onto S3, using Hadoop's S3 FileSystem or using other S3 tools? If it's the latter then it won't work as the s3 scheme is for Hadoop's block-based S3 storage. Native S3 support is coming - see https://issues.apache.org/jira/browse/HADOOP-930, but it's not integrat

Re: Hadoop 0.17 AMI?

2008-05-22 Thread Tom White
Hi Jeff, I've built two public 0.17.0 AMIs (32-bit and 64-bit), so you should be able to use the 0.17 scripts to launch them now. Cheers, Tom On Thu, May 22, 2008 at 6:37 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote: > Hi Jeff, > > 0.17.0 was released yesterday, from what I can tell. > > > Oti

Re: Hadoop 0.17 AMI?

2008-05-14 Thread Tom White
Hi Jeff, There is no public 0.17 AMI yet - we need 0.17 to be released first. So in the meantime you'll have to build your own. Tom On Wed, May 14, 2008 at 8:36 PM, Jeff Eastman <[EMAIL PROTECTED]> wrote: > I'm trying to bring up a cluster on EC2 using > (http://wiki.apache.org/hadoop/AmazonEC2)
