Re: Configuration clone constructor not cloning classloader

2013-04-19 Thread Tom White
Hi Amit, It is a bug, fixed by https://issues.apache.org/jira/browse/HADOOP-6103, although the fix never made it into branch-1. Can you create a branch-1 patch for this please? Thanks, Tom On Thu, Apr 18, 2013 at 4:09 AM, Amit Sela am...@infolinks.com wrote: Hi all, I was wondering if there

Re: Problem running Hadoop 0.23.0

2011-11-28 Thread Tom White
Hi Nitin, It looks like you may be using the wrong port number - try 8088 for the resource manager UI. Cheers, Tom On Mon, Nov 28, 2011 at 4:02 AM, Nitin Khandelwal nitin.khandel...@germinait.com wrote: Hi, I was trying to setup Hadoop 0.23.0 with help of

Re: Skipping Bad Records

2011-10-13 Thread Tom White
Justin, The skipping feature should really only be used when you are calling out to a third-party library that may segfault on corrupt data, and even then it's probably better to use a subprocess to handle it, as Owen suggested here:

Re: cannot use distcp in some s3 buckets

2011-10-13 Thread Tom White
On Thu, Oct 13, 2011 at 2:06 PM, Raimon Bosch raimon.bo...@gmail.com wrote: By the way, the URL I'm trying has a '_' in the bucket name. Could this be the problem? Yes, underscores are not permitted in hostnames. Cheers, Tom 2011/10/13 Raimon Bosch raimon.bo...@gmail.com Hi, I've been

Re: updated example

2011-10-11 Thread Tom White
JobConf and the old API are no longer deprecated in the forthcoming 0.20.205 release, so you can continue to use it without issue. The equivalent in the new API is setInputFormatClass() on org.apache.hadoop.mapreduce.Job. Cheers, Tom On Tue, Oct 11, 2011 at 9:18 AM, Keith Thompson

Re: Distributed cluster filesystem on EC2

2011-08-31 Thread Tom White
You might consider Apache Whirr (http://whirr.apache.org/) for bringing up Hadoop clusters on EC2. Cheers, Tom On Wed, Aug 31, 2011 at 8:22 AM, Robert Evans ev...@yahoo-inc.com wrote: Dmitry, It sounds like an interesting idea, but I have not really heard of anyone doing it before.  It

Re: LocalJobRunner and # of reducers

2011-05-02 Thread Tom White
See also https://issues.apache.org/jira/browse/MAPREDUCE-434 which has a patch for this issue. Cheers, Tom On Mon, May 2, 2011 at 5:13 PM, jason urg...@gmail.com wrote: I am attaching the originals so you could figure out the diffs on your own :) On 5/2/11, Dmitriy Lyubimov dlie...@gmail.com

Re: 0.21.0 - Java Class Error

2011-04-08 Thread Tom White
Hi Witold, Is this on Windows? The scripts were re-structured after Hadoop 0.20, and looking at them now I notice that the cygwin path translation for the classpath seems to be missing. You could try adding the following line to the if $cygwin clause in bin/hadoop-config.sh: CLASSPATH=`cygpath

Re: hadoop installation problem(single-node)

2011-03-02 Thread Tom White
The instructions at http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html should be what you need. Cheers, Tom On Wed, Mar 2, 2011 at 12:59 AM, Manish Yadav manish.ya...@orkash.com wrote: Dear Sir/Madam  I'm very new to hadoop. I'm trying to install hadoop on my computer. I followed a

Re: Missing files in the trunk ??

2011-02-28 Thread Tom White
These files are generated files. If you run ant avro-generate eclipse then Eclipse should find these files. Cheers, Tom On Mon, Feb 28, 2011 at 2:43 AM, bharath vissapragada bharathvissapragada1...@gmail.com wrote: Hi all, I checked out the map-reduce trunk a few days back  and following

Re: 0.21 found interface but class was expected

2010-11-15 Thread Tom White
Hi Steve, Sorry to hear about the problems you had. The issue you hit was a result of MAPREDUCE-954, and there was some discussion on that JIRA about compatibility. I believe the thinking was that the context classes are framework classes, so users don't extend/implement them in the normal course

Re: How to stop a mapper within a map-reduce job when you detect bad input

2010-10-21 Thread Tom White
On Thu, Oct 21, 2010 at 8:23 AM, ed hadoopn...@gmail.com wrote: Hello, The MapRunner class looks promising.  I noticed it is in the deprecated mapred package but I didn't see an equivalent class in the mapreduce package.  Is this going to be ported to mapreduce or is it no longer being

Re: Gzipped input files

2010-10-08 Thread Tom White
It's done by the RecordReader. For text-based input formats, which use LineRecordReader, decompression is carried out automatically. For others it's not (e.g. sequence files which have internal compression). So it depends on what your custom input format does. Cheers, Tom On Fri, Oct 8, 2010 at

Re: Too large class path for map reduce jobs

2010-10-05 Thread Tom White
Hi Henning, I don't know if you've seen https://issues.apache.org/jira/browse/MAPREDUCE-1938 and https://issues.apache.org/jira/browse/MAPREDUCE-1700 which have discussion about this issue. Cheers Tom On Fri, Sep 24, 2010 at 3:41 AM, Henning Blohm henning.bl...@zfabrik.de wrote: Short update

Re: JobClient using deprecated JobConf

2010-09-23 Thread Tom White
not find any tutorial or examples anywhere. Martin On 22.09.2010 18:29, Tom White wrote: Note that JobClient, along with the rest of the old API in org.apache.hadoop.mapred, has been undeprecated in Hadoop 0.21.0 so you can continue to use it without warnings. Tom On Wed, Sep 22, 2010 at 2

Re: 0.21.0 API

2010-09-22 Thread Tom White
the Tool interface is located. Could this be the problem? I am a little clueless here and not sure whether this is a problem that should be further addressed in this mailing list. Thanks in advance, Martin On 22.09.2010 16:08, Tom White wrote: Hi Martin, Neither Tool nor ToolRunner

Re: start-{dfs,mapred}.sh Hadoop common not found

2010-09-22 Thread Tom White
Hi Martin, This is a known bug, see https://issues.apache.org/jira/browse/HADOOP-6953. Cheers Tom On Wed, Sep 22, 2010 at 8:17 AM, Martin Becker _martinbec...@web.de wrote:  Hi, I am using Hadoop MapReduce 0.21.0. The usual process of starting Hadoop/HDFS/MapReduce was to use the

Re: JobClient using deprecated JobConf

2010-09-22 Thread Tom White
dar...@darose.net wrote: Hmmm.  Any idea as to why the undeprecation?  I thought the intention was to try to move everybody to the new API.  Why the reversal? Thanks, DR On 09/22/2010 12:29 PM, Tom White wrote: Note that JobClient, along with the rest of the old API

Re: JobClient using deprecated JobConf

2010-09-22 Thread Tom White
On 22.09.2010 18:29, Tom White wrote: Note that JobClient, along with the rest of the old API in org.apache.hadoop.mapred, has been undeprecated in Hadoop 0.21.0 so you can continue to use it without warnings. Tom On Wed, Sep 22, 2010 at 2:43 AM, Amareshwari Sri Ramadasu amar...@yahoo-inc.com  wrote

Re: Exception in thread main java.lang.NoClassDefFoundError: org/apache/hadoop/fs/FsShell

2010-09-15 Thread Tom White
Hi Mike, What do you get if you type ./hadoop classpath? Does it contain the Hadoop common JAR? To avoid the deprecation warning you should use hadoop fs, not hadoop dfs. Tom On Wed, Sep 15, 2010 at 12:53 PM, Mike Franon kongfra...@gmail.com wrote: Hi, I just setup 3 node hadoop cluster

Re: Hadoop 0.21.0 release Maven repo

2010-09-10 Thread Tom White
Hi Sonal, The 0.21.0 jars are not available in Maven yet, since the process for publishing them post split has changed. See HDFS-1292 and MAPREDUCE-1929. Cheers, Tom On Fri, Sep 10, 2010 at 1:33 PM, Sonal Goyal sonalgoy...@gmail.com wrote: Hi, Can someone please point me to the Maven repo

Re: Ivy

2010-09-03 Thread Tom White
The 0.21.0 jars are not in the Apache Maven repos yet, since the process for publishing them post split has changed. HDFS-1292 and MAPREDUCE-1929 are the tickets to fix this. Cheers, Tom On Sat, Aug 28, 2010 at 9:10 PM, Mark static.void@gmail.com wrote:  On 8/27/10 9:25 AM, Owen O'Malley

[ANNOUNCE] Apache Hadoop 0.21.0 released

2010-08-24 Thread Tom White
Hi everyone, I am pleased to announce that Apache Hadoop 0.21.0 is available for download from http://hadoop.apache.org/common/releases.html. Over 1300 issues have been addressed since 0.20.2; you can find details at http://hadoop.apache.org/common/docs/r0.21.0/releasenotes.html

Re: Null mapper?

2010-08-18 Thread Tom White
On Mon, Aug 16, 2010 at 3:21 PM, David Rosenstrauch dar...@darose.net wrote: On 08/16/2010 05:48 PM, Ted Yu wrote: No. On Mon, Aug 16, 2010 at 1:25 PM, David Rosenstrauch dar...@darose.net wrote: Is it possible for a M/R job to have no mapper?  i.e.: job.setMapperClass(null)?  Or is it

Re: Implementing S3FileSystem#append

2010-08-12 Thread Tom White
Hi Oleg, I don't know of any plans to implement this. However, since this is a block-based storage system which uses S3, I wonder whether an implementation could use some of the logic in HDFS for block storage and append in general. Cheers, Tom On Thu, Aug 12, 2010 at 8:34 AM, Aleshko, Oleg

Re: Hadoop 0.21 :: job.getCounters() returns null?

2010-07-07 Thread Tom White
Hi Felix, Aaron Kimball hit the same problem - it's being discussed at https://issues.apache.org/jira/browse/MAPREDUCE-1920. Thanks for reporting this. Cheers, Tom On Tue, Jul 6, 2010 at 11:26 AM, Felix Halim felix.ha...@gmail.com wrote: I tried hadoop 0.21 release candidate.

Re: Next Release of Hadoop version number and Kerberos

2010-07-07 Thread Tom White
Hi Ananth, The next release of Hadoop will be 0.21.0, but it won't have Kerberos authentication in it (since it's not all in trunk yet). The 0.22.0 release later this year will have a working version of security in it. Cheers, Tom On Wed, Jul 7, 2010 at 8:09 AM, Ananth Sarathy

Re: Cloudera EC2 scripts

2010-05-28 Thread Tom White
Hi Mark, You can find the latest version of the scripts at http://archive.cloudera.com/cdh/3/hadoop-0.20.2+228.tar.gz. Documentation is at http://archive.cloudera.com/docs/ec2.html. The source code is currently in src/contrib/cloud in Hadoop Common, but is in the process of moving to a new

Re: problem w/ data load

2010-05-03 Thread Tom White
Hi Susanne, Hadoop uses the file extension to detect that a file is compressed. I believe Hive does too. Did you store the compressed file in HDFS with a .gz extension? Cheers, Tom BTW It's best to send Hive questions like these to the hive-user@ list. On Sun, May 2, 2010 at 11:22 AM, Susanne

Re: conf.get(map.input.file) returns null when using MultipleInputs in Hadoop 0.20

2010-04-29 Thread Tom White
Hi Yuanyuan, I think you've found a bug - could you file a JIRA issue for this please? Thanks, Tom On Wed, Apr 28, 2010 at 11:04 PM, Yuanyuan Tian yt...@us.ibm.com wrote: I have a problem in getting the input file name in the mapper when using MultipleInputs. I need to use MultipleInputs

Re: conf.get(map.input.file) returns null when using MultipleInputs in Hadoop 0.20

2010-04-29 Thread Tom White
mapper)? Yuanyuan Tom White ---04/29/2010 09:42:44 AM---Hi Yuanyuan, I think you've found a bug - could you file a JIRA issue for this please? From: Tom White t...@cloudera.com To: common-user@hadoop.apache.org Date: 04/29/2010 09:42 AM Subject: Re: conf.get(map.input.file) returns

Re: File permissions on S3FileSystem

2010-04-22 Thread Tom White
Hi Danny, S3FileSystem has no concept of permissions, which is why this check fails. The change that introduced the permissions check was introduced in https://issues.apache.org/jira/browse/MAPREDUCE-181. Could you file a bug for this please? Cheers, Tom On Thu, Apr 22, 2010 at 4:16 AM, Danny

Re: JobConf.setJobEndNotificationURI

2010-03-23 Thread Tom White
I think you can set the URI on the configuration object with the key JobContext.END_NOTIFICATION_URL. Cheers, Tom On Tue, Feb 23, 2010 at 12:02 PM, Ted Yu yuzhih...@gmail.com wrote: Hi, I am looking for counterpart to JobConf.setJobEndNotificationURI() in org.apache.hadoop.mapreduce Please

Re: Cloudera AMIs

2010-03-15 Thread Tom White
Hi Sonal, You should use the one with the later date. The Cloudera AMIs don't actually have Hadoop installed on them, just Java and some other base packages. Hadoop is installed at start up time; you can find more information at http://archive.cloudera.com/docs/ec2.html. Cheers, Tom P.S. For

Re: Is it possible to share a key across maps?

2010-01-14 Thread Tom White
Please submit a patch for the documentation change - perhaps at https://issues.apache.org/jira/browse/HADOOP-5973. Cheers, Tom On Wed, Jan 13, 2010 at 12:09 AM, Amogh Vasekar am...@yahoo-inc.com wrote: +1 for the documentation change in mapred-tutorial. Can we do that and publish using a

Re: Implementing VectorWritable

2009-12-29 Thread Tom White
Have a look at org.apache.hadoop.io.ArrayWritable. You may be able to use this class in your application, or at least use it as a basis for writing VectorWritable. Cheers, Tom On Tue, Dec 29, 2009 at 1:37 AM, bharath v bharathvissapragada1...@gmail.com wrote: Can you please tell me , what is

Re: Configuration for Hadoop running on Amazon S3

2009-12-17 Thread Tom White
If you are using S3 as your file store then you don't need to run HDFS (and indeed HDFS will not start up if you try). Cheers, Tom 2009/12/17 Rekha Joshi rekha...@yahoo-inc.com: Not sure what the whole error is, but you can always alternatively try this - property  namefs.default.name/name  

Re: Master and slaves on hadoop/ec2

2009-11-25 Thread Tom White
Correct. The master runs the namenode and jobtracker, but not a datanode or tasktracker. Tom On Tue, Nov 24, 2009 at 4:57 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, do I understand it correctly that, when I launch a Hadoop cluster on EC2, the master will not be doing any work, and it

Re: How do I reference S3 from an EC2 Hadoop cluster?

2009-11-25 Thread Tom White
that of the slaves? No, this is not supported, but I can see it would be useful, particularly for larger clusters. Please consider opening a JIRA for it. Cheers, Tom Thank you, Mark On Tue, Nov 24, 2009 at 11:20 PM, Tom White t...@cloudera.com wrote: Mark, If the data was transferred to S3

Re: How do I reference S3 from an EC2 Hadoop cluster?

2009-11-24 Thread Tom White
Mark, If the data was transferred to S3 outside of Hadoop then you should use the s3n filesystem scheme (see the explanation on http://wiki.apache.org/hadoop/AmazonS3 for the differences between the Hadoop S3 filesystems). Also, some people have had problems embedding the secret key in the URI,

Re: Apache Hadoop and Fedora, or Clouder Hadoop and Ubuntu?

2009-11-15 Thread Tom White
Hi Mark, HADOOP-6108 will add Cloudera's EC2 scripts to the Apache distribution, with the difference that they will run Apache Hadoop. The same scripts will also support Cloudera's Distribution for Hadoop, simply by using a different boot script on the instances. So I would suggest you use these

Re: Apache Hadoop and Fedora, or Clouder Hadoop and Ubuntu?

2009-11-15 Thread Tom White
, Mark On Sun, Nov 15, 2009 at 10:29 PM, Tom White t...@cloudera.com wrote: Hi Mark, HADOOP-6108 will add Cloudera's EC2 scripts to the Apache distribution, with the difference that they will run Apache Hadoop. The same scripts will also support Cloudera's Distribution for Hadoop, simply

Re: Confused by new API MultipleOutputFormats using Hadoop 0.20.1

2009-11-08 Thread Tom White
Multiple outputs has been ported to the new API in 0.21. See https://issues.apache.org/jira/browse/MAPREDUCE-370. Cheers, Tom On Sat, Nov 7, 2009 at 6:45 AM, Xiance SI(司宪策) adam...@gmail.com wrote: I just fall back to old mapred.* APIs, seems MultipleOutputs only works for the old API.

Re: Multiple Input Paths

2009-11-08 Thread Tom White
MultipleInputs is available from Hadoop 0.19 onwards (in org.apache.hadoop.mapred.lib, or org.apache.hadoop.mapreduce.lib.input for the new API in later versions). Tom On Wed, Nov 4, 2009 at 8:07 AM, Mark Vigeant mark.vige...@riskmetrics.com wrote: Amogh, That sounds so awesome! Yeah I wish I

Re: Terminate Instances Terminating ALL EC2 Instances

2009-10-19 Thread Tom White
Hi Mark, Sorry to hear that all your EC2 instances were terminated. Needless to say, this should certainly not happen. The scripts are a Python rewrite (see HADOOP-6108) of the bash ones so HADOOP-1504 is not applicable, but the behaviour should be the same: the terminate-cluster command lists

Re: Terminate Instances Terminating ALL EC2 Instances

2009-10-19 Thread Tom White
the Hadoop cluster default, and make sure that you don't create non-Hadoop EC2 instances in the cluster group. Thanks, Tom Does this help at all?  Thanks. -Mark On Mon, Oct 19, 2009 at 11:52 AM, Tom White t...@cloudera.com wrote: Hi Mark, Sorry to hear that all your EC2 instances were terminated

Re: JobTracker startup failure when starting hadoop-0.20.0 cluster on Amazon EC2 with contrib/ec2 scripts

2009-09-07 Thread Tom White
Hi Jeyendran, Were there any errors reported in the datanode logs? There could be a problem with datanodes contacting the namenode, caused by firewall configuration problems (EC2 security groups). Cheers, Tom On Fri, Sep 4, 2009 at 12:17 AM, Jeyendran

Re: Can't find TestDFSIO

2009-08-24 Thread Tom White
Hi Cam, Looks like it's in hadoop-hdfs-hdfswithmr-test-0.21.0-dev.jar, which should be built with ant jar-test. Cheers, Tom On Mon, Aug 24, 2009 at 8:22 PM, Cam Macdonell c...@cs.ualberta.ca wrote: Thanks Danny, It currently does not show up hadoop-common-test, hadoop-hdfs-test or

Re: File Chunk to Map Thread Association

2009-08-20 Thread Tom White
Hi Roman, Have a look at CombineFileInputFormat - it might be related to what you are trying to do. Cheers, Tom On Thu, Aug 20, 2009 at 10:59 AM, roman kolcun roman.w...@gmail.com wrote: On Thu, Aug 20, 2009 at 10:30 AM, Harish Mallipeddi harish.mallipe...@gmail.com wrote: On Thu, Aug 20,

Re: MapFile performance

2009-08-03 Thread Tom White
On Mon, Aug 3, 2009 at 3:09 AM, Billy Pearson billy_pear...@sbcglobal.net wrote: not sure if its still there but there was a parm in the hadoop-site conf file that would allow you to skip x number of index entries when reading it into memory. This is io.map.index.skip (default 0), which will skip

Re: Status of 0.19.2

2009-08-03 Thread Tom White
I've now updated the news section, and the documentation on the website to reflect the 0.19.2 release. There were several reports of it being more stable than 0.19.1 in the voting thread: http://www.mail-archive.com/common-...@hadoop.apache.org/msg00051.html Cheers, Tom On Tue, Jul 28, 2009 at

Re: Reading GZIP input files.

2009-07-31 Thread Tom White
That's for the case where you want to do the decompression yourself, explicitly, perhaps when you are reading the data out of HDFS (and not using MapReduce). When using compressed data as input to a MapReduce job, Hadoop will automatically decompress them for you. Tom On Fri, Jul 31, 2009 at

Re: Using JobControl in hadoop

2009-07-17 Thread Tom White
Hi Raakhi, JobControl is designed to be run from a new thread: Thread t = new Thread(jobControl); t.start(); Then you can run a loop to poll for job completion and print out status: String oldStatus = null; while (!jobControl.allFinished()) { String status =
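The code in the reply is cut off; here is a runnable sketch of the same start-then-poll pattern. FakeJobControl is a hypothetical stand-in, used only so the example runs without Hadoop on the classpath; it mimics the run()/allFinished() contract of org.apache.hadoop.mapred.jobcontrol.JobControl.

```java
// Stand-in for JobControl: runs until its (pretend) jobs are done.
// With real Hadoop you would construct a JobControl and add Job objects.
class FakeJobControl implements Runnable {
    private volatile int remaining = 3;          // pretend there are 3 jobs

    public boolean allFinished() {
        return remaining == 0;
    }

    public void run() {
        while (remaining > 0) {                  // each pass "completes" one job
            remaining--;
            try { Thread.sleep(10); } catch (InterruptedException e) { return; }
        }
    }
}

public class JobControlDemo {
    public static void main(String[] args) throws Exception {
        FakeJobControl jobControl = new FakeJobControl();
        Thread t = new Thread(jobControl);       // JobControl runs in its own thread
        t.start();
        while (!jobControl.allFinished()) {      // poll for completion, as in the reply
            Thread.sleep(5);                     // a real loop might print job statuses here
        }
        t.join();
        System.out.println("all jobs finished");
    }
}
```

The key point from the reply is that JobControl's run() method blocks, so it must live on its own thread while the main thread polls allFinished().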

Re: access Configuration object in Partioner??

2009-07-14 Thread Tom White
Hi Jianmin, Partitioner extends JobConfigurable, so you can implement the configure() method to access the JobConf. Hope that helps. Cheers, Tom On Tue, Jul 14, 2009 at 10:27 AM, Jianmin Woo jianmin_...@yahoo.com wrote: Hi, I am considering to implement a Partitioner that needs to access the

Re: more than one reducer in standalone mode

2009-07-14 Thread Tom White
There's a Jira to fix this here: https://issues.apache.org/jira/browse/MAPREDUCE-434 Tom On Mon, Jul 13, 2009 at 12:34 AM, jason hadoop jason.had...@gmail.com wrote: If the jobtracker is set to local, there is no way to have more than 1 reducer. On Sun, Jul 12, 2009 at 12:21 PM, Rares Vernica

Re: access Configuration object in Partioner??

2009-07-14 Thread Tom White
in 0.20. It seems that the org.apache.hadoop.mapred.Partitioner is deprecated and will be removed in the future. Do you have some suggestions on this? Thanks, Jianmin From: Tom White t...@cloudera.com To: common-user@hadoop.apache.org Sent: Tuesday, July

Re: Restarting a killed job from where it left

2009-07-13 Thread Tom White
Hi Akhil, Have a look at the mapred.jobtracker.restart.recover property. Cheers, Tom On Sun, Jul 12, 2009 at 12:06 AM, akhil1988 akhilan...@gmail.com wrote: HI All, I am looking for ways to restart my hadoop job from where it left off when the entire cluster goes down or the job gets stopped

Re: Problem with setting up the cluster

2009-06-25 Thread Tom White
Have a look at the datanode log files on the datanode machines and see what the error is in there. Cheers, Tom On Thu, Jun 25, 2009 at 6:21 AM, .ke. sivakumar kesivaku...@gmail.com wrote: Hi all, I'm a student and I have been trying to set up the hadoop cluster for a while but have been

Re: Unable to run Jar file in Hadoop.

2009-06-25 Thread Tom White
Hi Krishna, You get this error when the jar file cannot be found. It looks like /user/hadoop/hadoop-0.18.0-examples.jar is an HDFS path, when in fact it should be a local path. Cheers, Tom On Thu, Jun 25, 2009 at 9:43 AM, krishna prasanna svk_prasa...@yahoo.com wrote: Oh! thanks Shravan

Re: Rebalancing Hadoop Cluster running 15.3

2009-06-25 Thread Tom White
Hi Usman, Before the rebalancer was introduced one trick people used was to increase the replication on all the files in the system, wait for re-replication to complete, then decrease the replication to the original level. You can do this using hadoop fs -setrep. Cheers, Tom On Thu, Jun 25,
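As a sketch, the raise-then-lower trick above with hadoop fs -setrep (the factors 3 and 5 are illustrative; -R recurses and -w waits for re-replication to complete):

```shell
# Raise replication on everything and wait for the extra copies to land
hadoop fs -setrep -R -w 5 /
# Drop back to the original factor; the surplus replicas that get removed
# tend to come off the fuller nodes, evening out disk usage
hadoop fs -setrep -R 3 /
```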

Re: Rebalancing Hadoop Cluster running 15.3

2009-06-25 Thread Tom White
You can change the value of hadoop.root.logger in conf/log4j.properties to change the log level globally. See also the section Custom Logging levels in the same file to set levels on a per-component basis. You can also use hadoop daemonlog to set log levels on a temporary basis (they are reset on
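For illustration, the two mechanisms mentioned above (hadoop.root.logger is the property name from that era's log4j.properties; the NameNode class name and web port are examples and vary by version):

```shell
# conf/log4j.properties -- permanent, global default log level:
#   hadoop.root.logger=DEBUG,console
#
# Temporary, per-component change via the daemon's web port
# (reverts when the daemon restarts):
hadoop daemonlog -setlevel namenode-host:50070 org.apache.hadoop.dfs.NameNode DEBUG
hadoop daemonlog -getlevel namenode-host:50070 org.apache.hadoop.dfs.NameNode
```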

Re: HDFS Safemode and EC2 EBS?

2009-06-25 Thread Tom White
Hi Chris, You should really start all the slave nodes to be sure that you don't lose data. If you start fewer than #nodes - #replication + 1 nodes then you are virtually guaranteed to lose blocks. Starting 6 nodes out of 10 will cause the filesystem to remain in safe mode, as you've seen. BTW

Re: EC2, Hadoop, copy file from CLUSTER_MASTER to CLUSTER, failing

2009-06-24 Thread Tom White
Hi Saptarshi, The group permissions open the firewall ports to enable access, but there are no shared keys on the cluster by default. See https://issues.apache.org/jira/browse/HADOOP-4131 for a patch to the scripts that shares keys to allow SSH access between machines in the cluster. Cheers, Tom

Re: Looking for correct way to implements WritableComparable in Hadoop-0.17

2009-06-24 Thread Tom White
Hi Kun, The book's code is for 0.20.0. In Hadoop 0.17.x WritableComparable was not generic, so you need a declaration like: public class IntPair implements WritableComparable { } And the compareTo() method should look like this: public int compareTo(Object o) { IntPair ip = (IntPair) o;
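The code in the reply is truncated; below is a self-contained sketch of that 0.17-style comparison. Raw (non-generic) java.lang.Comparable stands in for WritableComparable so the example compiles without Hadoop; the real class would also implement write() and readFields().

```java
// Sketch of the non-generic compareTo() the reply describes: the argument
// arrives as Object and must be cast, as required before WritableComparable
// became generic in 0.20.
public class IntPair implements Comparable {
    private int first;
    private int second;

    public IntPair(int first, int second) {
        this.first = first;
        this.second = second;
    }

    public int compareTo(Object o) {
        IntPair ip = (IntPair) o;                 // cast, since the API is not generic
        int cmp = Integer.compare(first, ip.first);
        if (cmp != 0) {
            return cmp;
        }
        return Integer.compare(second, ip.second); // tie-break on the second int
    }
}
```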

Re: Is it possible? I want to group data blocks.

2009-06-24 Thread Tom White
You might be interested in https://issues.apache.org/jira/browse/HDFS-385, where there is discussion about how to add pluggable block placement to HDFS. Cheers, Tom On Tue, Jun 23, 2009 at 5:50 PM, Alex Loddengaard a...@cloudera.com wrote: Hi Hyunsik, Unfortunately you can't control the

Re: Running Hadoop/Hbase in a OSGi container

2009-06-11 Thread Tom White
Hi Ninad, I don't know if anyone has looked at this for Hadoop Core or HBase (although there is this Jira: https://issues.apache.org/jira/browse/HADOOP-4604), but there's some work for making ZooKeeper's jar OSGi compliant at https://issues.apache.org/jira/browse/ZOOKEEPER-425. Cheers, Tom On

Re: Command-line jobConf options in 0.18.3

2009-06-04 Thread Tom White
Actually, the space is needed for the option to be interpreted as a Hadoop option by ToolRunner. Without the space it sets a Java system property, which Hadoop will not automatically pick up. Ian, try putting the options after the classname and see if that helps. Otherwise, it would be useful to see a snippet

Re: InputFormat for fixed-width records?

2009-05-28 Thread Tom White
Hi Stuart, There isn't an InputFormat that comes with Hadoop to do this. Rather than pre-processing the file, it would be better to implement your own InputFormat. Subclass FileInputFormat and provide an implementation of getRecordReader() that returns your implementation of RecordReader to read
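The record-carving logic such a RecordReader would perform can be sketched without the Hadoop plumbing. The class and method names here are illustrative, and the sketch assumes the input contains only whole records; a real implementation would subclass FileInputFormat and do this against the split's stream.

```java
// Carve consecutive fixed-width records out of the input -- the core of
// what a fixed-width RecordReader's next() call would do per record.
public class FixedWidth {
    public static String[] split(String data, int width) {
        int n = data.length() / width;            // assumes no trailing partial record
        String[] records = new String[n];
        for (int i = 0; i < n; i++) {
            records[i] = data.substring(i * width, (i + 1) * width);
        }
        return records;
    }
}
```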

Re: SequenceFile and streaming

2009-05-28 Thread Tom White
Hi Walter, On Thu, May 28, 2009 at 6:52 AM, walter steffe ste...@tiscali.it wrote: Hello  I am a new user and I would like to use hadoop streaming with SequenceFile on both the input and output side. -The first difficulty arises from the lack of a simple tool to generate a SequenceFile

Re: avoid custom crawler getting blocked

2009-05-27 Thread Tom White
Have you had a look at Nutch (http://lucene.apache.org/nutch/)? It has solved this kind of problem. Cheers, Tom On Wed, May 27, 2009 at 9:58 AM, John Clarke clarke...@gmail.com wrote: My current project is to gather stats from a lot of different documents. We're not indexing, just getting

Re: RandomAccessFile with HDFS

2009-05-25 Thread Tom White
RandomAccessFile isn't supported directly, but you can seek when reading from files in HDFS (see FSDataInputStream's seek() method). Writing at an arbitrary offset in an HDFS file is not supported however. Cheers, Tom On Sun, May 24, 2009 at 1:33 PM, Stas Oskin stas.os...@gmail.com wrote: Hi.

Re: Circumventing Hadoop's data placement policy

2009-05-23 Thread Tom White
You can't use it yet, but https://issues.apache.org/jira/browse/HADOOP-3799 (Design a pluggable interface to place replicas of blocks in HDFS) would enable you to write your own policy so blocks are never placed locally. Might be worth following its development to check it can meet your need?

Re: Number of maps and reduces not obeying my configuration

2009-05-21 Thread Tom White
On Thu, May 21, 2009 at 5:18 AM, Foss User foss...@gmail.com wrote: On Wed, May 20, 2009 at 3:18 PM, Tom White t...@cloudera.com wrote: The number of maps to use is calculated on the client, since splits are computed on the client, so changing the value of mapred.map.tasks only

Re: Shutdown in progress exception

2009-05-21 Thread Tom White
On Wed, May 20, 2009 at 10:22 PM, Stas Oskin stas.os...@gmail.com wrote: You should only use this if you plan on manually closing FileSystems yourself from within your own shutdown hook. It's somewhat of an advanced feature, and I wouldn't recommend using this patch unless you fully

Re: multiple results for each input line

2009-05-21 Thread Tom White
steered me in the right direction! Thanks John 2009/5/20 Tom White t...@cloudera.com Hi John, You could do this with a map only-job (using NLineInputFormat, and setting the number of reducers to 0), and write the output key as docnameN,stat1,stat2,stat3,stat12 and a null value

Re: Number of maps and reduces not obeying my configuration

2009-05-20 Thread Tom White
The number of maps to use is calculated on the client, since splits are computed on the client, so changing the value of mapred.map.tasks only on the jobtracker will not have any effect. Note that the number of map tasks that you set is only a suggestion, and depends on the number of splits

Re: multiple results for each input line

2009-05-20 Thread Tom White
Hi John, You could do this with a map only-job (using NLineInputFormat, and setting the number of reducers to 0), and write the output key as docnameN,stat1,stat2,stat3,stat12 and a null value. This assumes that you calculate all 12 statistics in one map. Each output file would have a single
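A hypothetical old-API configuration for the map-only job described above. MyStatsJob and StatsMapper are placeholder names, and the linespermap property name is the 0.19/0.20-era one; treat this as a configuration sketch, not a tested setup.

```java
// Old org.apache.hadoop.mapred API; class names are placeholders.
JobConf conf = new JobConf(MyStatsJob.class);
conf.setInputFormat(NLineInputFormat.class);            // each map task gets N input lines
conf.setInt("mapred.line.input.format.linespermap", 1); // one line per mapper here
conf.setMapperClass(StatsMapper.class);                 // emits "docnameN,stat1,...,stat12" as the key, null value
conf.setNumReduceTasks(0);                              // map-only: no reduce, no shuffle
JobClient.runJob(conf);
```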

Re: Linking against Hive in Hadoop development tree

2009-05-20 Thread Tom White
On Fri, May 15, 2009 at 11:06 PM, Owen O'Malley omal...@apache.org wrote: On May 15, 2009, at 2:05 PM, Aaron Kimball wrote: In either case, there's a dependency there. You need to split it so that there are no cycles in the dependency tree. In the short term it looks like: avro: core:

Re: Shutdown in progress exception

2009-05-20 Thread Tom White
Looks like you are trying to copy a file to HDFS in a shutdown hook. Since you can't control the order in which shutdown hooks run, this won't work. There is a patch to allow Hadoop's FileSystem shutdown hook to be disabled so it doesn't close filesystems on exit. See

Re: Access to local filesystem working folder in map task

2009-05-19 Thread Tom White
Hi Chris, The task-attempt local working folder is actually just the current working directory of your map or reduce task. You should be able to pass your legacy command line exe and other files using the -files option (assuming you are using the Java interface to write your job, and you are

Re: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-18 Thread Tom White
On Mon, May 18, 2009 at 11:44 AM, Steve Loughran ste...@apache.org wrote: Grace wrote: To follow up this question, I have also asked help on Jrockit forum. They kindly offered some useful and detailed suggestions according to the JRA results. After updating the option list, the performance

Re: public IP for datanode on EC2

2009-05-14 Thread Tom White
Fixed now. Joydeep -Original Message- From: Joydeep Sen Sarma [mailto:jssa...@facebook.com] Sent: Wednesday, May 13, 2009 9:38 PM To: core-user@hadoop.apache.org Cc: Tom White Subject: RE: public IP for datanode on EC2 Thanks Philip. Very helpful (and great blog post)! This seems

Re: public IP for datanode on EC2

2009-05-14 Thread Tom White
(and resolve to public ip addresses from outside). The only data transfer that I would incur while submitting jobs from outside is the cost of copying the jar files and any other files meant for the distributed cache). That would be extremely small. -Original Message- From: Tom White

Re: HDFS to S3 copy problems

2009-05-12 Thread Tom White
- From: Tom White [mailto:t...@cloudera.com] Sent: Friday, May 08, 2009 1:36 AM To: core-user@hadoop.apache.org Subject: Re: HDFS to S3 copy problems Perhaps we should revisit the implementation of NativeS3FileSystem so that it doesn't always buffer the file on the client. We could have

Re: Mixing s3, s3n and hdfs

2009-05-08 Thread Tom White
Hi Kevin, The s3n filesystem treats each file as a single block, however you may be able to split files by setting the number of mappers appropriately (or setting mapred.max.split.size in the new MapReduce API in 0.20.0). S3 supports range requests, and the s3n implementation uses them, so it

Re: About Hadoop optimizations

2009-05-07 Thread Tom White
On Thu, May 7, 2009 at 6:05 AM, Foss User foss...@gmail.com wrote: Thanks for your response again. I could not understand a few things in your reply. So, I want to clarify them. Please find my questions inline. On Thu, May 7, 2009 at 2:28 AM, Todd Lipcon t...@cloudera.com wrote: On Wed, May

Re: move tasks to another machine on the fly

2009-05-06 Thread Tom White
Hi David, The MapReduce framework will attempt to rerun failed tasks automatically. However, if a task is running out of memory on one machine, it's likely to run out of memory on another, isn't it? Have a look at the mapred.child.java.opts configuration property for the amount of memory that

Re: Using multiple FileSystems in hadoop input

2009-05-06 Thread Tom White
Hi Ivan, I haven't tried this combination, but I think it should work. If it doesn't it should be treated as a bug. Tom On Wed, May 6, 2009 at 11:46 AM, Ivan Balashov ibalas...@iponweb.net wrote: Greetings to all, Could anyone suggest if Paths from different FileSystems can be used as input

Re: multi-line records and file splits

2009-05-06 Thread Tom White
Hi Rajarshi, FileInputFormat (SDFInputFormat's superclass) will break files into splits, typically on HDFS block boundaries (if the defaults are left unchanged). This is not a problem for your code however, since it will read every record that starts within a split (even if it crosses a split

Re: Specifying System Properties in the had

2009-04-30 Thread Tom White
Another way to do this would be to set a property in the Hadoop config itself. In the job launcher you would have something like: JobConf conf = ... conf.set("foo", "test"); Then you can read the property in your map or reduce task. Tom On Thu, Apr 30, 2009 at 3:25 PM, Aaron Kimball

Re: Patching and building produces no libcordio or libhdfs

2009-04-28 Thread Tom White
Have a look at the instructions on http://wiki.apache.org/hadoop/HowToRelease under the Building section. It tells you which environment settings and Ant targets you need to set. Tom On Tue, Apr 28, 2009 at 9:09 AM, Sid123 itis...@gmail.com wrote: HI I have applied a small patch for version

Re: How to run many jobs at the same time?

2009-04-21 Thread Tom White
You need to start each JobControl in its own thread so they can run concurrently. Something like: Thread t = new Thread(jobControl); t.start(); Then poll the jobControl.allFinished() method. Tom On Tue, Apr 21, 2009 at 10:02 AM, nguyenhuynh.mr nguyenhuynh...@gmail.com wrote: Hi all!
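A sketch of the run-in-a-thread-and-poll pattern described above. The real class is `org.apache.hadoop.mapred.jobcontrol.JobControl`, which implements `Runnable` and exposes `allFinished()`; the stand-in class here is hypothetical so the pattern can be shown self-contained:

```java
// Minimal stand-in for org.apache.hadoop.mapred.jobcontrol.JobControl:
// like the real class, it implements Runnable and reports completion
// via allFinished().
class FakeJobControl implements Runnable {
    private volatile boolean finished = false;

    public boolean allFinished() {
        return finished;
    }

    public void run() {
        // The real JobControl submits and monitors jobs here.
        finished = true;
    }
}

public class JobControlDemo {
    public static void main(String[] args) throws InterruptedException {
        FakeJobControl jobControl = new FakeJobControl();
        Thread t = new Thread(jobControl);   // run the JobControl concurrently
        t.start();
        while (!jobControl.allFinished()) {  // poll until all jobs are done
            Thread.sleep(100);
        }
        t.join();
        System.out.println("all jobs finished");
    }
}
```

With several `JobControl` instances, each gets its own thread and the launcher polls each one.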

Re: Interesting Hadoop/FUSE-DFS access patterns

2009-04-16 Thread Tom White
Not sure if will affect your findings, but when you read from a FSDataInputStream you should see how many bytes were actually read by inspecting the return value and re-read if it was fewer than you want. See Hadoop's IOUtils readFully() method. Tom On Mon, Apr 13, 2009 at 4:22 PM, Brian
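Hadoop's `IOUtils.readFully()` wraps a loop along these lines; a self-contained sketch using plain `java.io` streams (an `FSDataInputStream` behaves the same way for this purpose):

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

public class ReadFullyDemo {
    // Keep reading until the buffer is full: a single read() may return
    // fewer bytes than requested, so the return value must be checked
    // and the read retried with the remaining length.
    static void readFully(InputStream in, byte[] buf, int off, int len)
            throws IOException {
        while (len > 0) {
            int n = in.read(buf, off, len);
            if (n < 0) {
                throw new EOFException("stream ended early");
            }
            off += n;
            len -= n;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "hello, hdfs".getBytes("US-ASCII");
        byte[] buf = new byte[data.length];
        readFully(new ByteArrayInputStream(data), buf, 0, buf.length);
        System.out.println(new String(buf, "US-ASCII"));
    }
}
```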

Re: Example of deploying jars through DistributedCache?

2009-04-08 Thread Tom White
Does it work if you use addArchiveToClassPath()? Also, it may be more convenient to use GenericOptionsParser's -libjars option. Tom On Mon, Mar 2, 2009 at 7:42 AM, Aaron Kimball aa...@cloudera.com wrote: Hi all, I'm stumped as to how to use the distributed cache's classpath feature. I have
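A sketch of the `-libjars` invocation suggested above (jar and class names here are hypothetical; the option works when the driver uses `GenericOptionsParser`, e.g. via `ToolRunner`):

```shell
# Hypothetical jar and class names. -libjars ships the listed jars to
# the cluster and adds them to each task's classpath.
hadoop jar myjob.jar com.example.MyDriver \
  -libjars /path/to/dep1.jar,/path/to/dep2.jar \
  input output
```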

Re: RecordReader design heuristic

2009-03-18 Thread Tom White
Hi Josh, The other aspect to think about when writing your own record reader is input splits. As Jeff mentioned you really want mappers to be processing about one HDFS block's worth of data. If your inputs are significantly smaller, the overhead of creating mappers will be high and your jobs will

Re: Problem with com.sun.pinkdots.LogHandler

2009-03-17 Thread Tom White
Hi Paul, Looking at the stack trace, the exception is being thrown from your map method. Can you put some debugging in there to diagnose it? Detecting and logging the size of the array and the index you are trying to access should help. You can write to standard error and look in the task logs.
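A sketch of the kind of defensive logging suggested above, as a plain helper method (the names are hypothetical; in a real job this check would sit inside the map method, and the stderr output lands in the task logs):

```java
public class BoundsLogDemo {
    // Log the array length and the requested index to standard error
    // before accessing the array, so the task logs show what went wrong
    // instead of an unexplained ArrayIndexOutOfBoundsException.
    static String safeGet(String[] fields, int idx) {
        if (idx < 0 || idx >= fields.length) {
            System.err.println("bad index " + idx + " for array of length "
                    + fields.length);
            return null;
        }
        return fields[idx];
    }

    public static void main(String[] args) {
        String[] fields = "a,b,c".split(",");
        System.out.println(safeGet(fields, 1)); // in range
        System.out.println(safeGet(fields, 7)); // logged, returns null
    }
}
```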

Re: contrib EC2 with hadoop 0.17

2009-03-05 Thread Tom White
I haven't used Eucalyptus, but you could start by trying out the Hadoop EC2 scripts (http://wiki.apache.org/hadoop/AmazonEC2) with your Eucalyptus installation. Cheers, Tom On Tue, Mar 3, 2009 at 2:51 PM, falcon164 mujahid...@gmail.com wrote: I am new to hadoop. I want to run hadoop on

Re: Hadoop AMI for EC2

2009-03-05 Thread Tom White
Hi Richa, Yes there is. Please see http://wiki.apache.org/hadoop/AmazonEC2. Tom On Thu, Mar 5, 2009 at 4:13 PM, Richa Khandelwal richa...@gmail.com wrote: Hi All, Is there an existing Hadoop AMI for EC2 which has Hadoop set up on it? Thanks, Richa Khandelwal University Of California,

Re: MapReduce jobs with expensive initialization

2009-03-02 Thread Tom White
On any particular tasktracker slot, task JVMs are shared only between tasks of the same job. When the job is complete the task JVM will go away. So there is certainly no sharing between jobs. I believe the static singleton approach outlined by Scott will work since the map classes are in a single
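A sketch of the static singleton approach being endorsed, with a stand-in for the expensive resource (class names are hypothetical; in a real job the mapper's configure() or map() would call getInstance(), and the instance survives across tasks of the same job when JVM reuse is enabled):

```java
public class SharedResourceDemo {
    // Stand-in for an expensive-to-build resource, e.g. a large
    // in-memory lookup table loaded from the distributed cache.
    static class Lookup {
        static int buildCount = 0;
        Lookup() {
            buildCount++; // pretend this constructor is slow
        }
    }

    private static Lookup instance;

    // Lazily build the resource once per JVM; synchronized as a
    // precaution in case tasks ever run in multiple threads.
    static synchronized Lookup getInstance() {
        if (instance == null) {
            instance = new Lookup();
        }
        return instance;
    }

    public static void main(String[] args) {
        getInstance(); // first task builds it
        getInstance(); // later tasks in the same JVM reuse it
        System.out.println("built " + Lookup.buildCount + " time(s)");
    }
}
```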

Re: OutOfMemory error processing large amounts of gz files

2009-02-25 Thread Tom White
Do you experience the problem with and without native compression? Set hadoop.native.lib to false to disable native compression. Cheers, Tom On Tue, Feb 24, 2009 at 9:40 PM, Gordon Mohr goj...@archive.org wrote: If you're doing a lot of gzip compression/decompression, you *might* be hitting
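The setting referred to above, as a config fragment:

```xml
<!-- Illustrative only: disable the native compression libraries to
     rule them out when diagnosing OutOfMemory errors. -->
<property>
  <name>hadoop.native.lib</name>
  <value>false</value>
</property>
```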
