Re: Problem with setting up the cluster

2009-06-25 Thread Tom White
Have a look at the datanode log files on the datanode machines and see
what the error is in there.

Cheers,
Tom

On Thu, Jun 25, 2009 at 6:21 AM, .ke. sivakumarkesivaku...@gmail.com wrote:
 Hi all, I'm a student and I have been trying to set up a Hadoop cluster for
 a while but have been unsuccessful till now.

 I'm trying to set up a 4-node cluster:
 1 - namenode
 1 - job tracker
 2 - datanode / task tracker

 version of hadoop - 0.18.3

 *config in hadoop-site.xml* (which I have replicated in the conf directory
 of all 4 nodes)

 ***
  <configuration>
   <property>
    <name>mapred.job.tracker</name>
    <value>swin07.xx.xx.edu:9001</value>
   </property>
   <property>
    <name>fs.default.name</name>
    <value>hdfs://swin06.xx.xx.edu:9000</value>
   </property>
   <property>
    <name>dfs.data.dir</name>
    <value>/home/kesivakumar/hadoop/dfs/data</value>
    <final>true</final>
   </property>
   <property>
    <name>dfs.name.dir</name>
    <value>/home/kesivakumar/hadoop/dfs/name</value>
    <final>true</final>
   </property>
   <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop</value>
    <final>true</final>
   </property>
   <property>
    <name>mapred.system.dir</name>
    <value>/hadoop/mapred/system</value>
    <final>true</final>
   </property>
   <property>
    <name>dfs.replication</name>
    <value>1</value>
   </property>
  </configuration>

 ***


 The problem is that both of my datanodes are dead.
 The *slave files are configured properly* and to my surprise the
 tasktrackers are running
 (checked thru swin07:50030 and it showed 2 tasktrackers running)
 (swin06:50070 showed namenode is active but 0 data nodes active)
 so when I try copying the conf dir into the filesystem using the -put command,
 it throws errors. Below is the last part of the error output:



 

 09/06/25 00:36:30 WARN dfs.DFSClient: NotReplicatedYetException sleeping
 /user/kesivakumar/input/hadoop-env.sh retries left 1
 09/06/25 00:36:34 WARN dfs.DFSClient: DataStreamer Exception:
 org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
 /user/kesivakumar/input/hadoop-env.sh could only be replicated to 0 nodes,
 instead of 1
        at
 org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1123)
        at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:330)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:890)
        at org.apache.hadoop.ipc.Client.call(Client.java:716)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source)
        at
 org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2450)
        at
 org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2333)
        at
 org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1745)
        at
 org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1922)

 09/06/25 00:36:34 WARN dfs.DFSClient: Error Recovery for block null bad
 datanode[0]
 put: Could not get block locations. Aborting...
 Exception closing file /user/kesivakumar/input/hadoop-env.sh
 java.io.IOException: Could not get block locations. Aborting...
        at
 org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2153)
        at
 org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1745)
        at
 org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1899)

 *


 When I tried bin/hadoop dfsadmin -report, it says that 0 datanodes are
 available.

 And also, while formatting the namenode using bin/hadoop namenode -format,
 the format is done at swin06/*127.0.1.1*   ---   why is it not getting done
 at 127.0.0.1?

 How can I rectify these errors?

 Any help would be greatly appreciated

 Thank You..



Re: Unable to run Jar file in Hadoop.

2009-06-25 Thread Tom White
Hi Krishna,

You get this error when the jar file cannot be found. It looks like
/user/hadoop/hadoop-0.18.0-examples.jar is an HDFS path, when in fact
it should be a local path.
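
For example, something like the following should work (the local destination
here is just illustrative): copy the jar out of HDFS first and then run it
from the local filesystem, followed by the example program name and its
arguments.

  hadoop fs -get /user/hadoop/hadoop-0.18.0-examples.jar /tmp/hadoop-0.18.0-examples.jar
  bin/hadoop jar /tmp/hadoop-0.18.0-examples.jar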

Cheers,
Tom

On Thu, Jun 25, 2009 at 9:43 AM, krishna prasannasvk_prasa...@yahoo.com wrote:
 Oh! thanks Shravan

 Krishna.



 
 From: Shravan Mahankali shravan.mahank...@catalytic.com
 To: core-user@hadoop.apache.org
 Sent: Thursday, 25 June, 2009 1:50:51 PM
 Subject: RE: Unable to run Jar file in Hadoop.

 I am having a similar issue as well... there is no solution yet!

 Thank You,
 Shravan Kumar. M
 Catalytic Software Ltd. [SEI-CMMI Level 5 Company]
 -
 This email and any files transmitted with it are confidential and intended
 solely for the use of the individual or entity to whom they are addressed.
 If you have received this email in error please notify the system
 administrator - netopshelpd...@catalytic.com

 -Original Message-
 From: krishna prasanna [mailto:svk_prasa...@yahoo.com]
 Sent: Thursday, June 25, 2009 1:01 PM
 To: core-user@hadoop.apache.org
 Subject: Unable to run Jar file in Hadoop.

 Hi,

 When I am trying to run a jar in Hadoop, it gives me the following error:

 had...@krishna-dev:/usr/local/hadoop$ bin/hadoop jar
 /user/hadoop/hadoop-0.18.0-examples.jar
 java.io.IOException: Error opening job jar:
 /user/hadoop/hadoop-0.18.0-examples.jar
     at org.apache.hadoop.util.RunJar.main(RunJar.java:90)
     at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
     at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
 Caused by: java.util.zip.ZipException: error in opening zip file
     at java.util.zip.ZipFile.open(Native Method)
      at java.util.zip.ZipFile.<init>(ZipFile.java:114)
      at java.util.jar.JarFile.<init>(JarFile.java:133)
      at java.util.jar.JarFile.<init>(JarFile.java:70)
     at org.apache.hadoop.util.RunJar.main(RunJar.java:88)
     ... 4 more

 Here is the file permissions,
  -rw-r--r--   2 hadoop supergroup  91176 2009-06-25 12:49
 /user/hadoop/hadoop-0.18.0-examples.jar

 Can somebody help me with this?

 Thanks in advance,
 Krishna.




Re: Rebalancing Hadoop Cluster running 15.3

2009-06-25 Thread Tom White
Hi Usman,

Before the rebalancer was introduced one trick people used was to
increase the replication on all the files in the system, wait for
re-replication to complete, then decrease the replication to the
original level. You can do this using hadoop fs -setrep.
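
For example (a sketch; the flags are as in recent releases, so check
bin/hadoop fs -help on 0.15 - here the whole filesystem is bumped to
replication 3 and later dropped back to your original 2):

  hadoop fs -setrep -R 3 /
  (wait for re-replication, e.g. until hadoop fsck / reports no
   under-replicated blocks)
  hadoop fs -setrep -R 2 /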

Cheers,
Tom

On Thu, Jun 25, 2009 at 10:33 AM, Usman Waheedusm...@opera.com wrote:
 Hi,

 One of our test clusters is running HADOOP 15.3 with replication level set
 to 2. The datanodes are not balanced at all.

 Datanode_1: 52%
 Datanode_2: 82%
 Datanode_3: 30%

 15.3 does not have the rebalancer capability; we are planning to upgrade, but
 not for now.

 If I take out Datanode_1 from the cluster (decommission for some time) will
 HADOOP balance so that Datanode_2 and Datanode_3 will even out to 56%?
 Then I can re-introduce Datanode_1 back into the cluster.

 Comments/Suggestions please?

 Thanks,
 Usman



Re: Rebalancing Hadoop Cluster running 15.3

2009-06-25 Thread Tom White
You can change the value of hadoop.root.logger in
conf/log4j.properties to change the log level globally. See also the
section Custom Logging levels in the same file to set levels on a
per-component basis.
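
For example, to quieten the warnings you quoted (a sketch; the logger name is
taken from your log output, and ERROR suppresses WARN and below for just that
class):

  # in conf/log4j.properties
  log4j.logger.org.apache.hadoop.fs.FSNamesystem=ERROR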

You can also use hadoop daemonlog to set log levels on a temporary
basis (they are reset on restart). I'm not sure if this was in Hadoop
0.15.

Cheers,
Tom

On Thu, Jun 25, 2009 at 11:12 AM, Usman Waheedusm...@opera.com wrote:
 Hi Tom,

 Thanks for the trick :).

 I tried by setting the replication to 3 in the hadoop-default.xml but then
 the namenode-logfile in /var/log/hadoop started getting full with the
 messages marked in bold:

 2009-06-24 14:39:06,338 INFO org.apache.hadoop.dfs.StateChange: STATE*
 SafeModeInfo.leave: Safe mode is OFF.
 2009-06-24 14:39:06,339 INFO org.apache.hadoop.dfs.StateChange: STATE*
 Network topology has 1 racks and 3 datanodes
 2009-06-24 14:39:06,339 INFO org.apache.hadoop.dfs.StateChange: STATE*
 UnderReplicatedBlocks has 48545 blocks
 2009-06-24 14:39:07,655 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
 NameSystem.pendingTransfer: ask 10.20.11.45:50010 to replicate
 blk_-4602580985572290582 to datanode(s) 10.20.11.44:50010
 2009-06-24 14:39:07,655 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
 NameSystem.pendingTransfer: ask 10.20.11.45:50010 to replicate
 blk_-4602036196619511999 to datanode(s) 10.20.11.44:50010
 2009-06-24 14:39:07,666 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
 NameSystem.pendingTransfer: ask 10.20.11.43:50010 to replicate
 blk_-4601863051065326105 to datanode(s) 10.20.11.44:50010
 2009-06-24 14:39:07,666 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
 NameSystem.pendingTransfer: ask 10.20.11.43:50010 to replicate
 blk_-4601770656364938220 to datanode(s) 10.20.11.44:50010
 2009-06-24 14:39:10,829 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
 NameSystem.addStoredBlock: blockMap updated: 10.20.11.44:50010 is added to
 blk_-4601770656364938220
 2009-06-24 14:39:10,832 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
 NameSystem.pendingTransfer: ask 10.20.11.45:50010 to replicate
 blk_-4601706607039808418 to datanode(s) 10.20.11.44:50010
 2009-06-24 14:39:10,833 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
 NameSystem.pendingTransfer: ask 10.20.11.45:50010 to replicate
 blk_-4601652202073012439 to datanode(s) 10.20.11.44:50010
 2009-06-24 14:39:10,834 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
 NameSystem.pendingTransfer: ask 10.20.11.43:50010 to replicate
 blk_-4601470720696217621 to datanode(s) 10.20.11.44:50010
 2009-06-24 14:39:10,834 INFO org.apache.hadoop.dfs.StateChange: BLOCK*
 NameSystem.pendingTransfer: ask 10.20.11.43:50010 to replicate
 blk_-4601267705629076611 to datanode(s) 10.20.11.44:50010
 *2009-06-24 14:39:13,899 WARN org.apache.hadoop.fs.FSNamesystem: Not able to
 place enough replicas, still in need of 1
 2009-06-24 14:39:13,899 WARN org.apache.hadoop.fs.FSNamesystem: Not able to
 place enough replicas, still in need of 1
 2009-06-24 14:39:13,899 WARN org.apache.hadoop.fs.FSNamesystem: Not able to
 place enough replicas, still in need of 1
 2009-06-24 14:39:13,900 WARN org.apache.hadoop.fs.FSNamesystem: Not able to
 place enough replicas, still in need of 1
 2009-06-24 14:39:13,900 WARN org.apache.hadoop.fs.FSNamesystem: Not able to
 place enough replicas, still in need of 1
 2009-06-24 14:39:13,900 WARN org.apache.hadoop.fs.FSNamesystem: Not able to
 place enough replicas, still in need of 1
 2009-06-24 14:39:13,901 WARN org.apache.hadoop.fs.FSNamesystem: Not able to
 place enough replicas, still in need of 1
 2009-06-24 14:39:13,901 WARN org.apache.hadoop.fs.FSNamesystem: Not able to
 place enough replicas, still in need of 1*

 It is a very small cluster with limited disk space. The disk was getting
 full because of all these extra messages there were being written to the
 logfile. Eventually the file system would fill up and Hadoop hangs.
 This happened when i set the dfs.replication = 3 in the hadoop-default.xml
 and restarted the cluster.

 Is there a way I can turn off these WARN messages, which are filling up the
 file system? I can run the command on the command line like you advised with
 replication set to 3 and then, once done, set it back to 2.
 Currently the dfs.replication is set to 2 in the hadoop-default.xml.

 Thanks,
 Usman

 Hi Usman,

 Before the rebalancer was introduced one trick people used was to
 increase the replication on all the files in the system, wait for
 re-replication to complete, then decrease the replication to the
 original level. You can do this using hadoop fs -setrep.

 Cheers,
 Tom

 On Thu, Jun 25, 2009 at 10:33 AM, Usman Waheedusm...@opera.com wrote:


 Hi,

 One of our test clusters is running HADOOP 15.3 with replication level
 set
 to 2. The datanodes are not balanced at all.

 Datanode_1: 52%
 Datanode_2: 82%
 Datanode_3: 30%

 15.3 does not have the rebalancer capability, we are planning to upgrade
 but
 not for now.

 If i take out Datanode_1 from the cluster (decommission 

Re: HDFS Safemode and EC2 EBS?

2009-06-25 Thread Tom White
Hi Chris,

You should really start all the slave nodes to be sure that you don't
lose data. If you start fewer than #nodes - #replication + 1 nodes
then you are virtually guaranteed to lose blocks. Starting 6 nodes out
of 10 will cause the filesystem to remain in safe mode, as you've
seen.

BTW I've just created a Jira for EBS support
(https://issues.apache.org/jira/browse/HADOOP-6108) which you might be
interested in.

Cheers,
Tom

On Thu, Jun 25, 2009 at 3:51 PM, Chris Curtincurtin.ch...@gmail.com wrote:
 Hi,

 I am using 0.19.0 on EC2. The Hadoop execution and HDFS directories are on
 EBS volumes mounted to each node in my EC2 cluster. Only the install of
 hadoop is in the AMI. We have 10 EBS volumes and when the cluster starts it
 randomly picks one for each slave. We don't always start all 10 slaves
 depending on what type of work we are going to do.

 Every third or fourth start of the cluster the namenode goes into safemode
 and won't come out automatically. Restarting datanodes and task trackers on
 each of the slaves doesn't help. Not much in the log files besides the error
 about waiting for the available %. Forcing it out of safe mode allows the
 cluster to start working.

 My only thought is that something is being stored on one of the EBS volumes
 not being mounted when starting a smaller configuration (say 6 nodes instead
 of 10). But isn't HDFS fault tolerant so that if there is a missing node it
 carries on?

 Any advice on why the namenode and datanodes can't find all the data blocks?
 Or where to look for more information about what might be going on?

 Thanks,

 Chris



Re: EC2, Hadoop, copy file from CLUSTER_MASTER to CLUSTER, failing

2009-06-24 Thread Tom White
Hi Saptarshi,

The group permissions open the firewall ports to enable access, but
there are no shared keys on the cluster by default. See
https://issues.apache.org/jira/browse/HADOOP-4131 for a patch to the
scripts that shares keys to allow SSH access between machines in the
cluster.

Cheers,
Tom

On Sat, Jun 20, 2009 at 7:09 AM, Saptarshi Guhasaptarshi.g...@gmail.com wrote:
 Hello,
 I have a cluster with 1 master and 1 slave (testing). In the EC2
 scripts, in the hadoop-ec2-init-remote.sh file, I wish to copy a file
 from
 the MASTER to the CLUSTER i.e in the slave section

  scp $MASTER_HOST:/tmp/v  /tmp/v

 However, this didn't work, and when I logged in, ssh'd to the slave and
 tried the command, I got the following error:

 Permission denied (publickey,gssapi-with-mic)

 Yet, the group permissions appear to be valid i.e
  ec2-authorize $CLUSTER_MASTER -o $CLUSTER -u $AWS_ACCOUNT_ID
  ec2-authorize $CLUSTER -o $CLUSTER_MASTER -u $AWS_ACCOUNT_ID

 So I don't see why I can't ssh into the MASTER group from a slave.

 Any suggestion as to where I'm going wrong?
 Regards
 Saptarshi

 P.S I know I can copy a file from S3, but would like to know what is
 going on here.



Re: Looking for correct way to implements WritableComparable in Hadoop-0.17

2009-06-24 Thread Tom White
Hi Kun,

The book's code is for 0.20.0. In Hadoop 0.17.x WritableComparable was
not generic, so you need a declaration like:

public class IntPair implements WritableComparable {

}

And the compareTo() method should look like this:

public int compareTo(Object o) {
  IntPair ip = (IntPair) o;
  int cmp = compare(first, ip.first);
  if (cmp != 0) {
    return cmp;
  }
  return compare(second, ip.second);
}

Finally, if you are using Java 5 you should remove the @Override annotations.

Cheers,
Tom

On Sun, Jun 21, 2009 at 1:16 AM, Kunsheng Chen ke...@yahoo.com wrote:

 Hello everyone,

 I am writing my own Comparator inherits from WritableComparable.

 I got the following code from Hadoop: The Definitive Guide, but it is not
 working at all; the compiler complains that WritableComparable does not take
 parameters. The book might be using Hadoop 0.21.

 I also tried the old method for 0.18 version as below:

 http://hadoop.apache.org/core/docs/r0.18.3/api/org/apache/hadoop/io/WritableComparable.html

 but it complains that I haven't implemented the compareTo method, which I
 actually did.

 I am wondering if I have to reinstall Hadoop again (I'd prefer not to) or
 whether there is an old way to do it.


 Any idea is well appreciated!

 -Kun

 --

 import java.io.*;

 import org.apache.hadoop.io.*;

 public class IntPair implements WritableComparable<IntPair> {

  private int first;
  private int second;
  private Text third;

  public IntPair(int first, int second, Text third) {
    set(first, second, third);
  }

  public void set(int first, int second, Text third) {
    this.first = first;
    this.second = second;
    this.third = third;
  }

  public int getFirst() {
    return first;
  }

  public int getSecond() {
    return second;
  }

  public Text getThird() {

    return third;
  }

   @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(first);
    out.writeInt(second);
    third.write(out);
  }

   @Override
  public void readFields(DataInput in) throws IOException {
    first = in.readInt();
    second = in.readInt();
    // Redundant
    third.readFields(in);

  }

   @Override
  public int hashCode() {
    return first * 163 + second + third.hashCode();
  }

   @Override
  public boolean equals(Object o) {
    if (o instanceof IntPair) {
      IntPair ip = (IntPair) o;
       return first == ip.first && second == ip.second &&
         third.equals(ip.third);
    }
    return false;
  }
   @Override
  public String toString() {
     return first + "\t" + second + "\t" + third;
  }

   @Override
  public int compareTo(IntPair ip) {
    int cmp = compare(first, ip.first);
    if (cmp != 0) {
      return cmp;
    }
    return compare(second, ip.second);
  }

  /**
   * Convenience method for comparing two ints.
   */
  public static int compare(int a, int b) {
     return (a < b ? -1 : (a == b ? 0 : 1));
  }

 }

 





Re: Is it possible? I want to group data blocks.

2009-06-24 Thread Tom White
You might be interested in
https://issues.apache.org/jira/browse/HDFS-385, where there is
discussion about how to add pluggable block placement to HDFS.

Cheers,
Tom

On Tue, Jun 23, 2009 at 5:50 PM, Alex Loddengaarda...@cloudera.com wrote:
 Hi Hyunsik,

 Unfortunately you can't control the servers that blocks go on.  Hadoop does
 block allocation for you, and it tries its best to distribute data evenly
 among the cluster, so long as replicated blocks reside on different
 machines, on different racks (assuming you've made Hadoop rack-aware).

 Hope this clears things up.

 Alex

 2009/6/23 Hyunsik Choi c0d3h...@gmail.com

 Hi all,

 I would like to give data locality. In other words, I want to place
 certain data blocks on one machine. In some problems, subsets of an
 entire dataset need one another for answer. Most of the graph problems
 are good examples.

 Is it possible? If impossible, can you advice about that?

 Thank you in advance.

 - Hyunsik Choi -




Re: Running Hadoop/Hbase in a OSGi container

2009-06-11 Thread Tom White
Hi Ninad,

I don't know if anyone has looked at this for Hadoop Core or HBase
(although there is this Jira:
https://issues.apache.org/jira/browse/HADOOP-4604), but there's some
work for making ZooKeeper's jar OSGi compliant at
https://issues.apache.org/jira/browse/ZOOKEEPER-425.

Cheers,
Tom

On Thu, Jun 11, 2009 at 1:10 AM, Ninad Rautninad.evera...@gmail.com wrote:
 Hi,

 Our architecture team wants to run Hadoop/Hbase and the mapreduce jobs using
 OSGi container. This is to take advantages of the OSGi framework to have a
 pluggable architecture.
 I have searched through the net and looks like people are working or have
 achieved success in this. Can some one please help me understand the
 technical feasibility and if feasible the way to move forward?


 Regards,
 Ninad.



Re: Command-line jobConf options in 0.18.3

2009-06-04 Thread Tom White
Actually, the space is needed, to be interpreted as a Hadoop option by
ToolRunner. Without the space it sets a Java system property, which
Hadoop will not automatically pick up.

Ian, try putting the options after the classname and see if that
helps. Otherwise, it would be useful to see a snippet of the program
code.
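
For example, with the jar and class from Ian's command, the invocation would
look something like this (assuming the driver goes through ToolRunner, so the
generic options are parsed after the class name):

bin/hadoop jar p3l-3.5.jar gov.nist.nlpir.prise.mapred.MapReduceIndexer \
    -files collopts -D prise.collopts=collopts input output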

Thanks,
Tom

On Thu, Jun 4, 2009 at 8:23 PM, Vasyl Keretsman vasi...@gmail.com wrote:
 Perhaps, there should not be the space between -D and your option ?

 -Dprise.collopts=

 Vasyl



 2009/6/4 Ian Soboroff ian.sobor...@nist.gov:

 bin/hadoop jar -files collopts -D prise.collopts=collopts p3l-3.5.jar 
 gov.nist.nlpir.prise.mapred.MapReduceIndexer input output

 The 'prise.collopts' option doesn't appear in the JobConf.

 Ian

 Aaron Kimball aa...@cloudera.com writes:

 Can you give an example of the exact arguments you're sending on the command
 line?
 - Aaron

 On Wed, Jun 3, 2009 at 5:46 PM, Ian Soboroff ian.sobor...@nist.gov wrote:

     If after I call getConf to get the conf object, I manually add the key/
     value pair, it's there when I need it.  So it feels like ToolRunner 
 isn't
     parsing my args for some reason.

     Ian

     On Jun 3, 2009, at 8:45 PM, Ian Soboroff wrote:

         Yes, and I get the JobConf via 'JobConf job = new JobConf(conf,
         the.class)'.  The conf is the Configuration object that comes from
         getConf.  Pretty much copied from the WordCount example (which this
         program used to be a long while back...)

         thanks,
         Ian

         On Jun 3, 2009, at 7:09 PM, Aaron Kimball wrote:

             Are you running your program via ToolRunner.run()? How do you
             instantiate the JobConf object?
             - Aaron

             On Wed, Jun 3, 2009 at 10:19 AM, Ian Soboroff 
             ian.sobor...@nist.gov wrote:
             I'm backporting some code I wrote for 0.19.1 to 0.18.3 (long
             story), and I'm finding that when I run a job and try to pass
             options with -D on the command line, that the option values 
 aren't
             showing up in my JobConf.  I logged all the key/value pairs in 
 the
             JobConf, and the option I passed through with -D isn't there.

             This worked in 0.19.1... did something change with command-line
             options from 18 to 19?

             Thanks,
             Ian




Re: InputFormat for fixed-width records?

2009-05-28 Thread Tom White
Hi Stuart,

There isn't an InputFormat that comes with Hadoop to do this. Rather
than pre-processing the file, it would be better to implement your own
InputFormat. Subclass FileInputFormat and provide an implementation of
getRecordReader() that returns your implementation of RecordReader to
read fixed width records. In the next() method you would do something
like:

byte[] buf = new byte[100];
IOUtils.readFully(in, buf, pos, 100);
pos += 100;

You would also need to check for the end of the stream. See
LineRecordReader for some ideas. You'll also have to handle finding
the start of records for a split, which you can do by looking at the
offset and seeking to the next multiple of 100.

If the RecordReader was a RecordReader<NullWritable, BytesWritable>
(no keys) then it would return each record as a byte array to the
mapper, which would then break it into fields. Alternatively, you
could do it in the RecordReader, and use your own type which
encapsulates the fields for the value.

Hope this helps.

Cheers,
Tom

On Thu, May 28, 2009 at 1:15 PM, Stuart White stuart.whi...@gmail.com wrote:
 I need to process a dataset that contains text records of fixed length
 in bytes.  For example, each record may be 100 bytes in length, with
 the first field being the first 10 bytes, the second field being the
 second 10 bytes, etc...  There are no newlines on the file.  Field
 values have been either whitespace-padded or truncated to fit within
 the specific locations in these fixed-width records.

 Does Hadoop have an InputFormat to support processing of such files?
 I looked but couldn't find one.

 Of course, I could pre-process the file (outside of Hadoop) to put
 newlines at the end of each record, but I'd prefer not to require such
 a prep step.

 Thanks.



Re: SequenceFile and streaming

2009-05-28 Thread Tom White
Hi Walter,

On Thu, May 28, 2009 at 6:52 AM, walter steffe ste...@tiscali.it wrote:
 Hello
  I am a new user and I would like to use hadoop streaming with
 SequenceFile in both input and output side.

 -The first difficulty arises from the lack of a simple tool to generate
 a SequenceFile starting from a set of files in a given directory.
 I would like to have something similar to tar -cvf file.tar foo/
 This should work also in the opposite direction like tar -xvf file.tar

There's a tool for turning tar files into sequence files here:
http://stuartsierra.com/2008/04/24/a-million-little-files


 -Another important feature that I would like to see is the possibility
 to feed the mapper stdin with the whole content of a file (extracted
 from the SequenceFile), disregarding the key.

Have a look at SequenceFileAsTextInputFormat which will do this for
you (except the key is the sequence file's key).
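
With streaming that would look something like this (a sketch based on your
command below, with the input format class named in full; the keys come out as
the sequence file keys rendered as text):

  $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
                  -input /user/me/inputSequenceFile \
                  -output /user/me/output \
                  -inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat \
                  -mapper myscript.sh \
                  -reducer NONE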

  Using each file as a tar archive, I would like to be able to do:

  $HADOOP_HOME/bin/hadoop  jar $HADOOP_HOME/hadoop-streaming.jar \
                  -input /user/me/inputSequenceFile  \
                  -output /user/me/outputSequenceFile  \
                  -inputformat SequenceFile
                  -outputformat SequenceFile
                  -mapper myscript.sh
                  -reducer NONE

   myscript.sh should work as a filter which takes its input from
  stdin and put the output on stdout:

  tar -x
  do something on the generated dir and create an outputfile
  cat outputfile

 The output file should (automatically) go into the outputSequenceFile.

  I think that this would be a very useful schema which fits well with
  the mapreduce requirements on one side and with the unix commands on the
  other side. It should not be too difficult to implement the tools
  needed for that.

I totally agree - having more tools to better integrate sequence files
and map files with unix tools would be very handy.

Tom



  Walter



Re: avoid custom crawler getting blocked

2009-05-27 Thread Tom White
Have you had a look at Nutch (http://lucene.apache.org/nutch/)? It has
solved this kind of problem.

Cheers,
Tom

On Wed, May 27, 2009 at 9:58 AM, John Clarke clarke...@gmail.com wrote:
 My current project is to gather stats from a lot of different documents.
 We're not indexing, just getting quite specific stats for each document.
 We gather 12 different stats from each document.

 Our requirements have changed somewhat now, originally it was working on
 documents from our own servers but now it needs to fetch other ones from
 quite a large variety of sources.

 My approach up to now was to have the map function simply take each filepath
 (or now URL) in turn, fetch the document, calculate the stats and output
 those stats.

 My new problem is some of the locations we are now visiting don't like their
 IP being hit multiple times in a row.

 Is it possible to check a URL against a visited list of IPs and if recently
 visited either wait for a certain amount of time or push it back onto the
 input stack so it will be processed later in the queue?

 Or is there a better way?

 Thanks,
 John



Re: RandomAccessFile with HDFS

2009-05-25 Thread Tom White
RandomAccessFile isn't supported directly, but you can seek when
reading from files in HDFS (see FSDataInputStream's seek() method).
Writing at an arbitrary offset in an HDFS file is not supported
however.
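
A minimal sketch of seeking to an offset (the path, offset, and buffer size
are illustrative; the configuration is picked up from hadoop-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SeekExample {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FSDataInputStream in = fs.open(new Path(args[0]));
    in.seek(1024L);               // position the stream at an arbitrary byte offset
    byte[] buf = new byte[4096];
    int n = in.read(buf);         // read from that offset onwards
    System.out.println("read " + n + " bytes at offset 1024");
    in.close();
  }
}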

Cheers,
Tom

On Sun, May 24, 2009 at 1:33 PM, Stas Oskin stas.os...@gmail.com wrote:
 Hi.

 Any idea if RandomAccessFile is going to be supported in HDFS?

 Regards.



Re: Circumventing Hadoop's data placement policy

2009-05-23 Thread Tom White
You can't use it yet, but
https://issues.apache.org/jira/browse/HADOOP-3799 (Design a pluggable
interface to place replicas of blocks in HDFS) would enable you to
write your own policy so blocks are never placed locally. Might be
worth following its development to check it can meet your need?

Cheers,
Tom

On Sat, May 23, 2009 at 8:06 PM, jason hadoop jason.had...@gmail.com wrote:
 Can you give your machines multiple IP addresses, and bind the grid server
 to a different IP than the datanode
 With solaris you could put it in a different zone,

 On Sat, May 23, 2009 at 10:13 AM, Brian Bockelman 
 bbock...@math.unl.eduwrote:

 Hey all,

 Had a problem I wanted to ask advice on.  The Caltech site I work with
 currently have a few GridFTP servers which are on the same physical machines
 as the Hadoop datanodes, and a few that aren't.  The GridFTP server has a
 libhdfs backend which writes incoming network data into HDFS.

 They've found that the GridFTP servers which are co-located with HDFS
 datanode have poor performance because data is incoming at a much faster
 rate than the HDD can handle.  The standalone GridFTP servers, however, push
 data out to multiple nodes at one, and can handle the incoming data just
 fine (200MB/s).

 Is there any way to turn off the preference for the local node?  Can anyone
 think of a good workaround to trick HDFS into thinking the client isn't on
 the same node?

 Brian




 --
 Alpha Chapters of my book on Hadoop are available
 http://www.apress.com/book/view/9781430219422
 www.prohadoopbook.com a community for Hadoop Professionals



Re: Number of maps and reduces not obeying my configuration

2009-05-21 Thread Tom White
On Thu, May 21, 2009 at 5:18 AM, Foss User foss...@gmail.com wrote:
 On Wed, May 20, 2009 at 3:18 PM, Tom White t...@cloudera.com wrote:
 The number of maps to use is calculated on the client, since splits
 are computed on the client, so changing the value of mapred.map.tasks
 only on the jobtracker will not have any effect.

 Note that the number of map tasks that you set is only a suggestion,
 and depends on the number of splits actually created. In your case it
 looks like 4 splits were created. As a rule, you shouldn't set the
 number of map tasks, since by default one map task is created for each
 HDFS block, which works well for most applications. This is explained
 further in the javadoc:
 http://hadoop.apache.org/core/docs/r0.19.1/api/org/apache/hadoop/mapred/JobConf.html#setNumMapTasks(int)

 The number of reduces to use is determined by the JobConf that is
 created on the client, so it uses the client's hadoop-site.xml, not
 the jobtracker's one. This is why it is set to 1, even though you set
 it to 2 on the jobtracker.

 If you don't want to set configuration properties in code (and I agree
 it's often a good idea not to hardcode things like the number of maps
 or reduces in code), then you can make your driver use Tool and
 ToolRunner as Chuck explained.

 Finally, in general you should try to keep hadoop-site.xml the same
 across your clients and cluster nodes to avoid surprises about which
 value has been set.

 Hope this helps,

 Tom

 By client do you mean the machine where I logged in and invoked
 'hadoop jar' command to submit and run my job?

Yes.


Re: Shutdown in progress exception

2009-05-21 Thread Tom White
On Wed, May 20, 2009 at 10:22 PM, Stas Oskin stas.os...@gmail.com wrote:

 You should only use this if you plan on manually closing FileSystems
 yourself from within your own shutdown hook. It's somewhat of an advanced
 feature, and I wouldn't recommend using this patch unless you fully
 understand the ramifications of modifying the shutdown sequence.


 Standard dfs.close() would do the trick, no?

After you've performed your application shutdown actions you should
call FileSystem's closeAll() method.
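
As a rough sketch (only sensible with the HADOOP-4829 behaviour, where Hadoop
does not register its own hook; the application shutdown work in the comments
is whatever your app needs to do):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;

public class AppShutdown {
  public static void install() {
    Runtime.getRuntime().addShutdownHook(new Thread() {
      public void run() {
        try {
          // 1. do the application's own shutdown work first
          //    (e.g. the final copies to HDFS)
          // 2. only then release the cached FileSystem instances
          FileSystem.closeAll();
        } catch (IOException e) {
          // nothing useful to do; the JVM is exiting
        }
      }
    });
  }
}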



 Just uploaded a patch based on branch 18 for you to that JIRA.


 Thanks a lot!



Re: multiple results for each input line

2009-05-21 Thread Tom White
You could combine them into one file using a reduce stage (with a
single reducer), or by using hadoop fs -getmerge on the output
directory.
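
For example (paths illustrative):

  hadoop fs -getmerge output-dir merged-stats.txt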

Cheers,
Tom

On Thu, May 21, 2009 at 3:14 PM, John Clarke clarke...@gmail.com wrote:
 Hi,

 I want one output file not multiple but I think your reply has steered me in
 the right direction!
 Thanks
 John

 2009/5/20 Tom White t...@cloudera.com

 Hi John,

 You could do this with a map only-job (using NLineInputFormat, and
 setting the number of reducers to 0), and write the output key as
 docnameN,stat1,stat2,stat3,stat12 and a null value. This assumes
 that you calculate all 12 statistics in one map. Each output file
 would have a single line in it.

 Cheers,
 Tom

 On Wed, May 20, 2009 at 10:21 AM, John Clarke clarke...@gmail.com wrote:
  Hi,
 
  I'm having some trouble implementing what I want to achieve...
 essentially I
  have a large input list of documents that I want to get statistics on.
 For
  each document I have 12 different stats to work out.
 
  So my input file is a text file with one document filepath on each line.
 The
  documents are stored on a remote server. I want to fetch each document
 and
  calculate certain stats from it.
 
  My problem is with the output.
 
  I want my output to be similar to this:
 
  docname1,stat1,stat2,stat3,stat12
  docname2,stat1,stat2,stat3,stat12
  docname3,stat1,stat2,stat3,stat12
  .
  .
  .
  docnameN,stat1,stat2,stat3,stat12
 
  I can fetch the document in my map code and perform my stats calculation
 on
  it but dont know how to return more than one value for a key, the key in
  this case being the document name.
 
  Cheers,
  John
 




Re: Number of maps and reduces not obeying my configuration

2009-05-20 Thread Tom White
The number of maps to use is calculated on the client, since splits
are computed on the client, so changing the value of mapred.map.tasks
only on the jobtracker will not have any effect.

Note that the number of map tasks that you set is only a suggestion,
and depends on the number of splits actually created. In your case it
looks like 4 splits were created. As a rule, you shouldn't set the
number of map tasks, since by default one map task is created for each
HDFS block, which works well for most applications. This is explained
further in the javadoc:
http://hadoop.apache.org/core/docs/r0.19.1/api/org/apache/hadoop/mapred/JobConf.html#setNumMapTasks(int)

The number of reduces to use is determined by the JobConf that is
created on the client, so it uses the client's hadoop-site.xml, not
the jobtracker's one. This is why it is set to 1, even though you set
it to 2 on the jobtracker.

If you don't want to set configuration properties in code (and I agree
it's often a good idea not to hardcode things like the number of maps
or reduces in code), then you can make your driver use Tool and
ToolRunner as Chuck explained.
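
As a sketch, a driver that picks up -D options via ToolRunner looks something
like this (old mapred API; the class name is illustrative):

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    // getConf() already contains anything passed with -D on the command line
    JobConf job = new JobConf(getConf(), MyDriver.class);
    // ... set input/output paths, mapper, reducer, etc. ...
    JobClient.runJob(job);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner strips the generic options (-D, -files, ...) before calling run()
    System.exit(ToolRunner.run(new MyDriver(), args));
  }
}

which you would then run as, for example:

  bin/hadoop jar myjob.jar MyDriver -D mapred.reduce.tasks=2 input output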

Finally, in general you should try to keep hadoop-site.xml the same
across your clients and cluster nodes to avoid surprises about which
value has been set.

Hope this helps,

Tom

On Wed, May 20, 2009 at 5:21 AM, Foss User foss...@gmail.com wrote:
 On Wed, May 20, 2009 at 3:39 AM, Chuck Lam chuck@gmail.com wrote:
 Can you set the number of reducers to zero and see if it becomes a map only
 job? If it does, then it's able to read in the mapred.reduce.tasks property
 correctly but just refuse to have 2 reducers. In that case, it's most likely
 you're running in local mode, which doesn't allow more than 1 reducer.

 As I have already mentioned in my original mail, I am not running it
 in local mode. Quoting from my original mail:

 My configuration file is set as follows:

 mapred.map.tasks = 2
 mapred.reduce.tasks = 2

 However, the description of these properties mention that these
 settings would be ignored if mapred.job.tracker is set as 'local'.
 Mine is set properly with IP address, port number.


 If setting zero doesn't change anything, then your config file is not being
 read, or it's being overridden.

 As an aside, if you use ToolRunner in your Hadoop program, then it will
 support generic options such that you can run your program with the option
 -D mapred.reduce.tasks=2
 to tell it to use 2 reducers. This allows you to set the number of reducers
 on a per-job basis.



 I understand that it is being overridden by something else. What I
 want to know is which file is overriding it. Also, please note that I
 have these settings only in the conf/hadoop-site.xml of job tracker
 node. Is that enough?



Re: multiple results for each input line

2009-05-20 Thread Tom White
Hi John,

You could do this with a map only-job (using NLineInputFormat, and
setting the number of reducers to 0), and write the output key as
docnameN,stat1,stat2,stat3,stat12 and a null value. This assumes
that you calculate all 12 statistics in one map. Each output file
would have a single line in it.
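
The relevant job setup would be something like this inside the driver (a
sketch against the old mapred API; the class name is illustrative and the
property name is as in 0.18/0.19):

// inside the driver's run() method
JobConf job = new JobConf(getConf(), StatsJob.class);  // StatsJob is illustrative
job.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class);
job.set("mapred.line.input.format.linespermap", "1");  // one document path per map task
job.setNumReduceTasks(0);                              // map-only job
job.setOutputKeyClass(org.apache.hadoop.io.Text.class);
job.setOutputValueClass(org.apache.hadoop.io.NullWritable.class);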

Cheers,
Tom

On Wed, May 20, 2009 at 10:21 AM, John Clarke clarke...@gmail.com wrote:
 Hi,

 I'm having some trouble implementing what I want to achieve... essentially I
 have a large input list of documents that I want to get statistics on. For
 each document I have 12 different stats to work out.

 So my input file is a text file with one document filepath on each line. The
 documents are stored on a remote server. I want to fetch each document and
 calculate certain stats from it.

 My problem is with the output.

 I want my output to be similar to this:

 docname1,stat1,stat2,stat3,stat12
 docname2,stat1,stat2,stat3,stat12
 docname3,stat1,stat2,stat3,stat12
 .
 .
 .
 docnameN,stat1,stat2,stat3,stat12

 I can fetch the document in my map code and perform my stats calculation on
 it but dont know how to return more than one value for a key, the key in
 this case being the document name.

 Cheers,
 John



Re: Linking against Hive in Hadoop development tree

2009-05-20 Thread Tom White
On Fri, May 15, 2009 at 11:06 PM, Owen O'Malley omal...@apache.org wrote:

 On May 15, 2009, at 2:05 PM, Aaron Kimball wrote:

 In either case, there's a dependency there.

 You need to split it so that there are no cycles in the dependency tree. In 
 the short term it looks like:

 avro:
 core: avro
 hdfs: core
 mapred: hdfs, core

Why does mapred depend on hdfs? MapReduce should only depend on the
FileSystem interface, shouldn't it?

Tom

 hive: mapred, core
 pig: mapred, core

 Adding a dependence from core to hive would be bad. To integrate with Hive, 
 you need to add a contrib module to Hive that adds it.

 -- Owen


Re: Shutdown in progress exception

2009-05-20 Thread Tom White
Looks like you are trying to copy a file to HDFS in a shutdown hook.
Since you can't control the order in which shutdown hooks run, this
won't work. There is a patch to allow Hadoop's FileSystem shutdown
hook to be disabled so it doesn't close filesystems on exit. See
https://issues.apache.org/jira/browse/HADOOP-4829.

Cheers,
Tom

On Tue, May 19, 2009 at 8:44 AM, Stas Oskin stas.os...@gmail.com wrote:
 Hi.

 Does anyone has any idea on this issue?

 Thanks!

 2009/5/17 Stas Oskin stas.os...@gmail.com

 Hi.

 I have an issue where my application, when shutting down (at ShutdownHook
 level), is unable to copy files to HDFS.

 Each copy throws the following exception:

 java.lang.IllegalStateException: Shutdown in progress
             at
 java.lang.ApplicationShutdownHooks.add(ApplicationShutdownHooks.java:39)
             at java.lang.Runtime.addShutdownHook(Runtime.java:192)
             at
 org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1353)
             at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:213)
             at
 org.apache.hadoop.fs.FileSystem.getLocal(FileSystem.java:189)
             at
 org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1185)
             at
 org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1161)
             at
 org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1133)
             at app.util.FileUtils.copyToCluster(FileUtils.java:392)


 I read some reports, including this one (
 https://issues.apache.org/jira/browse/HADOOP-3818), but there was no
 definite answer how to do it.

 Is there any good solution to this issue?

 Thanks in advance.




Re: Access to local filesystem working folder in map task

2009-05-19 Thread Tom White
Hi Chris,

The task-attempt local working folder is actually just the current
working directory of your map or reduce task. You should be able to
pass your legacy command line exe and other files using the -files
option (assuming you are using the Java interface to write your job,
and you are implementing Tool; streaming also supports the -files
option) and they will appear in the local working folder. You
shouldn't have to use the DistributedCache class directly at all.
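
So the invocation would look something like this (names purely illustrative):

  bin/hadoop jar myjob.jar MyDriver \
      -files legacy.exe,config1.dat,config2.dat \
      input output

and legacy.exe and the other listed files will show up in each task's working
directory.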

Cheers,
Tom

On Tue, May 19, 2009 at 2:21 PM, Chris Carman kri...@redlab.ee wrote:
 hi users,

 I have started writing my first project on Hadoop and am now seeking some
 guidance from more experienced members.

 The project is about running some CPU intensive computations in parallel and
 should be a straightforward application for MapReduce, as the input dataset
 can easily be partitioned to independent jobs and the final aggregation is a
 low cost step. The application, however, relies on a legacy command line exe
 file (which runs OK under wine). It reads about 10 small files (5mb) from its
 working folder and produces another 10 as a result.

 I can easily send those files and the app to all nodes via DistributedCache so
 that they get stored read-only to the local file system. I now need to get a
 local working folder for the task-attempt, where I could copy or symlink the
 relevant inputs, execute the legacy exe, and read off the output. As I
 understand, the task is returning an HDFS location when I ask for
 FileOutputFormat.getWorkOutputPath(job);

 I read from docs that there should be task-attempt local working folder, but I
 struggle to find a way to get the filesystem path to it, so that I could copy
 files and pass it in to my app for local processing.

 Tell me it's an easy one that I've missed.

 Many Thanks,
 Chris



Re: Is there any performance issue with Jrockit JVM for Hadoop

2009-05-18 Thread Tom White
On Mon, May 18, 2009 at 11:44 AM, Steve Loughran ste...@apache.org wrote:
 Grace wrote:

 To follow up this question, I have also asked help on Jrockit forum. They
 kindly offered some useful and detailed suggestions according to the JRA
 results. After updating the option list, the performance did become better
 to some extend. But it is still not comparable with the Sun JVM. Maybe, it
 is due to the use case with short duration and different implementation in
 JVM layer between Sun and Jrockit. I would like to be back to use Sun JVM
 currently. Thanks all for your time and help.


 what about flipping the switch that says run tasks in the TT's own JVM?.
 That should handle startup costs, and reduce the memory footprint


The property mapred.job.reuse.jvm.num.tasks allows you to set how many
tasks the JVM may be reused for (within a job), but it always runs in
a separate JVM to the tasktracker. (BTW
https://issues.apache.org/jira/browse/HADOOP-3675 has some discussion
about running tasks in the tasktracker's JVM).

Tom


Re: public IP for datanode on EC2

2009-05-14 Thread Tom White
Hi Joydeep,

The problem you are hitting may be because port 50001 isn't open,
whereas from within the cluster any node may talk to any other node
(because the security groups are set up to do this).

However I'm not sure this is a good approach. Configuring Hadoop to
use public IP addresses everywhere should work, but you have to pay
for all data transfer between nodes (see http://aws.amazon.com/ec2/,
Public and Elastic IP Data Transfer). This is going to get expensive
fast!

So to get this to work well, we would have to make changes to Hadoop
so it was aware of both public and private addresses, and use the
appropriate one: clients would use the public address, while daemons
would use the private address. I haven't looked at what it would take
to do this or how invasive it would be.

Cheers,
Tom

On Thu, May 14, 2009 at 1:37 PM, Joydeep Sen Sarma jssa...@facebook.com wrote:
 I changed the ec2 scripts to have fs.default.name assigned to the public 
 hostname (instead of the private hostname).

 Now I can submit jobs remotely via the socks proxy (the problem below is 
 resolved) - but the map tasks fail with an exception:


 2009-05-14 07:30:34,913 INFO org.apache.hadoop.ipc.Client: Retrying connect 
 to server: ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001. 
 Already tried 9 time(s).
 2009-05-14 07:30:34,914 WARN org.apache.hadoop.mapred.TaskTracker: Error 
 running child
 java.io.IOException: Call to 
 ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001 failed on 
 local exception: Connection refused
        at org.apache.hadoop.ipc.Client.call(Client.java:699)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at $Proxy1.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
        at 
 org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
        at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:177)
        at 
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:74)
        at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
        at org.apache.hadoop.mapred.Child.main(Child.java:153)


 strangely enough - job submissions from nodes within the ec2 cluster work 
 just fine. I looked at the job.xml files of jobs submitted locally and 
 remotely and don't see any relevant differences.

 Totally foxed now.

 Joydeep

 -Original Message-
 From: Joydeep Sen Sarma [mailto:jssa...@facebook.com]
 Sent: Wednesday, May 13, 2009 9:38 PM
 To: core-user@hadoop.apache.org
 Cc: Tom White
 Subject: RE: public IP for datanode on EC2

 Thanks Philip. Very helpful (and great blog post)! This seems to make basic 
 dfs command line operations work just fine.

 However - I am hitting a new error during job submission (running 
 hadoop-0.19.0):

 2009-05-14 00:15:34,430 ERROR exec.ExecDriver 
 (SessionState.java:printError(279)) - Job Submission failed with exception 
 'java.net.UnknownHostException(unknown host: 
 domU-12-31-39-00-51-94.compute-1.internal)'
 java.net.UnknownHostException: unknown host: 
 domU-12-31-39-00-51-94.compute-1.internal
        at org.apache.hadoop.ipc.Client$Connection.init(Client.java:195)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:791)
        at org.apache.hadoop.ipc.Client.call(Client.java:686)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at $Proxy0.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:348)
        at 
 org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
        at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:176)
        at 
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:75)
        at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
        at org.apache.hadoop.mapred.JobClient.getFs(JobClient.java:469)
        at 
 org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:603)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:788)


 looking at the stack trace and the code - it seems that this is happening 
 because the jobclient asks for the mapred system directory from the 
 jobtracker - which replies back with a path name that's qualified against the 
 fs.default.name setting of the jobtracker. Unfortunately the standard

Re: public IP for datanode on EC2

2009-05-14 Thread Tom White
Yes, you're absolutely right.

Tom

On Thu, May 14, 2009 at 2:19 PM, Joydeep Sen Sarma jssa...@facebook.com wrote:
 The ec2 documentation points to the use of public 'ip' addresses - whereas 
 using public 'hostnames' seems safe since it resolves to internal addresses 
 from within the cluster (and resolve to public ip addresses from outside).

 The only data transfer that I would incur while submitting jobs from outside 
 is the cost of copying the jar files and any other files meant for the 
 distributed cache). That would be extremely small.


 -Original Message-
 From: Tom White [mailto:t...@cloudera.com]
 Sent: Thursday, May 14, 2009 5:58 AM
 To: core-user@hadoop.apache.org
 Subject: Re: public IP for datanode on EC2

 Hi Joydeep,

 The problem you are hitting may be because port 50001 isn't open,
 whereas from within the cluster any node may talk to any other node
 (because the security groups are set up to do this).

 However I'm not sure this is a good approach. Configuring Hadoop to
 use public IP addresses everywhere should work, but you have to pay
 for all data transfer between nodes (see http://aws.amazon.com/ec2/,
 Public and Elastic IP Data Transfer). This is going to get expensive
 fast!

 So to get this to work well, we would have to make changes to Hadoop
 so it was aware of both public and private addresses, and use the
 appropriate one: clients would use the public address, while daemons
 would use the private address. I haven't looked at what it would take
 to do this or how invasive it would be.

 Cheers,
 Tom

 On Thu, May 14, 2009 at 1:37 PM, Joydeep Sen Sarma jssa...@facebook.com 
 wrote:
 I changed the ec2 scripts to have fs.default.name assigned to the public 
 hostname (instead of the private hostname).

 Now I can submit jobs remotely via the socks proxy (the problem below is 
 resolved) - but the map tasks fail with an exception:


 2009-05-14 07:30:34,913 INFO org.apache.hadoop.ipc.Client: Retrying connect 
 to server: ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001. 
 Already tried 9 time(s).
 2009-05-14 07:30:34,914 WARN org.apache.hadoop.mapred.TaskTracker: Error 
 running child
 java.io.IOException: Call to 
 ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001 failed on 
 local exception: Connection refused
        at org.apache.hadoop.ipc.Client.call(Client.java:699)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at $Proxy1.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
        at 
 org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
        at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:177)
        at 
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:74)
        at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
        at org.apache.hadoop.mapred.Child.main(Child.java:153)


 strangely enough - job submissions from nodes within the ec2 cluster work 
 just fine. I looked at the job.xml files of jobs submitted locally and 
 remotely and don't see any relevant differences.

 Totally foxed now.

 Joydeep

 -Original Message-
 From: Joydeep Sen Sarma [mailto:jssa...@facebook.com]
 Sent: Wednesday, May 13, 2009 9:38 PM
 To: core-user@hadoop.apache.org
 Cc: Tom White
 Subject: RE: public IP for datanode on EC2

 Thanks Philip. Very helpful (and great blog post)! This seems to make basic 
 dfs command line operations work just fine.

 However - I am hitting a new error during job submission (running 
 hadoop-0.19.0):

 2009-05-14 00:15:34,430 ERROR exec.ExecDriver 
 (SessionState.java:printError(279)) - Job Submission failed with exception 
 'java.net.UnknownHostException(unknown host: 
 domU-12-31-39-00-51-94.compute-1.internal)'
 java.net.UnknownHostException: unknown host: 
 domU-12-31-39-00-51-94.compute-1.internal
        at org.apache.hadoop.ipc.Client$Connection.init(Client.java:195)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:791)
        at org.apache.hadoop.ipc.Client.call(Client.java:686)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at $Proxy0.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:348)
        at 
 org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
        at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:176)
        at 
 org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:75)
        at 
 org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
        at org.apache.hadoop.fs.FileSystem.access$200

Re: HDFS to S3 copy problems

2009-05-12 Thread Tom White
Ian - Thanks for the detailed analysis. It was these issues that led
me to create a temporary file in NativeS3FileSystem in the first
place. I think we can get NativeS3FileSystem to report progress
though, see https://issues.apache.org/jira/browse/HADOOP-5814.

Ken - I can't see why you would be getting that error. Does it work
with hadoop fs, but not hadoop distcp?

Cheers,
Tom

On Sat, May 9, 2009 at 6:48 AM, Nowland, Ian nowl...@amazon.com wrote:
 Hi Tom,

 Not creating a temp file is ideal, as it saves you from having to waste local 
 hard disk by writing an output file just before uploading the same data to 
 Amazon S3. There are a few problems though:

 1) Amazon S3 PUTs need the file length up front. You could use a chunked 
 POST, but then you have the disadvantage of having to Base64 encode all your 
 data, increasing bandwidth usage, and also you still have the next problems;

 2) You would still want to have MD5 checking. In Amazon S3 both PUT and POST 
 require the MD5 to be supplied before the contents. To work around this then 
 you would have to upload the object without MD5, then check its metadata to 
 make sure the MD5 is correct, then delete it if it is not. This is all 
 possible, but would be difficult to make bulletproof, whereas in the current 
 version, if the MD5 is different the PUT fails atomically and you can easily 
 just retry.

 3) Finally, you would have to be careful in reducers that output only very 
 rarely. If there is too big a gap between data being uploaded through the 
 socket, then S3 may determine the connection has timed out, closing the 
 connection and meaning your task has to rerun (perhaps just to hit the same 
 problem again).

 All of this means that the current solution may be best for now as far as 
 general upload. The best I think we can do is fix the fact that the task is 
 not progressed in close(). The best way I can see to do this is introducing a 
 new interface say called ExtendedClosable which defines a close(Progressable 
 p) method. Then, have the various clients of FileSystem output streams (e.g. 
 Distcp, TextOutputFormat) test if their DataOutputStream supports the 
 interface, and if so call this in preference to the default. In the case of 
 NativeS3FileSystem then, this method spins up a thread to keep the 
 Progressable updated as the upload progresses.

 As an additional optimization to Distcp, where the source file already exists
 we could have some extended interface, say ExtendedWriteFileSystem, that has a
 create() method taking the MD5 and the file size, then test for this
 interface in the Distcp mapper and call the extended method. The trade-off here
 is that the checksum HDFS stores is not the MD5 needed by S3, so two (perhaps
 distributed) reads would be needed: the comparison is those two distributed
 reads versus a distributed read plus a local write and a local read.

 What do you think?

 Cheers,
 Ian Nowland
 Amazon.com

 -Original Message-
 From: Tom White [mailto:t...@cloudera.com]
 Sent: Friday, May 08, 2009 1:36 AM
 To: core-user@hadoop.apache.org
 Subject: Re: HDFS to S3 copy problems

 Perhaps we should revisit the implementation of NativeS3FileSystem so
 that it doesn't always buffer the file on the client. We could have an
 option to make it write directly to S3. Thoughts?

 Regarding the problem with HADOOP-3733, you can work around it by
 setting fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey in your
 hadoop-site.xml.

 Cheers,
 Tom

 On Fri, May 8, 2009 at 1:17 AM, Andrew Hitchcock adpow...@gmail.com wrote:
 Hi Ken,

 S3N doesn't work that well with large files. When uploading a file to
 S3, S3N saves it to local disk during write() and then uploads to S3
 during the close(). Close can take a long time for large files and it
 doesn't report progress, so the call can time out.

 As a work around, I'd recommend either increasing the timeout or
 uploading the files by hand. Since you only have a few large files,
 you might want to copy the files to local disk and then use something
 like s3cmd to upload them to S3.

 Regards,
 Andrew

 On Thu, May 7, 2009 at 4:42 PM, Ken Krugler kkrugler_li...@transpac.com 
 wrote:
 Hi all,

 I have a few large files (4 that are 1.8GB+) I'm trying to copy from HDFS to
 S3. My micro EC2 cluster is running Hadoop 0.19.1, and has one master/two
 slaves.

 I first tried using the hadoop fs -cp command, as in:

 hadoop fs -cp output/dir/ s3n://bucket/dir/

 This seemed to be working, as I could watch the network traffic spike, and
 temp files were being created in S3 (as seen with CyberDuck).

 But then it seemed to hang. Nothing happened for 30 minutes, so I killed the
 command.

 Then I tried using the hadoop distcp command, as in:

 hadoop distcp hdfs://host:50001/path/dir/ s3://<public key>:<private
 key>@bucket/dir2/

 This failed, because my secret key has a '/' in it
 (http://issues.apache.org/jira/browse/HADOOP-3733)

 Then I tried using hadoop distcp

Re: Mixing s3, s3n and hdfs

2009-05-08 Thread Tom White
Hi Kevin,

The s3n filesystem treats each file as a single block, however you may
be able to split files by setting the number of mappers appropriately
(or setting mapred.max.split.size in the new MapReduce API in 0.20.0).
S3 supports range requests, and the s3n implementation uses them, so
it wouldn't try to download the entire file for each split.
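
For example, something along these lines (untested; the 64MB figure and class
name are just illustrative) caps the split size so a large S3 object is divided
among several mappers:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SplitSizeExample {
  public static Job createJob() throws Exception {
    Configuration conf = new Configuration();
    // Cap each split at 64MB (illustrative value) so a large s3n object
    // is divided among several mappers rather than handled by one.
    conf.setLong("mapred.max.split.size", 64L * 1024 * 1024);
    return new Job(conf, "process s3n input");
  }
}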

You don't need to run a namenode for S3 filesystems, it is only needed
for HDFS. So it is feasible to run S3 and HDFS in parallel, copying
data from one to the other.

Cheers,
Tom

On Fri, May 8, 2009 at 8:55 AM, Kevin Peterson kpeter...@biz360.com wrote:
 Currently, we are running our cluster in EC2 with HDFS stored on the local
 (i.e. transient) disk. We don't want to deal with EBS, because it
 complicates being able to spin up additional slaves as needed. We're looking
 at moving to a combination of s3 (block) or s3n for data that we care about,
 and leaving lower value data that we can recreate on HDFS.

 My thinking is that s3n has significant advantages in terms of how easy it
 is to import data from non-Hadoop processes, and also the ease of sampling
 data, but I'm not sure how well it actually works. I'm guessing that it
 wouldn't be able to split files, or maybe it would need to download the
 entire file from S3 multiple times to split it? Is the issue with writes
 buffering the entire file on the local machine significant? Our jobs tend to
 be more CPU intensive than the usual kind of log processing type jobs, so we
 usually end up with smaller files.

 Is it feasible to run s3 (block) and hdfs in parallel? Would I need two
 namenodes to do this? Is this a good idea?

 Has anyone tried either of these configurations in EC2?



Re: About Hadoop optimizations

2009-05-07 Thread Tom White
On Thu, May 7, 2009 at 6:05 AM, Foss User foss...@gmail.com wrote:
 Thanks for your response again. I could not understand a few things in
 your reply. So, I want to clarify them. Please find my questions
 inline.

 On Thu, May 7, 2009 at 2:28 AM, Todd Lipcon t...@cloudera.com wrote:
 On Wed, May 6, 2009 at 1:46 PM, Foss User foss...@gmail.com wrote:
 2. Is the meta data for file blocks on data node kept in the
 underlying OS's file system on namenode or is it kept in RAM of the
 name node?


 The block locations are kept in the RAM of the name node, and are updated
 whenever a Datanode does a block report. This is why the namenode is in
 safe mode at startup until it has received block locations for some
 configurable percentage of blocks from the datanodes.


 What is safe mode in namenode? This concept is new to me. Could you
 please explain this?

Safe mode is described here:
http://hadoop.apache.org/core/docs/r0.20.0/hdfs_design.html#Safemode




 3. If no more mapper functions can be run on the node that
 contains the data on which the mapper has to act on, is Hadoop
 intelligent enough to run the new mappers on some machines within the
 same rack?


 Yes, assuming you have configured a network topology script. Otherwise,
 Hadoop has no magical knowledge of your network infrastructure, and it
 treats the whole cluster as a single rack called /default-rack


 Is it a network topology script or is it a Java plugin code? AFAIK, we
 need to write an implementation of
 org.apache.hadoop.net.DNSToSwitchMapping interface. Can we write it as
 a script or configuration file and avoid Java coding to achieve this?
 If so, how?


To tell Hadoop about your network topology you can either write a Java
implementation of org.apache.hadoop.net.DNSToSwitchMapping or you can
write a script in another language. There are more details at
http://hadoop.apache.org/core/docs/r0.20.0/cluster_setup.html#Hadoop+Rack+Awareness
and a sample script at
http://www.nabble.com/Hadoop-topology.script.file.name-Form-td17683521.html
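
As a rough illustration, a Java mapping can be as simple as the sketch below
(the host naming convention is invented; adapt it to your own network):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.net.DNSToSwitchMapping;

public class SimpleRackMapping implements DNSToSwitchMapping {
  // Map each host name to a rack path; anything unrecognised falls back
  // to the single default rack.
  public List<String> resolve(List<String> names) {
    List<String> racks = new ArrayList<String>(names.size());
    for (String name : names) {
      if (name.startsWith("rack1-")) {
        racks.add("/rack1");
      } else if (name.startsWith("rack2-")) {
        racks.add("/rack2");
      } else {
        racks.add("/default-rack");
      }
    }
    return racks;
  }
}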


Re: move tasks to another machine on the fly

2009-05-06 Thread Tom White
Hi David,

The MapReduce framework will attempt to rerun failed tasks
automatically. However, if a task is running out of memory on one
machine, it's likely to run out of memory on another, isn't it? Have a
look at the mapred.child.java.opts configuration property for the
amount of memory that each task VM is given (200MB by default). You
can also control the memory that each daemon gets using the
HADOOP_HEAPSIZE variable in hadoop-env.sh. Or you can specify it on a
per-daemon basis using the HADOOP_DAEMON_NAME_OPTS variables in the
same file.
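
For example, to give each task JVM more headroom you could do something like
this in the job driver (the 512MB value is just an example):

import org.apache.hadoop.mapred.JobConf;

public class MemoryConfigExample {
  public static JobConf createConf() {
    JobConf conf = new JobConf();
    // Each task JVM gets 512MB instead of the 200MB default
    conf.set("mapred.child.java.opts", "-Xmx512m");
    return conf;
  }
}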

Tom

On Wed, May 6, 2009 at 1:28 AM, David Batista dsbati...@gmail.com wrote:
 I get this error when running Reduce tasks on a machine:

 java.lang.OutOfMemoryError: unable to create new native thread
        at java.lang.Thread.start0(Native Method)
        at java.lang.Thread.start(Thread.java:597)
        at 
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.<init>(DFSClient.java:2591)
        at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:454)
        at 
 org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:190)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487)
        at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:387)
        at 
 org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:117)
        at 
 org.apache.hadoop.mapred.lib.MultipleTextOutputFormat.getBaseRecordWriter(MultipleTextOutputFormat.java:44)
        at 
 org.apache.hadoop.mapred.lib.MultipleOutputFormat$1.write(MultipleOutputFormat.java:99)
        at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:410)

 is it possible to move a reduce task to other machine in the cluster on the 
 fly?

 --
 ./david



Re: Using multiple FileSystems in hadoop input

2009-05-06 Thread Tom White
Hi Ivan,

I haven't tried this combination, but I think it should work. If it
doesn't it should be treated as a bug.
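
For example, something like this should be accepted (the paths are made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class MixedInputExample {
  public static void addInputs(JobConf conf) {
    // One path inside a Hadoop archive, one on plain HDFS
    FileInputFormat.addInputPath(conf, new Path("har:///user/ivan/logs.har/2009/05"));
    FileInputFormat.addInputPath(conf, new Path("hdfs://namenode:9000/user/ivan/new-logs"));
  }
}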

Tom

On Wed, May 6, 2009 at 11:46 AM, Ivan Balashov ibalas...@iponweb.net wrote:
 Greetings to all,

 Could anyone suggest if Paths from different FileSystems can be used as
 input of Hadoop job?

 Particularly I'd like to find out whether Paths from HarFileSystem can be
 mixed with ones from DistributedFileSystem.

 Thanks,


 --
 Kind regards,
 Ivan



Re: multi-line records and file splits

2009-05-06 Thread Tom White
Hi Rajarshi,

FileInputFormat (SDFInputFormat's superclass) will break files into
splits, typically on HDFS block boundaries (if the defaults are left
unchanged). This is not a problem for your code however, since it will
read every record that starts within a split (even if it crosses a
split boundary). This is just like how TextInputFormat works. So you
don't need to use MultiFileInputFormat - it should work as is. You
could demonstrate this to yourself by writing a multi-block file, and
doing an identity MapReduce on it. You should find that no records are
lost.

You might be able to use
org.apache.hadoop.streaming.StreamXmlRecordReader (and
StreamInputFormat), which does something similar. Despite its name it
is not only for Streaming applications, and it isn't restricted to
XML. It can parse records that begin with a certain sequence of
characters, and end with another sequence.
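
If I remember the configuration keys correctly, wiring it up looks roughly like
the sketch below (the begin/end markers are placeholders for your own record
delimiters, the streaming jar needs to be on the job classpath, and you should
double-check the property names against your version):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.streaming.StreamInputFormat;

public class RecordMarkerExample {
  public static void configure(JobConf conf) {
    conf.setInputFormat(StreamInputFormat.class);
    // Use the record reader that scans for begin/end character sequences
    conf.set("stream.recordreader.class",
        "org.apache.hadoop.streaming.StreamXmlRecordReader");
    conf.set("stream.recordreader.begin", "RECORD_START");  // placeholder marker
    conf.set("stream.recordreader.end", "RECORD_END");      // placeholder marker
  }
}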

Cheers,
Tom

On Wed, May 6, 2009 at 2:06 AM, Nick Cen cenyo...@gmail.com wrote:
 I think your SDFInputFormat should implement the MultiFileInputFormat
 instead of the TextInputFormat, which will not split the file into chunks.

 2009/5/6 Rajarshi Guha rg...@indiana.edu

 Hi, I have implemented a subclass of RecordReader to handle a plain text
 file format where a record is multi-line and of variable length.
 Schematically each record is of the form

 some_title
 foo
 bar
 
 another_title
 foo
 foo
 bar
 

 where  is the marker for the end of the record. My code is at
 http://blog.rguha.net/?p=293 and it seems to work fine on my input data.

 However, I realized that when I run the program, Hadoop will 'chunk' the
 input file. As a result, the SDFRecordReader might get a chunk of input
 text, such that the last record is actually incomplete (a missing ). Is
 this correct?

 If so, how would the RecordReader implementation recover from this
 situation? Or is there a way to indicate to Hadoop that the input file
 should be chunked keeping in mind end of record delimiters?

 Thanks

 ---
 Rajarshi Guha  rg...@indiana.edu
 GPG Fingerprint: D070 5427 CC5B 7938 929C  DD13 66A1 922C 51E7 9E84
 ---
 Q:  What's polite and works for the phone company?
 A:  A deferential operator.





 --
 http://daily.appspot.com/food/



Re: Specifying System Properties in the had

2009-04-30 Thread Tom White
Another way to do this would be to set a property in the Hadoop config itself.

In the job launcher you would have something like:

JobConf conf = ...
conf.set("foo", "test");

Then you can read the property in your map or reduce task.
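
For instance, a mapper could pick the value up in configure(), along these
lines (the class and output types here are invented):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FooMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private String foo;

  public void configure(JobConf job) {
    foo = job.get("foo", "default");  // value set by the launcher, with a fallback
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    output.collect(new Text(foo), value);
  }
}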

Tom

On Thu, Apr 30, 2009 at 3:25 PM, Aaron Kimball aa...@cloudera.com wrote:
 So you want a different -Dfoo=test on each node? It's probably grabbing
 the setting from the node where the job was submitted, and this overrides
 the settings on each task node.

 Try adding <final>true</final> to the property block on the tasktrackers,
 then restart Hadoop and try again. This will prevent the job from overriding
 the setting.

 - Aaron

 On Thu, Apr 30, 2009 at 9:25 AM, Marc Limotte mlimo...@feeva.com wrote:

 I'm trying to set a System Property in the Hadoop config, so my jobs will
 know which cluster they are running on.  I think I should be able to do this
 with -Dname=value in mapred.child.java.opts (example below), but the
 setting is ignored.
 In hadoop-site.xml I have:
 <property>
 <name>mapred.child.java.opts</name>
 <value>-Xmx200m -Dfoo=test</value>
 </property>
 But the job conf through the web server indicates:
 mapred.child.java.opts -Xmx1024M -Duser.timezone=UTC

 I'm using Hadoop-0.17.2.1.
 Any tips on why my setting is not picked up?

 Marc

 




Re: Patching and building produces no librecordio or libhdfs

2009-04-28 Thread Tom White
Have a look at the instructions on
http://wiki.apache.org/hadoop/HowToRelease under the Building
section. It tells you which environment settings and Ant targets you
need to set.

Tom

On Tue, Apr 28, 2009 at 9:09 AM, Sid123 itis...@gmail.com wrote:

 Hi, I have applied a small patch for version 0.20 to my old 0.19.1...
 After I ran the ant tar target I found 3 directories (libhdfs, librecordio and c++)
 were missing from the tarred build. Where do you get those from?
 I can't really use 0.20 because of massive library changes... So if someone
 can help me out here... Greatly appreciated.
 Here is the patch
 https://issues.apache.org/jira/secure/attachment/12397007/patch-4963.txt i
 applied
 --
 View this message in context: 
 http://www.nabble.com/Patching-and-bulding-produces-no-libcordio-or-libhdfs-tp23272263p23272263.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: How to run many jobs at the same time?

2009-04-21 Thread Tom White
You need to start each JobControl in its own thread so they can run
concurrently. Something like:

Thread t = new Thread(jobControl);
t.start();

Then poll the jobControl.allFinished() method.
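
A rough sketch of the whole pattern, assuming jc1 and jc2 have already had
their jobs added with addJob():

import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class ParallelGroups {
  public static void main(String[] args) throws InterruptedException {
    JobControl jc1 = new JobControl("group1");   // add group1 jobs with jc1.addJob(...)
    JobControl jc2 = new JobControl("group2");   // add group2 jobs with jc2.addJob(...)

    // Run each controller in its own thread so the groups proceed in parallel
    Thread t1 = new Thread(jc1);
    Thread t2 = new Thread(jc2);
    t1.start();
    t2.start();

    // Poll until both controllers report that all their jobs have finished
    while (!jc1.allFinished() || !jc2.allFinished()) {
      Thread.sleep(5000);
    }
    jc1.stop();
    jc2.stop();
  }
}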

Tom

On Tue, Apr 21, 2009 at 10:02 AM, nguyenhuynh.mr
nguyenhuynh...@gmail.com wrote:
 Hi all!


 I have some jobs: job1, job2, job3, ... . Each job works with a
 group. To control the jobs, I have JobControllers; each JobController
 controls the jobs of its specified group.


 Example:

 - Have 2 Group: g1 and g2

 - 2 JobController: jController1, jcontroller2

  + jController1 contains jobs: job1, job2, job3, ...

  + jController2 contains jobs: job1, job2, job3, ...


 * To run the jobs, I use:

 for (i = 0; i < 2; i++) {

    jCtrl[i]= new jController(group i);

    jCtrl[i].run();

 }


 * I want jController1 and jController2 to run in parallel. But actually, when
 jController1 finishes, jController2 begins to run.


 Why?

 Please help me!


 * P/s: jController use org.apache.hadoop.mapred.jobcontrol.JobControl


 Thanks,


 cheer,

 Nguyen.




Re: Interesting Hadoop/FUSE-DFS access patterns

2009-04-16 Thread Tom White
Not sure if will affect your findings, but when you read from a
FSDataInputStream you should see how many bytes were actually read by
inspecting the return value and re-read if it was fewer than you want.
See Hadoop's IOUtils readFully() method.
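
For example, a small helper that keeps re-issuing positioned reads until the
requested number of bytes has arrived (or EOF is hit) might look like this:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

public class ReadUtil {
  /** Keep issuing positioned reads until 'length' bytes have arrived or EOF. */
  public static int readFully(FSDataInputStream in, long pos, byte[] buf, int length)
      throws IOException {
    int read = 0;
    while (read < length) {
      int n = in.read(pos + read, buf, read, length - read);
      if (n < 0) {
        break;  // hit end of file before filling the buffer
      }
      read += n;
    }
    return read;
  }
}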

Tom

On Mon, Apr 13, 2009 at 4:22 PM, Brian Bockelman bbock...@cse.unl.edu wrote:

 Hey Todd,

 Been playing more this morning after thinking about it for the night -- I
 think the culprit is not the network, but actually the cache.  Here's the
 output of your script adjusted to do the same calls as I was doing (you had
 left out the random I/O part).

 [br...@red tmp]$ java hdfs_tester
 Mean value for reads of size 0: 0.0447
 Mean value for reads of size 16384: 10.4872
 Mean value for reads of size 32768: 10.82925
 Mean value for reads of size 49152: 6.2417
 Mean value for reads of size 65536: 7.0511003
 Mean value for reads of size 81920: 9.411599
 Mean value for reads of size 98304: 9.378799
 Mean value for reads of size 114688: 8.99065
 Mean value for reads of size 131072: 5.1378503
 Mean value for reads of size 147456: 6.1324
 Mean value for reads of size 163840: 17.1187
 Mean value for reads of size 180224: 6.5492
 Mean value for reads of size 196608: 8.45695
 Mean value for reads of size 212992: 7.4292
 Mean value for reads of size 229376: 10.7843
 Mean value for reads of size 245760: 9.29095
 Mean value for reads of size 262144: 6.57865

 Copy of the script below.

 So, without the FUSE layer, we don't see much (if any) patterns here.  The
 overhead of randomly skipping through the file is higher than the overhead
 of reading out the data.

 Upon further inspection, the biggest factor affecting the FUSE layer is
 actually the Linux VFS caching -- if you notice, the bandwidth in the given
 graph for larger read sizes is *higher* than 1Gbps, which is the limit of
 the network on that particular node.  If I go in the opposite direction -
 starting with the largest reads first, then going down to the smallest
 reads, the graph entirely smooths out for the small values - everything is
 read from the filesystem cache in the client RAM.  Graph attached.

 So, on the upside, mounting through FUSE gives us the opportunity to speed
 up reads for very complex, non-sequential patterns - for free, thanks to the
 hardworking Linux kernel.  On the downside, it's incredibly difficult to
 come up with simple cases to demonstrate performance for an application --
 the cache performance and size depends on how much activity there's on the
 client, the previous file system activity that the application did, and the
 amount of concurrent activity on the server.  I can give you results for
 performance, but it's not going to be the performance you see in real life.
  (Gee, if only file systems were easy...)

 Ok, sorry for the list noise -- it seems I'm going to have to think more
 about this problem before I can come up with something coherent.

 Brian





 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.FileStatus;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.fs.FSDataInputStream;
 import org.apache.hadoop.conf.Configuration;
 import java.io.IOException;
 import java.net.URI;
 import java.util.Random;

 public class hdfs_tester {
  public static void main(String[] args) throws Exception {
    URI uri = new URI("hdfs://hadoop-name:9000/");
   FileSystem fs = FileSystem.get(uri, new Configuration());
   Path path = new
  Path("/user/uscms01/pnfs/unl.edu/data4/cms/store/phedex_monarctest/Nebraska/LoadTest07_Nebraska_33");
   FSDataInputStream dis = fs.open(path);
   Random rand = new Random();
   FileStatus status = fs.getFileStatus(path);
   long file_len = status.getLen();
   int iters = 20;
    for (int size = 0; size < 1024*1024; size += 4*4096) {
     long csum = 0;
      for (int i = 0; i < iters; i++) {
       int pos = rand.nextInt((int)((file_len-size-1)/8))*8;
       byte buf[] = new byte[size];
        if (pos < 0)
         pos = 0;
       long st = System.nanoTime();
       dis.read(pos, buf, 0, size);
       long et = System.nanoTime();
       csum += et-st;
        //System.out.println(String.valueOf(size) + "\t" + String.valueOf(pos)
  + "\t" + String.valueOf(et - st));
     }
     float csum2 = csum; csum2 /= iters;
      System.out.println("Mean value for reads of size " + size + ": " +
 (csum2/1000/1000));
   }
   fs.close();
  }
 }


 On Apr 13, 2009, at 3:14 AM, Todd Lipcon wrote:

 On Mon, Apr 13, 2009 at 1:07 AM, Todd Lipcon t...@cloudera.com wrote:

 Hey Brian,

 This is really interesting stuff. I'm curious - have you tried these same
 experiments using the Java API? I'm wondering whether this is
 FUSE-specific
 or inherent to all HDFS reads. I'll try to reproduce this over here as
 well.

 This smells sort of nagle-related to me... if you get a chance, you may
 want to edit DFSClient.java and change TCP_WINDOW_SIZE to 256 * 1024, and
 see if the magic number jumps up to 256KB. If so, I think it should be a
 pretty easy bugfix.


 Oops - 

Re: Example of deploying jars through DistributedCache?

2009-04-08 Thread Tom White
Does it work if you use addArchiveToClassPath()?

Also, it may be more convenient to use GenericOptionsParser's -libjars option.

Tom

On Mon, Mar 2, 2009 at 7:42 AM, Aaron Kimball aa...@cloudera.com wrote:
 Hi all,

 I'm stumped as to how to use the distributed cache's classpath feature. I
 have a library of Java classes I'd like to distribute to jobs and use in my
 mapper; I figured the DCache's addFileToClassPath() method was the correct
 means, given the example at
 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html.


 I've boiled it down to the following non-working example:

 in TestDriver.java:


  private void runJob() throws IOException {
    JobConf conf = new JobConf(getConf(), TestDriver.class);

    // do standard job configuration.
     FileInputFormat.addInputPath(conf, new Path("input"));
     FileOutputFormat.setOutputPath(conf, new Path("output"));

    conf.setMapperClass(TestMapper.class);
    conf.setNumReduceTasks(0);

    // load aaronTest2.jar into the dcache; this contains the class
 ValueProvider
    FileSystem fs = FileSystem.get(conf);
     fs.copyFromLocalFile(new Path("aaronTest2.jar"), new
  Path("tmp/aaronTest2.jar"));
     DistributedCache.addFileToClassPath(new Path("tmp/aaronTest2.jar"),
  conf);

    // run the job.
    JobClient.runJob(conf);
  }


  and then in TestMapper:

  public void map(LongWritable key, Text value,
  OutputCollector<LongWritable, Text> output,
      Reporter reporter) throws IOException {

    try {
      ValueProvider vp = (ValueProvider)
  Class.forName("ValueProvider").newInstance();
      Text val = vp.getValue();
      output.collect(new LongWritable(1), val);
    } catch (ClassNotFoundException e) {
       throw new IOException("not found: " + e.toString()); // newInstance()
 throws to here.
    } catch (Exception e) {
       throw new IOException("Exception: " + e.toString());
    }
  }


 The class ValueProvider is to be loaded from aaronTest2.jar. I can verify
 that this code works if I put ValueProvider into the main jar I deploy. I
 can verify that aaronTest2.jar makes it into the
 ${mapred.local.dir}/taskTracker/archive/

 But when run with ValueProvider in aaronTest2.jar, the job fails with:

 $ bin/hadoop jar aaronTest1.jar TestDriver
 09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to process
 : 10
 09/03/01 22:36:03 INFO mapred.FileInputFormat: Total input paths to process
 : 10
 09/03/01 22:36:04 INFO mapred.JobClient: Running job: job_200903012210_0005
 09/03/01 22:36:05 INFO mapred.JobClient:  map 0% reduce 0%
 09/03/01 22:36:14 INFO mapred.JobClient: Task Id :
 attempt_200903012210_0005_m_00_0, Status : FAILED
 java.io.IOException: not found: java.lang.ClassNotFoundException:
 ValueProvider
    at TestMapper.map(Unknown Source)
    at TestMapper.map(Unknown Source)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
    at
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2207)


 Do I need to do something else (maybe in Mapper.configure()?) to actually
 classload the jar? The documentation makes me believe it should already be
 in the classpath by doing only what I've done above. I'm on Hadoop 0.18.3.

 Thanks,
 - Aaron



Re: RecordReader design heuristic

2009-03-18 Thread Tom White
Hi Josh,

The other aspect to think about when writing your own record reader is
input splits. As Jeff mentioned you really want mappers to be
processing about one HDFS block's worth of data. If your inputs are
significantly smaller, the overhead of creating mappers will be high
and your jobs will be inefficient. On the other hand, if your inputs
are significantly larger then you need to split them otherwise each
mapper will take a very long time processing each split. Some file
formats are inherently splittable, meaning you can re-align with
record boundaries from an arbitrary point in the file. Examples
include line-oriented text (split at newlines), and bzip2 (has a
unique block marker). If your format is splittable then you will be
able to take advantage of this to make MR processing more efficient.

Cheers,
Tom

On Wed, Mar 18, 2009 at 5:00 PM, Patterson, Josh jpatters...@tva.gov wrote:
 Jeff,
 Yeah, the mapper sitting on a dfs block is pretty cool.

 Also, yes, we are about to start crunching on a lot of energy smart grid
 data. TVA is sorta like Switzerland for smart grid power generation
 and transmission data across the nation. Right now we have about 12TB,
 and this is slated to be around 30TB by the end of next 2010 (possibly
 more, depending on how many more PMUs come online). I am very interested
 in Mahout and have read up on it, it has many algorithms that I am
 familiar with from grad school. I will be doing some very simple MR jobs
 early on like finding the average frequency for a range of data, and
 I've been selling various groups internally on what CAN be done with
 good data mining and tools like Hadoop/Mahout. Our production cluster
 won't be online for a few more weeks, but that part is already rolling so
 I've moved on to focus on designing the first jobs to find quality
 results/benefits that I can sell in order to campaign for more
 ambitious projects I have drawn up. I know time series data lends itself
 to many machine learning applications, so, yes, I would be very
 interested in talking to anyone who wants to talk or share notes on
 hadoop and machine learning. I believe Mahout can be a tremendous
 resource for us and definitely plan on running and contributing to it.

 Josh Patterson
 TVA

 -Original Message-
 From: Jeff Eastman [mailto:j...@windwardsolutions.com]
 Sent: Wednesday, March 18, 2009 12:02 PM
 To: core-user@hadoop.apache.org
 Subject: Re: RecordReader design heuristic

 Hi Josh,
 It seemed like you had a conceptual wire crossed and I'm glad to help
 out. The neat thing about Hadoop mappers is - since they are given a
 replicated HDFS block to munch on - the job scheduler has replication
 factor number of node choices where it can run each mapper. This means
 mappers are always reading from local storage.

 On another note, I notice you are processing what looks to be large
 quantities of vector data. If you have any interest in clustering this
 data you might want to look at the Mahout project
 (http://lucene.apache.org/mahout/). We have a number of Hadoop-ready
 clustering algorithms, including a new non-parametric Dirichlet Process
 Clustering implementation that I committed recently. We are pulling it
 all together for a 0.1 release and I would be very interested in helping

 you to apply these algorithms if you have an interest.

 Jeff


 Patterson, Josh wrote:
 Jeff,
 ok, that makes more sense, I was under the mis-impression that it was
 creating and destroying mappers for each input record. I dont know why I
 had that in my head. My design suddenly became a lot clearer, and this
 provides a much more clean abstraction. Thanks for your help!

 Josh Patterson
 TVA






Re: Problem with com.sun.pinkdots.LogHandler

2009-03-17 Thread Tom White
Hi Paul,

Looking at the stack trace, the exception is being thrown from your
map method. Can you put some debugging in there to diagnose it?
Detecting and logging the size of the array and the index you are
trying to access should help. You can write to standard error and look
in the task logs. Another way is to use Reporter's setStatus() method
as a quick way to see messages in the web UI.
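
As a sketch (I'm guessing at the delimiter, field count and output types, so
adjust to your own format), the map method could guard the array access and
surface the bad input like this:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class GuardedMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    String[] fields = value.toString().split("\t");  // assumed delimiter
    if (fields.length < 4) {
      // Log and skip instead of indexing past the end of the array
      System.err.println("Short record (" + fields.length + " fields): " + value);
      reporter.setStatus("skipped record with only " + fields.length + " fields");
      return;
    }
    output.collect(new Text(fields[3]), value);
  }
}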

Cheers,
Tom

On Mon, Mar 16, 2009 at 11:51 PM, psterk paul.st...@sun.com wrote:

 Hi,

 I have been running a hadoop cluster successfully for a few months.  During
 today's run, I am seeing a new error and it is not clear to me how to
 resolve it. Below are the stack traces and the configure file I am using.
 Please share any tips you may have.

 Thanks,
 Paul

 09/03/16 16:28:25 INFO mapred.JobClient: Task Id :
 task_200903161455_0003_m_000127_0, Status : FAILED
 java.lang.ArrayIndexOutOfBoundsException: 3
        at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:71)
        at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:22)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
        at
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

 task_200903161455_0003_m_000127_0: Starting
 null.task_200903161455_0003_m_000127_0
 task_200903161455_0003_m_000127_0: Closing
 task_200903161455_0003_m_000127_0: log4j:WARN No appenders could be found
 for logger (org.apache.hadoop.mapred.TaskRu
 task_200903161455_0003_m_000127_0: log4j:WARN Please initialize the log4j
 system properly.
 09/03/16 16:28:27 INFO mapred.JobClient: Task Id :
 task_200903161455_0003_m_000128_0, Status : FAILED
 java.lang.ArrayIndexOutOfBoundsException: 3
        at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:71)
        at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:22)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
        at
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

 task_200903161455_0003_m_000128_0: Starting
 null.task_200903161455_0003_m_000128_0
 task_200903161455_0003_m_000128_0: Closing
 09/03/16 16:28:32 INFO mapred.JobClient: Task Id :
 task_200903161455_0003_m_000128_1, Status : FAILED
 java.lang.ArrayIndexOutOfBoundsException: 3
        at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:71)
        at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:22)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
        at
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

 task_200903161455_0003_m_000128_1: Starting
 null.task_200903161455_0003_m_000128_1
 task_200903161455_0003_m_000128_1: Closing
 09/03/16 16:28:37 INFO mapred.JobClient: Task Id :
 task_200903161455_0003_m_000127_1, Status : FAILED
 java.lang.ArrayIndexOutOfBoundsException: 3
        at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:71)
        at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:22)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
        at
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

 task_200903161455_0003_m_000127_1: Starting
 null.task_200903161455_0003_m_000127_1
 task_200903161455_0003_m_000127_1: Closing
 task_200903161455_0003_m_000127_1: log4j:WARN No appenders could be found
 for logger (org.apache.hadoop.ipc.Client).
 task_200903161455_0003_m_000127_1: log4j:WARN Please initialize the log4j
 system properly.
 09/03/16 16:28:40 INFO mapred.JobClient: Task Id :
 task_200903161455_0003_m_000128_2, Status : FAILED
 java.lang.ArrayIndexOutOfBoundsException: 3
        at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:71)
        at com.sun.pinkdots.LogHandler$Mapper.map(LogHandler.java:22)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
        at
 org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)

 task_200903161455_0003_m_000128_2: Starting
 null.task_200903161455_0003_m_000128_2
 task_200903161455_0003_m_000128_2: Closing
 09/03/16 16:28:46 INFO mapred.JobClient:  map 100% reduce 100%
 java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1062)
        at com.sun.pinkdots.Main.handleLogs(Main.java:63)
        at com.sun.pinkdots.Main.main(Main.java:35)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at 

Re: contrib EC2 with hadoop 0.17

2009-03-05 Thread Tom White
I haven't used Eucalyptus, but you could start by trying out the
Hadoop EC2 scripts (http://wiki.apache.org/hadoop/AmazonEC2) with your
Eucalyptus installation.

Cheers,
Tom

On Tue, Mar 3, 2009 at 2:51 PM, falcon164 mujahid...@gmail.com wrote:

 I am new to hadoop. I want to run hadoop on eucalyptus. Please let me know
 how to do this.
 --
 View this message in context: 
 http://www.nabble.com/contrib-EC2-with-hadoop-0.17-tp17711758p22310068.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: Hadoop AMI for EC2

2009-03-05 Thread Tom White
Hi Richa,

Yes there is. Please see http://wiki.apache.org/hadoop/AmazonEC2.

Tom

On Thu, Mar 5, 2009 at 4:13 PM, Richa Khandelwal richa...@gmail.com wrote:
 Hi All,
 Is there an existing Hadoop AMI for EC2 which has Hadoop set up on it?

 Thanks,
 Richa Khandelwal


 University Of California,
 Santa Cruz.
 Ph:425-241-7763



Re: MapReduce jobs with expensive initialization

2009-03-02 Thread Tom White
On any particular tasktracker slot, task JVMs are shared only between
tasks of the same job. When the job is complete the task JVM will go
away. So there is certainly no sharing between jobs.

I believe the static singleton approach outlined by Scott will work
since the map classes are in a single classloader (but I haven't
actually tried this).

Cheers,
Tom

On Mon, Mar 2, 2009 at 1:39 AM, jason hadoop jason.had...@gmail.com wrote:
 If you have to you can reach through all of the class loaders and find the
 instance of your singleton class that has the data loaded. It is awkward,
 and
 I haven't done this in java since the late 90's. It did work the last time I
 did it.


 On Sun, Mar 1, 2009 at 11:21 AM, Scott Carey sc...@richrelevance.comwrote:

 You could create a singleton class and reference the dictionary stuff in
 that.  You would probably want this separate from other classes as to
 control exactly what data is held on to for a long time and what is not.

 class Singleton {

 private static final Singleton _instance = new Singleton();

 private Singleton() {
   // initialize here; only ever called once per classloader or JVM
 }

 public static Singleton getSingleton() {
 return _instance;
 }
 }

 in the mapper:

 Singleton dictionary = Singleton.getSingleton();

 This assumes that each mapper doesn't live in its own classloader space
 (which would make even static singletons not shareable), and has the
 drawback that once initialized, that memory associated with the singleton
 won't go away until the JVM or classloader that hosts it dies.

 I have not tried this myself, and do not know the exact classloader
 semantics used in the new 'persistent' task JVMs.  They could have a
 classloader per job, and dispose of those when the job is complete -- though
 then it is impossible to persist data across jobs but only within them.  Or
 there could be one permanent persisted classloader, or one per task.   All
 will behave differently with respect to statics like the above example.

 
 From: Stuart White [stuart.whi...@gmail.com]
 Sent: Saturday, February 28, 2009 6:06 AM
 To: core-user@hadoop.apache.org
 Subject: MapReduce jobs with expensive initialization

 I have a mapreduce job that requires expensive initialization (loading
 of some large dictionaries before processing).

 I want to avoid executing this initialization more than necessary.

 I understand that I need to call setNumTasksToExecutePerJvm to -1 to
 force mapreduce to reuse JVMs when executing tasks.

 How I've been performing my initialization is, in my mapper, I
 override MapReduceBase#configure, read my params from the JobConf, and
 load my dictionaries.

 It appears, from the tests I've run, that even though
 NumTasksToExecutePerJvm is set to -1, new instances of my Mapper class
 are being created for each task, and therefore I'm still re-running
 this expensive initialization for each task.

 So, my question is: how can I avoid re-executing this expensive
 initialization per-task?  Should I move my initialization code out of
 my mapper class and into my main class?  If so, how do I pass
 references to the loaded dictionaries from my main class to my mapper?

 Thanks!




Re: OutOfMemory error processing large amounts of gz files

2009-02-25 Thread Tom White
Do you experience the problem with and without native compression? Set
hadoop.native.lib to false to disable native compression.
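
If it's easier to test from the job driver than from the config file, a quick
sketch that flips the same switch looks like this:

import org.apache.hadoop.mapred.JobConf;

public class DisableNativeCompression {
  public static JobConf createConf() {
    JobConf conf = new JobConf();
    // Fall back to the pure-Java codecs instead of the native zlib bindings
    conf.setBoolean("hadoop.native.lib", false);
    return conf;
  }
}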

Cheers,
Tom

On Tue, Feb 24, 2009 at 9:40 PM, Gordon Mohr goj...@archive.org wrote:
 If you're doing a lot of gzip compression/decompression, you *might* be
 hitting this 6+-year-old Sun JVM bug:

 Instantiating Inflater/Deflater causes OutOfMemoryError; finalizers not
 called promptly enough
 http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4797189

 A workaround is listed in the issue: ensuring you call close() or end() on
 the Deflater; something similar might apply to Inflater.

 (This is one of those fun JVM situations where having more heap space may
 make OOMEs more likely: less heap memory pressure leaves more un-GCd or
 un-finalized heap objects around, each of which is holding a bit of native
 memory.)

 - Gordon @ IA

 bzheng wrote:

 I have about 24k gz files (about 550GB total) on HDFS and have a really
 simple
 java program to convert them into sequence files.  If the script's
 setInputPaths takes a Path[] of all 24k files, it will get a OutOfMemory
 error at about 35% map complete.  If I make the script process 2k files
 per
 job and run 12 jobs consecutively, then it goes through all files fine.
  The
 cluster I'm using has about 67 nodes.  Each nodes has 16GB memory, max 7
 map, and max 2 reduce.

 The map task is really simple, it takes LongWritable as key and Text as
 value, generate a Text newKey, and output.collect(Text newKey, Text
 value). It doesn't have any code that can possibly leak memory.

 There's no stack trace for the vast majority of the OutOfMemory error,
 there's just a single line in the log like this:

 2009-02-23 14:27:50,902 INFO org.apache.hadoop.mapred.TaskTracker:
 java.lang.OutOfMemoryError: Java heap space

 I can't find the stack trace right now, but rarely the OutOfMemory error
 originates from some hadoop config array copy operation.  There's no
 special
 config for the script.



Re: Reporter for Hadoop Streaming?

2009-02-11 Thread Tom White
You can retrieve them from the command line using

bin/hadoop job -counter <job-id> <group-name> <counter-name>

Tom

On Wed, Feb 11, 2009 at 12:20 AM, scruffy323 steve.mo...@gmail.com wrote:

 Do you know how to access those counters programmatically after the job has
 run?


 S D-5 wrote:

 This does it. Thanks!

 On Thu, Feb 5, 2009 at 9:14 PM, Arun C Murthy a...@yahoo-inc.com wrote:


 On Feb 5, 2009, at 1:40 PM, S D wrote:

  Is there a way to use the Reporter interface (or something similar such
 as
 Counters) with Hadoop streaming? Alternatively, how could STDOUT be
 intercepted for the purpose of updates? If anyone could point me to
 documentation or examples that cover this I'd appreciate it.



 http://hadoop.apache.org/core/docs/current/streaming.html#How+do+I+update+counters+in+streaming+applications%3F

 http://hadoop.apache.org/core/docs/current/streaming.html#How+do+I+update+status+in+streaming+applications%3F

 Arun




 --
 View this message in context: 
 http://www.nabble.com/Reporter-for-Hadoop-Streaming--tp21861786p21945843.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: can't read the SequenceFile correctly

2009-02-06 Thread Tom White
Hi Mark,

Not all the bytes stored in a BytesWritable object are necessarily
valid. Use BytesWritable#getLength() to determine how much of the
buffer returned by BytesWritable#getBytes() to use.
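
For example, a small helper like this copies out just the valid portion:

import java.util.Arrays;
import org.apache.hadoop.io.BytesWritable;

public class BytesWritableUtil {
  // Return only the bytes that were actually written, not the whole backing buffer
  public static byte[] validBytes(BytesWritable value) {
    return Arrays.copyOf(value.getBytes(), value.getLength());
  }
}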

Tom

On Fri, Feb 6, 2009 at 5:41 AM, Mark Kerzner markkerz...@gmail.com wrote:
 Hi,

 I have written binary files to a SequenceFile, seemeingly successfully, but
 when I read them back with the code below, after a first few reads I get the
 same number of bytes for the different files. What could go wrong?

 Thank you,
 Mark

  reader = new SequenceFile.Reader(fs, path, conf);
Writable key = (Writable)
 ReflectionUtils.newInstance(reader.getKeyClass(), conf);
Writable value = (Writable)
 ReflectionUtils.newInstance(reader.getValueClass(), conf);
long position = reader.getPosition();
while (reader.next(key, value)) {
 String syncSeen = reader.syncSeen() ? "*" : "";
byte [] fileBytes = ((BytesWritable) value).getBytes();
 System.out.printf("[%s%s]\t%s\t%s\n", position, syncSeen,
 key, fileBytes.length);
position = reader.getPosition(); // beginning of next record
}



Re: Problem with Counters

2009-02-05 Thread Tom White
Hi Sharath,

The code you posted looks right to me. Counters#getCounter() will
return the counter's value. What error are you getting?

Tom

On Thu, Feb 5, 2009 at 10:09 AM, some speed speed.s...@gmail.com wrote:
 Hi,

 Can someone help me with the usage of counters please? I am incrementing a
 counter in Reduce method but I am unable to collect the counter value after
 the job is completed.

 Its something like this:

 public static class Reduce extends MapReduceBase implements Reducer<Text,
 FloatWritable, Text, FloatWritable>
{
static enum MyCounter{ct_key1};

 public void reduce(..) throws IOException
{

reporter.incrCounter(MyCounter.ct_key1, 1);

output.collect(..);

}
 }

 -main method
 {
RunningJob running = null;
running=JobClient.runJob(conf);

Counters ct = running.getCounters();
 /*  How do I Collect the ct_key1 value ***/
long res = ct.getCounter(MyCounter.ct_key1);

 }





 Thanks,

 Sharath



Re: Problem with Counters

2009-02-05 Thread Tom White
Try moving the enum to inside the top level class (as you already did)
and then use getCounter() passing the enum value:

public class MyJob {

  static enum MyCounter{ct_key1};

  // Mapper and Reducer defined here

  public static void main(String[] args) throws IOException {
// ...
RunningJob running =JobClient.runJob(conf);
Counters ct = running.getCounters();
long res = ct.getCounter(MyCounter.ct_key1);
// ...
  }

}

BTW org.apache.hadoop.mapred.Task$Counter is a built-in MapReduce
counter, so that won't help you retrieve your custom counter.

Cheers,

Tom

On Thu, Feb 5, 2009 at 2:22 PM, Rasit OZDAS rasitoz...@gmail.com wrote:
 Sharath,

 You're using  reporter.incrCounter(enumVal, intVal);  to increment counter,
 I think method to get should also be similar.

 Try to use findCounter(enumVal).getCounter() or  getCounter(enumVal).

 Hope this helps,
 Rasit

 2009/2/5 some speed speed.s...@gmail.com:
 In fact I put the enum in my Reduce method as the following link (from
 Yahoo) says so:

 http://public.yahoo.com/gogate/hadoop-tutorial/html/module5.html#metrics
 ---Look at the section under Reporting Custom Metrics.

 2009/2/5 some speed speed.s...@gmail.com

 Thanks Rasit.

 I did as you said.

 1) Put the static enum MyCounter{ct_key1} just above main()

 2) Changed  result =
 ct.findCounter("org.apache.hadoop.mapred.Task$Counter", 1,
 "Reduce.MyCounter").getCounter();

 Still is doesnt seem to help. It throws a null pointer exception.Its not
 able to find the Counter.



 Thanks,

 Sharath




 On Thu, Feb 5, 2009 at 8:04 AM, Rasit OZDAS rasitoz...@gmail.com wrote:

 Forgot to say, value 0 means that the requested counter does not exist.

 2009/2/5 Rasit OZDAS rasitoz...@gmail.com:
  Sharath,
   I think the static enum definition should be out of Reduce class.
  Hadoop probably tries to find it elsewhere with MyCounter, but it's
  actually Reduce.MyCounter in your example.
 
  Hope this helps,
  Rasit
 
  2009/2/5 some speed speed.s...@gmail.com:
  I Tried the following...It gets compiled but the value of result seems
 to be
  0 always.
 
 RunningJob running = JobClient.runJob(conf);
 
  Counters ct = new Counters();
  ct = running.getCounters();
 
 long result =
   ct.findCounter("org.apache.hadoop.mapred.Task$Counter", 0,
   "MyCounter").getCounter();
  //even tried MyCounter.Key1
 
 
 
  Does anyone know whay that is happening?
 
  Thanks,
 
  Sharath
 
 
 
  On Thu, Feb 5, 2009 at 5:59 AM, some speed speed.s...@gmail.com
 wrote:
 
  Hi Tom,
 
  I get the error :
 
   Cannot find symbol: MyCounter.ct_key1
 
 
 
 
 
 
  On Thu, Feb 5, 2009 at 5:51 AM, Tom White t...@cloudera.com wrote:
 
  Hi Sharath,
 
  The code you posted looks right to me. Counters#getCounter() will
  return the counter's value. What error are you getting?
 
  Tom
 
  On Thu, Feb 5, 2009 at 10:09 AM, some speed speed.s...@gmail.com
 wrote:
   Hi,
  
   Can someone help me with the usage of counters please? I am
 incrementing
  a
   counter in Reduce method but I am unable to collect the counter
 value
  after
   the job is completed.
  
   Its something like this:
  
   public static class Reduce extends MapReduceBase implements
   Reducer<Text,
    FloatWritable, Text, FloatWritable>
  {
  static enum MyCounter{ct_key1};
  
   public void reduce(..) throws IOException
  {
  
  reporter.incrCounter(MyCounter.ct_key1, 1);
  
  output.collect(..);
  
  }
   }
  
   -main method
   {
  RunningJob running = null;
  running=JobClient.runJob(conf);
  
  Counters ct = running.getCounters();
   /*  How do I Collect the ct_key1 value ***/
  long res = ct.getCounter(MyCounter.ct_key1);
  
   }
  
  
  
  
  
   Thanks,
  
   Sharath
  
 
 
 
 
 
 
 
  --
  M. Raşit ÖZDAŞ
 



 --
 M. Raşit ÖZDAŞ







 --
 M. Raşit ÖZDAŞ



Re: SequenceFiles, checkpoints, block size (Was: How to flush SequenceFile.Writer?)

2009-02-03 Thread Tom White
Hi Brian,

Writes to HDFS are not guaranteed to be flushed until the file is
closed. In practice, as each (64MB) block is finished it is flushed
and will be visible to other readers, which is what you were seeing.

The addition of appends in HDFS changes this and adds a sync() method
to FSDataOutputStream. You can read about the semantics of the new
operations here:
https://issues.apache.org/jira/secure/attachment/12370562/Appends.doc.
Unfortunately, there are some problems with sync() that are still
being worked through
(https://issues.apache.org/jira/browse/HADOOP-4379). Also, even with
sync() working, the append() on SequenceFile does not do an implicit
sync() - it is not atomic. Furthermore, there is no way to get hold of
the FSDataOutputStream to call sync() yourself - see
https://issues.apache.org/jira/browse/HBASE-1155. (And don't get
confused by the sync() method on SequenceFile.Writer - it is for
another purpose entirely.)

As Jason points out, the simplest way to achieve what you're trying to
do is to close the file and start a new one. If you start to get too
many small files, then you can have another process to merge the
smaller files in the background.
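
A very rough sketch of that close-and-roll approach (key/value types and the
part-file naming are placeholders):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class RollingSequenceWriter {
  private final FileSystem fs;
  private final Configuration conf;
  private final Path dir;
  private SequenceFile.Writer writer;

  public RollingSequenceWriter(FileSystem fs, Configuration conf, Path dir)
      throws IOException {
    this.fs = fs;
    this.conf = conf;
    this.dir = dir;
    roll();
  }

  /** Close the current file (making it visible to MapReduce) and open a new one. */
  public synchronized void roll() throws IOException {
    if (writer != null) {
      writer.close();
    }
    Path part = new Path(dir, "part-" + System.currentTimeMillis());
    writer = SequenceFile.createWriter(fs, conf, part, LongWritable.class, Text.class);
  }

  public synchronized void append(LongWritable key, Text value) throws IOException {
    writer.append(key, value);
  }

  public synchronized void close() throws IOException {
    writer.close();
  }
}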

Tom

On Tue, Feb 3, 2009 at 3:57 AM, jason hadoop jason.had...@gmail.com wrote:
 If you have to do a time based solution, for now, simply close the file and
 stage it, then open a new file.
 Your reads will have to deal with the fact the file is in multiple parts.
 Warning: Datanodes get pokey if they have large numbers of blocks, and the
 quickest way to do this is to create a lot of small files.

 On Mon, Feb 2, 2009 at 9:54 AM, Brian Long br...@dotspots.com wrote:

 Let me rephrase this problem... as stated below, when I start writing to a
 SequenceFile from an HDFS client, nothing is visible in HDFS until I've
 written 64M of data. This presents three problems: fsck reports the file
 system as corrupt until the first block is finally written out, the
 presence
 of the file (without any data) seems to blow up my mapred jobs that try to
 make use of it under my input path, and finally, I want to basically flush
 every 15 minutes or so so I can mapred the latest data.
 I don't see any programmatic way to force the file to flush in 17.2.
 Additionally, dfs.checkpoint.period does not seem to be obeyed. Does that
 not do what I think it does? What controls the 64M limit, anyway? Is it
 dfs.checkpoint.size or dfs.block.size? Is the buffering happening on
 the
 client, or on data nodes? Or in the namenode?

 It seems really bad that a SequenceFile, upon creation, is in an unusable
 state from the perspective of a mapred job, and also leaves fsck in a
 corrupt state. Surely I must be doing something wrong... but what? How can
 I
 ensure that a SequenceFile is immediately usable (but empty) on creation,
 and how can I make things flush on some regular time interval?

 Thanks,
 Brian


 On Thu, Jan 29, 2009 at 4:17 PM, Brian Long br...@dotspots.com wrote:

  I have a SequenceFile.Writer that I obtained via
 SequenceFile.createWriter
  and write to using append(key, value). Because the writer volume is low,
  it's not uncommon for it to take over a day for my appends to finally be
  flushed to HDFS (e.g. the new file will sit at 0 bytes for over a day).
  Because I am running map/reduce tasks on this data multiple times a day,
 I
  want to flush the sequence file so the mapred jobs can pick it up when
  they run.
  What's the right way to do this? I'm assuming it's a fairly common use
  case. Also -- are writes to the sequence files atomic? (e.g. if I am
  actively appending to a sequence file, is it always safe to read from
 that
  same file in a mapred job?)
 
  To be clear, I want the flushing to be time based (controlled explicitly
 by
  the app), not size based. Will this create waste in HDFS somehow?
 
  Thanks,
  Brian
 
 




Re: hadoop to ftp files into hdfs

2009-02-03 Thread Tom White
NLineInputFormat is ideal for this purpose. Each split will be N lines
of input (where N is configurable), so each mapper can retrieve N
files for insertion into HDFS. You can set the number of redcers to
zero.

Tom

On Tue, Feb 3, 2009 at 4:23 AM, jason hadoop jason.had...@gmail.com wrote:
 If you have a large number of ftp urls spread across many sites, simply set
 that file to be your hadoop job input, and force the input split to be a
 size that gives you good distribution across your cluster.


 On Mon, Feb 2, 2009 at 3:23 PM, Steve Morin steve.mo...@gmail.com wrote:

 Does any one have a good suggestion on how to submit a hadoop job that
 will split the ftp retrieval of a number of files for insertion into
 hdfs?  I have been searching google for suggestions on this matter.
 Steve




Re: best way to copy all files from a file system to hdfs

2009-02-02 Thread Tom White
Is there any reason why it has to be a single SequenceFile? You could
write a local program to write several block compressed SequenceFiles
in parallel (to HDFS), each containing a portion of the files on your
PC.
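
Each writer thread could open its own block-compressed file with something like
this (the key/value types are just an example, and it assumes fs.default.name
points at your HDFS):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SeqFileWriterFactory {
  /** Open one block-compressed SequenceFile per writer thread. */
  public static SequenceFile.Writer open(Configuration conf, Path path) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    return SequenceFile.createWriter(fs, conf, path, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);
  }
}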

Tom

On Mon, Feb 2, 2009 at 3:24 PM, Mark Kerzner markkerz...@gmail.com wrote:
 Truly, I do not see any advantage to doing this, as opposed to writing
 (Java) code which will copy files to HDFS, because then tarring becomes my
 bottleneck. Unless I write code to measure the file sizes and prepare pointers
 for multiple tarring tasks. It becomes pretty complex though, and I thought
 of something simple. I might as well accept that copying one hard drive to
 HDFS is not going to be parallelized.
 Mark

 On Sun, Feb 1, 2009 at 11:44 PM, Philip (flip) Kromer
 f...@infochimps.orgwrote:

 Could you tar.bz2 them up (setting up the tar so that it made a few dozen
 files), toss them onto the HDFS, and use
 http://stuartsierra.com/2008/04/24/a-million-little-files
 to go into SequenceFile?

 This lets you preserve the originals and do the sequence file conversion
 across the cluster. It's only really helpful, of course, if you also want
 to
 prepare a .tar.bz2 so you can clear out the sprawl

 flip

 On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner markkerz...@gmail.com
 wrote:

  Hi,
 
  I am writing an application to copy all files from a regular PC to a
  SequenceFile. I can surely do this by simply recursing all directories on
  my
  PC, but I wonder if there is any way to parallelize this, a MapReduce
 task
  even. Tom White's books seems to imply that it will have to be a custom
  application.
 
  Thank you,
  Mark
 



 --
 http://www.infochimps.org
 Connected Open Free Data




Re: A record version mismatch occured. Expecting v6, found v32

2009-02-02 Thread Tom White
The SequenceFile format is described here:
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/SequenceFile.html.
The format of the keys and values depends on the serialization classes
used. For example, BytesWritable writes out the length of its byte
array followed by the actual bytes in the array (see the write()
method in BytesWritable).

Hope this helps.
Tom

On Mon, Feb 2, 2009 at 3:21 PM, Rasit OZDAS rasitoz...@gmail.com wrote:
 I tried to use SequenceFile.Writer to convert my binaries into Sequence
 Files,
 I read the binary data with FileInputStream, getting all bytes with
 reader.read(byte[])  , wrote it to a file with SequenceFile.Writer, with
 parameters NullWritable as key, BytesWritable as value. But the content
 changes,
 (I can see that by converting to Base64)

 Binary File:
 73 65 65 65 81 65 65 65 65 65 81 81 65 119 84 81 65 111 67 81 65 52 57 81 65
 103 54 81 65 65 97 81 65 65 65 81 ...

 Sequence File:
 73 65 65 65 65 69 65 65 65 65 65 65 65 69 66 65 65 77 66 77 81 103 67 103 67
 69 77 65 52 80 86 67 65 73 68 114 ...

 Thanks for any points..
 Rasit

 2009/2/2 Rasit OZDAS rasitoz...@gmail.com

 Hi,
 I tried to use SequenceFileInputFormat; for this I appended SEQ as the first
 bytes of my binary files (with hex editor).
 but I get this exception:

 A record version mismatch occured. Expecting v6, found v32
 at
 org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1460)
 at
 org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1428)
 at
 org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1417)
 at
 org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1412)
 at
 org.apache.hadoop.mapred.SequenceFileRecordReader.init(SequenceFileRecordReader.java:43)
 at
 org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:58)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
 at org.apache.hadoop.mapred.Child.main(Child.java:155)

 What could it be? Is it not enough just to add SEQ to binary files?
 I use Hadoop v.0.19.0 .

 Thanks in advance..
 Rasit



 --
 M. Raşit ÖZDAŞ




 --
 M. Raşit ÖZDAŞ



Re: MapFile.Reader and seek

2009-02-02 Thread Tom White
You can use the get() method to seek and retrieve the value. It will
return null if the key is not in the map. Something like:

Text value = (Text) indexReader.get(from, new Text());
while (value != null && ...)

Tom

On Thu, Jan 29, 2009 at 10:45 PM, schnitzi
mark.schnitz...@fastsearch.com wrote:

 Greetings all...  I have a situation where I want to read a range of keys and
 values out of a MapFile.  So I have something like this:

MapFile.Reader indexReader = new MapFile.Reader(fs, path.toString(),
 configuration)
boolean seekSuccess = indexReader.seek(from);
boolean readSuccess = indexReader.next(keyValue, value);
    while (readSuccess && ...)

 The problem seems to be that while seekSuccess is returning true, when I
 call next() to get the value there, it's returning the value *after* the key
 that I called seek() on.  So if, say, my keys are Text(id0) through
 Text(id9), and I seek for Text(id3), calling next() will return
 Text(id4) and its associated value, not Text(id3).

 I would expect next() to return the key/value at the seek location, not the
 one after it.  Am I doing something wrong?  Otherwise, what good is seek(),
 really?
 --
 View this message in context: 
 http://www.nabble.com/MapFile.Reader-and-seek-tp21737717p21737717.html
 Sent from the Hadoop core-user mailing list archive at Nabble.com.




Re: best way to copy all files from a file system to hdfs

2009-02-02 Thread Tom White
Yes. SequenceFile is splittable, which means it can be broken into
chunks, called splits, each of which can be processed by a separate
map task.

Tom

On Mon, Feb 2, 2009 at 3:46 PM, Mark Kerzner markkerz...@gmail.com wrote:
 No, no reason for a single file - just a little simpler to think about. By
 the way, can multiple MapReduce workers read the same SequenceFile
 simultaneously?

 On Mon, Feb 2, 2009 at 9:42 AM, Tom White t...@cloudera.com wrote:

 Is there any reason why it has to be a single SequenceFile? You could
 write a local program to write several block compressed SequenceFiles
 in parallel (to HDFS), each containing a portion of the files on your
 PC.

 Tom

 On Mon, Feb 2, 2009 at 3:24 PM, Mark Kerzner markkerz...@gmail.com
 wrote:
  Truly, I do not see any advantage to doing this, as opposed to writing
  (Java) code which will copy files to HDFS, because then tarring becomes my
  bottleneck. Unless I write code to measure the file sizes and prepare
  pointers for multiple tarring tasks. It becomes pretty complex though, and
  I thought of something simple. I might as well accept that copying one
  hard drive to HDFS is not going to be parallelized.
  Mark
 
  On Sun, Feb 1, 2009 at 11:44 PM, Philip (flip) Kromer
  f...@infochimps.orgwrote:
 
  Could you tar.bz2 them up (setting up the tar so that it made a few dozen
  files), toss them onto the HDFS, and use
  http://stuartsierra.com/2008/04/24/a-million-little-files
  to go into SequenceFile?

  This lets you preserve the originals and do the sequence file conversion
  across the cluster. It's only really helpful, of course, if you also want
  to prepare a .tar.bz2 so you can clear out the sprawl.

  flip
 
  On Sun, Feb 1, 2009 at 11:22 PM, Mark Kerzner markkerz...@gmail.com
  wrote:
 
   Hi,
  
   I am writing an application to copy all files from a regular PC to a
   SequenceFile. I can surely do this by simply recursing all directories
 on
   my
   PC, but I wonder if there is any way to parallelize this, a MapReduce
  task
   even. Tom White's books seems to imply that it will have to be a
 custom
   application.
  
   Thank you,
   Mark
  
 
 
 
  --
  http://www.infochimps.org
  Connected Open Free Data
 
 




Re: tools for scrubbing HDFS data nodes?

2009-01-29 Thread Tom White
Each datanode has a web page at
http://datanode:50075/blockScannerReport where you can see details
about the scans.

Tom

On Thu, Jan 29, 2009 at 7:29 AM, Raghu Angadi rang...@yahoo-inc.com wrote:
 Owen O'Malley wrote:

 On Jan 28, 2009, at 6:16 PM, Sriram Rao wrote:

 By scrub I mean, have a tool that reads every block on a given data
 node.  That way, I'd be able to find corrupted blocks proactively
 rather than having an app read the file and find it.

 The datanode already has a thread that checks the blocks periodically for
 exactly that purpose.

 This has been there since Hadoop 0.16.0. It scans all the blocks every 3 weeks
 (by default; the interval can be changed).

 Raghu.



Re: Set the Order of the Keys in Reduce

2009-01-22 Thread Tom White
Hi Brian,

The CAT_A and CAT_B keys will be processed by different reducer
instances, so they run independently and may run in any order. What's
the output that you're trying to get?

Cheers,
Tom

On Thu, Jan 22, 2009 at 3:25 PM, Brian MacKay
brian.mac...@medecision.com wrote:
 Hello,



 Any tips would be greatly appreciated.



 Is there a way to set the order of the keys in reduce, as shown below, no
 matter what order the collect() calls in map() occur in?



 Thanks, Brian





public void map(WritableComparable key, Text values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {

    // collect many CAT_A and CAT_B in random order
    output.collect(CAT_A, details);
    output.collect(CAT_B, details);
}


   public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {

    // always reduce CAT_A first, then reduce CAT_B

  }




Re: Archive?

2009-01-22 Thread Tom White
Hi Mark,

The archives are listed on http://wiki.apache.org/hadoop/MailingListArchives

Tom

On Thu, Jan 22, 2009 at 3:41 PM, Mark Kerzner markkerz...@gmail.com wrote:
 Hi,
 is there an archive to the messages? I am a newcomer, granted, but google
 groups has all the discussion capabilities, and it has a searchable archive.
 It is strange to have just a mailing list. Am I missing something?

 Thank you,
 Mark



Re: Set the Order of the Keys in Reduce

2009-01-22 Thread Tom White
Reducers run independently and without knowledge of one another, so
you can't get one reducer to depend on the output of another. I think
having two jobs is the simplest way to achieve what you're trying to
do.
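
Something like this, as a sketch (the two job set-up methods are hypothetical
and the intermediate path is just an example):

  JobConf catAJob = createCatAJob();                 // processes the CAT_A records
  FileOutputFormat.setOutputPath(catAJob, new Path("/tmp/cat_a_out"));
  JobClient.runJob(catAJob);                         // blocks until the first job finishes

  JobConf catBJob = createCatBJob();                 // can now read /tmp/cat_a_out as side input
  JobClient.runJob(catBJob);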

Tom

On Thu, Jan 22, 2009 at 3:48 PM, Brian MacKay
brian.mac...@medecision.com wrote:
 Hello Tom,

 I would like to apply some rules to CAT_A, then use the output of CAT_A to
 reduce CAT_B. I'd rather not run two jobs, so perhaps I need two
 reducers?


 The first reducer processes CAT_A, then when it is complete the second
 reducer does CAT_B?

 I suppose this would accomplish the same thing?



 -Original Message-
 From: Tom White [mailto:t...@cloudera.com]
 Sent: Thursday, January 22, 2009 10:41 AM
 To: core-user@hadoop.apache.org
 Subject: Re: Set the Order of the Keys in Reduce

 Hi Brian,

 The CAT_A and CAT_B keys will be processed by different reducer
 instances, so they run independently and may run in any order. What's
 the output that you're trying to get?

 Cheers,
 Tom

 On Thu, Jan 22, 2009 at 3:25 PM, Brian MacKay
 brian.mac...@medecision.com wrote:
 Hello,



 Any tips would be greatly appreciated.



 Is there a way to set the order of the keys in reduce as shown below,
 no
 matter what order the collection in MAP occurs in.



 Thanks, Brian





public void map(WritableComparable key, Text values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {

    // collect many CAT_A and CAT_B in random order
    output.collect(CAT_A, details);
    output.collect(CAT_B, details);
}


   public void reduce(Text key, Iterator<Text> values,
        OutputCollector<Text, Text> output, Reporter reporter)
        throws IOException {

    // always reduce CAT_A first, then reduce CAT_B

  }






Re: Why does Hadoop need ssh access to master and slaves?

2009-01-21 Thread Tom White
Hi Matthias,

It is not necessary to have SSH set up to run Hadoop, but it does make
things easier. SSH is used by the scripts in the bin directory which
start and stop daemons across the cluster (the slave nodes are defined
in the slaves file), see the start-all.sh script as a starting point.
These scripts are a convenient way to control Hadoop, but there are
other possibilities. If you had another system to control daemons on
your cluster then you wouldn't need SSH.

Tom

On Wed, Jan 21, 2009 at 1:20 PM, Matthias Scherer
matthias.sche...@1und1.de wrote:
 Hi Steve and Amit,

 Thanks for your answers. I agree with you that key-based ssh is nothing to
 worry about. But I'm wondering what exactly - that is, which grid
 administration tasks - hadoop does via ssh. Does it restart crashed data
 nodes or task trackers on the slaves? Or does it transfer data over the
 grid with ssh access? How can I find a short description of what exactly
 hadoop needs ssh for? The documentation says only that I have to configure it.

 Thanks  Regards
 Matthias


 -Original Message-
 From: Steve Loughran [mailto:ste...@apache.org]
 Sent: Wednesday, 21 January 2009 13:59
 To: core-user@hadoop.apache.org
 Subject: Re: Why does Hadoop need ssh access to master and slaves?

 Amit k. Saha wrote:
  On Wed, Jan 21, 2009 at 5:53 PM, Matthias Scherer
  matthias.sche...@1und1.de wrote:
  Hi all,
 
  we've made our first steps in evaluating hadoop. The setup of 2 VMs
  as a hadoop grid was very easy and works fine.

  Now our operations team wonders why hadoop has to be able to connect
  to the master and slaves via password-less ssh?! Can anyone give us
  an answer to this question?

  1. There has to be a way to connect to the remote hosts - slaves and a
  secondary master - and SSH is the secure way to do it. 2. It has to be
  password-less to enable automatic logins.
 

 SSH is *a* secure way to do it, but not the only way. Other
 management tools can bring up hadoop clusters. Hadoop ships
 with scripted support for SSH as it is standard with Linux
 distros and generally the best way to bring up a remote console.

 Matthias,
 Your ops team should not be worrying about the SSH security,
 as long as they keep their keys under control.

 (a) Key-based SSH is more secure than passworded SSH, as
 man-in-the-middle attacks are prevented. Passphrase-protected SSH
 keys on external USB keys are even better.

 (b) once the cluster is up, that filesystem is pretty
 vulnerable to anything on the LAN. You do need to lock down
 your datacentre, or set up the firewall/routing of the
 servers so that only trusted hosts can talk to the FS. SSH
 becomes a detail at that point.






Re: @hadoop on twitter

2009-01-16 Thread Tom White
Thanks flip.

I've signed up for the hadoop account - it'd be great to get some help with
getting it going.

Tom

On Wed, Jan 14, 2009 at 6:33 AM, Philip (flip) Kromer
f...@infochimps.org wrote:
 Hey all,
 There is no @hadoop on twitter, but there should be.
 http://twitter.com/datamapper and http://twitter.com/rails both set good
 examples.

 I'd be glad to either help get that going or to nod approvingly if someone
 on core does so.

 flip



Re: Re: getting null from CompressionCodecFactory.getCodec(Path file)

2009-01-14 Thread Tom White
LZO was removed due to license incompatibility:
https://issues.apache.org/jira/browse/HADOOP-4874

Tom

On Wed, Jan 14, 2009 at 11:18 AM, Gert Pfeifer
pfei...@se.inf.tu-dresden.de wrote:
 I got it. For some reason getDefaultExtension() returns .lzo_deflate.

 Is that a bug? Shouldn't it be .lzo?

 In the head revision I couldn't find it at all in
 http://svn.apache.org/repos/asf/hadoop/core/trunk/src/core/org/apache/hadoop/io/compress/

 There should be a Class LzoCodec.java. Was that moved to somewhere else?

 Gert

 Gert Pfeifer wrote:
 Arun C Murthy wrote:
 On Jan 13, 2009, at 7:29 AM, Gert Pfeifer wrote:

 Hi,
 I want to use an lzo file as input for a mapper. The record reader
 determines the codec using a CompressionCodecFactory, like this:

 (Hadoop version 0.19.0)

 http://hadoop.apache.org/core/docs/r0.19.0/native_libraries.html

 I should have mentioned that I have these native libs running:
 2009-01-14 10:00:21,107 INFO org.apache.hadoop.util.NativeCodeLoader:
 Loaded the native-hadoop library
 2009-01-14 10:00:21,111 INFO org.apache.hadoop.io.compress.LzoCodec:
 Successfully loaded  initialized native-lzo library

 Is that what you tried to point out with this link?

 Gert

 hth,
 Arun

 compressionCodecs = new CompressionCodecFactory(job);
 System.out.println("Using codecFactory: " + compressionCodecs.toString());
 final CompressionCodec codec = compressionCodecs.getCodec(file);
 System.out.println("Using codec: " + codec + " for file " + file.getName());


 The output that I get is:

 Using codecFactory: { etalfed_ozl.:
 org.apache.hadoop.io.compress.LzoCodec }
 Using codec: null for file test.lzo

 Of course, the mapper does not work without codec. What could be the
 problem?

 Thanks,
 Gert



Re: Problem with Hadoop and concatenated gzip files

2009-01-12 Thread Tom White
I've opened https://issues.apache.org/jira/browse/HADOOP-5014 for this.

Do you get this behaviour when you use the native libraries?

Tom

On Sat, Jan 10, 2009 at 12:26 AM, Oscar Gothberg
oscar.gothb...@platform-a.com wrote:
 Hi,

 I'm having trouble with Hadoop (tested with 0.17 and 0.19) not fully 
 processing certain gzipped input files. Basically it only actually reads and 
 processes a first part of the gzipped file, and just ignores the rest without 
 any kind of warning.

 It affects at least (but is maybe not limited to?) any gzip files that are a 
 result of concatenation (which should be legal to do with gzip format):
 http://www.gnu.org/software/gzip/manual/gzip.html#Advanced-usage

 Repro case, using the WordCount example from the hadoop distribution:
  $ echo 'one two three' > f1
  $ echo 'four five six' > f2
  $ gzip -c f1 > combined_file.gz
  $ gzip -c f2 >> combined_file.gz

 Now, if I run WordCount with combined_file.gz as input, it will only find 
 the words 'one', 'two', 'three', but not 'four', 'five', 'six'.

 It seems Java's GZIPInputStream may have a similar issue:
 http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4691425

 Now, if I unzip and re-gzip this 'combined_file.gz' manually, the problem 
 goes away.

  It's especially dangerous since Hadoop doesn't show any errors or complaints 
  in the least. It just ignores this extra input. The only way of noticing is 
  to run one's app with gzipped and unzipped data side by side and notice the 
  record counts being different.

 Is anyone else familiar with this problem? Any solutions, workarounds, short 
 of re-gzipping very large amounts of data?

 Thanks!
 / Oscar

 



Re: Concatenating PDF files

2009-01-05 Thread Tom White
Hi Richard,

Are you running out of memory after many PDFs have been processed by
one mapper, or during the first? The former would suggest that memory
isn't being released; the latter that the task VM doesn't have enough
memory to start with.

Are you setting the memory available to map tasks by setting
mapred.child.java.opts? You can try to see how much memory the
processes are using by logging into a machine when the job is running
and running 'top' or 'ps'.

It won't help the memory problems, but it sounds like you could run
with zero reducers for this job (conf.setNumReduceTasks(0)). Also, EC2
XL instances can run more than two tasks per node (they have 4 virtual
cores, see http://aws.amazon.com/ec2/instance-types/). And you should
configure them to take advantage of multiple disks -
https://issues.apache.org/jira/browse/HADOOP-4745.
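
For example, a sketch of the settings mentioned above (the heap size is only
illustrative, and PdfConcatenator is a hypothetical job class):

  JobConf conf = new JobConf(PdfConcatenator.class);
  conf.setNumReduceTasks(0);                         // map-only job
  conf.set("mapred.child.java.opts", "-Xmx2000m");   // heap per task JVM, not per node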

Tom

On Fri, Jan 2, 2009 at 8:50 PM, Zak, Richard [USA] zak_rich...@bah.com wrote:
 All, I have a project that I am working on involving PDF files in HDFS.
 There are X number of directories and each directory contains Y number
 of PDFs, and per directory all the PDFs are to be concatenated.  At the
 moment I am running a test with 5 directories and 15 PDFs in each
 directory.  I am also using iText to handle the PDFs, and I wrote a
 wrapper class to take PDFs and add them to an internal PDF that grows. I
 am running this on Amazon's EC2 using Extra Large instances, which have
 a total of 15 GB RAM.  Each Java process, two per Instance, has 7GB
 maximum (-Xmx7000m).  There is one Master Instance and 4 Slave
 instances.  I am able to confirm that the Slave processes are connected
 to the Master and have been working.  I am using Hadoop 0.19.0.

 The problem is that I run out of memory when the concatenation class
 reads in a PDF.  I have tried both the iText library version 2.1.4 and
 the Faceless PDF library, and both have the error in the middle of
 concatenating the documents.  I looked into Multivalent, but that one
 just uses Strings to determine paths and it opens the files directly,
 while I am using a wrapper class to interact with items in HDFS, so
 Multivalent is out.

  Since the PDFs aren't enormous (17 MB or less) and each Instance has
  tons of memory, why am I running out of memory?

 The mapper works like this.  It gets a text file with a list of
 directories, and per directory it reads in the contents and adds them to
 the concatenation class.  The reducer pretty much does nothing.  Is this
 the best way to do this, or is there a better way?

 Thank you!

 Richard J. Zak




Re: Predefined counters

2008-12-22 Thread Tom White
Hi Jim,

Try something like:

Counters counters = job.getCounters();
counters.findCounter("org.apache.hadoop.mapred.Task$Counter",
    "REDUCE_INPUT_RECORDS").getCounter()
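
In context, that would look something like this (a sketch; the group and
counter names are the ones shown in the web UI):

  RunningJob job = JobClient.runJob(conf);
  Counters counters = job.getCounters();
  long reduceInputRecords =
      counters.findCounter("org.apache.hadoop.mapred.Task$Counter",
                           "REDUCE_INPUT_RECORDS").getCounter();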

The pre-defined counters are unfortunately not public and are not in
one place in the source code, so you'll need to hunt to find them
(search the source for the counter name you see in the web UI). I
opened https://issues.apache.org/jira/browse/HADOOP-4043 a while back
to address the fact they are not public. Please consider voting for it
if you think it would be useful.

Cheers,
Tom

On Mon, Dec 22, 2008 at 2:47 AM, Jim Twensky jim.twen...@gmail.com wrote:
 Hello,
 I need to collect some statistics using some of the counters defined by the
 Map/Reduce framework such as Reduce input records. I know I should use
 the  getCounter method from Counters.Counter but I couldn't figure how to
 use it. Can someone give me a two line example of how to read the values for
 those counters and where I can find the names/groups of the predefined
 counters?

 Thanks in advance,
 Jim



Re: contrib/ec2 USER_DATA not used

2008-12-18 Thread Tom White
Hi Stefan,

The USER_DATA line is a hangover from the way that these parameters
used to be passed to the node. This line can safely be removed, since
the scripts now pass the data in the USER_DATA_FILE as you rightly
point out.

Tom

On Thu, Dec 18, 2008 at 10:09 AM, Stefan Groschupf s...@101tec.com wrote:
 Hi,

 can someone tell me what the variable USER_DATA in the launch-hadoop-master
 is all about.
 I cant see that it is reused in the script or any other script.
 Isnt the way those parameters are passed to the nodes the USER_DATA_FILE ?
 The line is:
 USER_DATA=MASTER_HOST=master,MAX_MAP_TASKS=$MAX_MAP_TASKS,MAX_REDUCE_TASKS=$MAX_REDUCE_TASKS,COMPRESS=$COMPRESS
 Any hints?
 Thanks,
 Stefan
 ~~~
 Hadoop training and consulting
 http://www.scaleunlimited.com
 http://www.101tec.com






Re: EC2 Usage?

2008-12-18 Thread Tom White
Hi Ryan,

The ec2-describe-instances command in the API tool reports the launch
time for each instance, so you could work out the machine hours of
your cluster using that information.

Tom

On Thu, Dec 18, 2008 at 4:59 PM, Ryan LeCompte lecom...@gmail.com wrote:
 Hello all,

 Somewhat of a an off-topic related question, but I know there are
 Hadoop + EC2 users here. Does anyone know if there is a programmatic
 API to get find out how many machine time hours have been used by a
 Hadoop cluster (or anything) running on EC2? I know that you can log
 into the EC2 web site and see this, but I'm wondering if there's a way
 to access this data programmaticly via web services?

 Thanks,
 Ryan



Re: API Documentation question - WritableComparable

2008-12-16 Thread Tom White
I've opened https://issues.apache.org/jira/browse/HADOOP-4881 and
attached a patch to fix this.
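
Purely for illustration (this is not necessarily the wording in the attached
patch), a compareTo that compares the example class's own fields might look
like:

  public int compareTo(MyWritableComparable o) {
    if (this.counter != o.counter) {
      return this.counter < o.counter ? -1 : 1;
    }
    return (this.timestamp < o.timestamp ? -1 :
           (this.timestamp == o.timestamp ? 0 : 1));
  }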

Tom

On Fri, Dec 12, 2008 at 2:18 AM, Tarandeep Singh tarand...@gmail.com wrote:
 The example is just to illustrate how one should implement one's own
 WritableComparable class; in the compareTo method, it is just showing how
 it works in the case of IntWritable, with value as its member variable.

 You are right, the example's code is misleading. It should have used either
 timestamp or counter (or both), not value.

 -Taran

 On Thu, Dec 11, 2008 at 3:55 PM, Andy Sautins
 andy.saut...@returnpath.netwrote:



  I have a question regarding the Hadoop API documentation for .19.  The
 question is in regard to:
 http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/Writ
 ableComparable.htmlhttp://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/WritableComparable.html.
  The document shows the following for the compareTo
 method:



   public int compareTo(MyWritableComparable w) {
 int thisValue = this.value;
 int thatValue = ((IntWritable)o).value;
 return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
   }





   Taking the full class example doesn't compile.  What I _think_ would
 be right would be:



   public int compareTo(Object o) {
 int thisValue = this.value;
 int thatValue = ((MyWritableComparable)o).value;
 return (thisValue < thatValue ? -1 : (thisValue == thatValue ? 0 : 1));
   }



  But even at that it's unclear why the compareTo function is comparing
 value ( which isn't a member of the class in the example ) and not the
 counter and timestamp variables in the class.



   Am I understanding this right?  Is there something amiss with the
 documentation?



   Thanks



   Andy









Re: When I system.out.println() in a map or reduce, where does it go?

2008-12-11 Thread Tom White
You can also see the logs from the web UI (http://jobtracker:50030
by default), by clicking through to the map or reduce task that you
are interested in and looking at the page for task attempts.

Tom

On Wed, Dec 10, 2008 at 10:41 PM, Tarandeep Singh [EMAIL PROTECTED] wrote:
 you can see the output in hadoop log directory (if you have used default
 settings, it would be $HADOOP_HOME/logs/userlogs)

 On Wed, Dec 10, 2008 at 1:31 PM, David Coe [EMAIL PROTECTED] wrote:

 I've noticed that if I put a system.out.println in the run() method I
 see the result on my console.  If I put it in the map or reduce class, I
 never see the result.  Where does it go?  Is there a way to get this
 result easily (eg dump it in a log file)?

 David




Re: Auto-shutdown for EC2 clusters

2008-11-26 Thread Tom White
I've just created a basic script to do something similar for running a
benchmark on EC2. See
https://issues.apache.org/jira/browse/HADOOP-4382. As it stands the
code for detecting when Hadoop is ready to accept jobs is simplistic,
to say the least, so any ideas for improvement would be great.

Thanks,
Tom

On Fri, Oct 24, 2008 at 11:53 PM, Chris K Wensel [EMAIL PROTECTED] wrote:

 fyi, the src/contrib/ec2 scripts do just what Paco suggests.

 minus the static IP stuff (you can use the scripts to login via cluster
 name, and spawn a tunnel for browsing nodes)

 that is, you can spawn any number of uniquely named, configured, and sized
 clusters, and you can increase their size independently as well. (shrinking
 is another matter altogether)

 ckw

 On Oct 24, 2008, at 1:58 PM, Paco NATHAN wrote:

 Hi Karl,

 Rather than using separate key pairs, you can use EC2 security groups
 to keep track of different clusters.

 Effectively, that requires a new security group for every cluster --
 so just allocate a bunch of different ones in a config file, then have
 the launch scripts draw from those. We also use EC2 static IP
 addresses and then have a DNS entry named similarly to each security
 group, associated with a static IP once that cluster is launched.
 It's relatively simple to query the running instances and collect them
 according to security groups.

 One way to handle detecting failures is just to attempt SSH in a loop.
 Our rough estimate is that approximately 2% of the attempted EC2 nodes
 fail at launch. So we allocate more than enough, given that rate.

 In a nutshell, that's one approach for managing a Hadoop cluster
 remotely on EC2.

 Best,
 Paco


 On Fri, Oct 24, 2008 at 2:07 PM, Karl Anderson [EMAIL PROTECTED] wrote:

 On 23-Oct-08, at 10:01 AM, Paco NATHAN wrote:

 This workflow could be initiated from a crontab -- totally automated.
 However, we still see occasional failures of the cluster, and must
 restart manually, but not often.  Stability for that has improved much
 since the 0.18 release.  For us, it's getting closer to total
 automation.

 FWIW, that's running on EC2 m1.xl instances.

 Same here.  I've always had the namenode and web interface be accessible,
 but sometimes I don't get the slave nodes - usually zero slaves when this
 happens, sometimes I only miss one or two.  My rough estimate is that
 this
 happens 1% of the time.

 I currently have to notice this and restart manually.  Do you have a good
 way to detect it?  I have several Hadoop clusters running at once with
 the
 same AWS image and SSH keypair, so I can't count running instances.  I
 could
 have a separate keypair per cluster and count instances with that
 keypair,
 but I'd like to be able to start clusters opportunistically, with more
 than
 one cluster doing the same kind of job on different data.


 Karl Anderson
 [EMAIL PROTECTED]
 http://monkey.org/~kra





 --
 Chris K Wensel
 [EMAIL PROTECTED]
 http://chris.wensel.net/
 http://www.cascading.org/




Google Terasort Benchmark

2008-11-22 Thread Tom White
From the Google Blog,
http://googleblog.blogspot.com/2008/11/sorting-1pb-with-mapreduce.html

We are excited to announce we were able to sort 1TB (stored on the
Google File System as 10 billion 100-byte records in uncompressed text
files) on 1,000 computers in 68 seconds. By comparison, the previous
1TB sorting record [using Hadoop] is 209 seconds on 910 computers.

Something for the Hadoop community to aim for: a threefold performance increase.

Tom


Re: Hadoop Book

2008-09-16 Thread Tom White
The Rough Cut of the book is now available from
http://oreilly.com/catalog/9780596521998/index.html. There are a few
chapters available already, at various stages of completion. I'd love
to hear any suggestions for improvements that you may have. You can
give feedback on the Safari website where the book is hosted (please
don't post it on this thread). As the Rough Cuts FAQ explains
(http://oreilly.com/roughcuts/faq.csp), the most valuable feedback is
on missing topics, if something is not understandable, and technical
mistakes.

Thanks,

Tom

2008/9/4 叶双明 [EMAIL PROTECTED]:
 waiting for it!!!

 2008/9/5, Owen O'Malley [EMAIL PROTECTED]:


 On Sep 4, 2008, at 6:36 AM, 叶双明 wrote:

 what book?


 To summarize, Tom White is writing a book about Hadoop. He will post a
 message to the list when a draft is ready.

 -- Owen



Re: Parameterized deserializers?

2008-09-12 Thread Tom White
If you make your Serialization implement Configurable it will be given
a Configuration object that it can pass to the Deserializer on
construction.

Also, this thread may be related:
http://www.nabble.com/Serialization-with-additional-schema-info-td19260579.html
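
A rough sketch of the idea, with the Thrift-specific classes left
hypothetical:

  public class MyThriftSerialization<T extends TBase> extends Configured
      implements Serialization<T> {

    public boolean accept(Class<?> c) {
      return TBase.class.isAssignableFrom(c);
    }

    public Deserializer<T> getDeserializer(Class<T> c) {
      // getConf() is populated because the framework instantiates
      // Serializations reflectively and passes the Configuration to
      // anything that is Configurable
      return new MyThriftDeserializer<T>(c, getConf());   // hypothetical class
    }

    public Serializer<T> getSerializer(Class<T> c) {
      return new MyThriftSerializer<T>(c, getConf());     // hypothetical class
    }
  }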

Tom

On Sat, Sep 13, 2008 at 12:38 AM, Pete Wyckoff [EMAIL PROTECTED] wrote:

 I should mention this is out of the context of SequenceFiles where we get
 the class names in the file itself. Here there is some information inserted
 into the JobConf that tells me the class of the records in the input file.


 -- pete


 On 9/12/08 3:26 PM, Pete Wyckoff [EMAIL PROTECTED] wrote:


 If I have a generic Serializer/Deserializers that take some runtime
 information to instantiate, how would this work in the current
 serializer/deserializer APIs? And depending on this runtime information, may
 return different Objects although they may all derive from the same class.

 For example, for Thrift, I may have something called a ThriftSerializer that
 is general:

 {code}
 public class ThriftDeserializer<T extends ThriftBase> implements
 Deserializer<T> {
   T deserialize(T);
 }
 {code}

 How would I instantiate this, since the current getDeserializer takes only
 the Class but not a configuration object?
 How would I implement createKey in RecordReader?


 In other words, I think we need a {code}Class<?> getClass();{code} method
 in Deserializer and a {code}Deserializer getDeserializer(Class,
 Configuration conf);{code} method in Serializer.java.

 Or is there another way to do this?

 If not, I can open a JIRA for implementing parameterized serializers.

 Thanks, pete







Re: Hadoop EC2

2008-09-04 Thread Tom White
On Thu, Sep 4, 2008 at 1:46 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 I'm noticing that using bin/hadoop fs -put ... s3://... is uploading
 multi-gigabyte files in ~64MB chunks.

That's because S3Filesystem stores files as 64MB blocks on S3.

 Then, when this is copied from
 S3 into HDFS using bin/hadoop distcp. Once the files are there and the
 job begins, it looks like it's breaking up the 4 multigigabyte text
 files into about 225 maps. Does this mean that each map is roughly
 processing 64MB of data each?

Yes, HDFS stores files as 64MB blocks too, and map input is split by
default so each map processes one block.

If so, is there any way to change this
 so that I can get my map tasks to process more data at a time? I'm
 curious if this will shorten the time it takes to run the program.

You could try increasing the HDFS block size. 128MB is actually
usually a better value, for this very reason.
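
For example, a sketch (the block size only applies to files written with this
configuration, so set it before copying data into HDFS; it can equally be set
in hadoop-site.xml):

  Configuration conf = new Configuration();
  conf.setLong("dfs.block.size", 128L * 1024 * 1024);  // 128MB blocks for new files
  FileSystem hdfs = FileSystem.get(URI.create("hdfs://namenode:9000/"), conf);
  // files created through 'hdfs' with this conf now get the larger block size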

In the future https://issues.apache.org/jira/browse/HADOOP-2560 will
help here too.


 Tom, in your article about Hadoop + EC2 you mention processing about
 100GB of logs in under 6 minutes or so.

In this article:
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873,
it took 35 minutes to run the job. I'm planning on doing some
benchmarking on EC2 fairly soon, which should help us improve the
performance of Hadoop on EC2. It's worth remarking that this was
running on small instances. The larger instances perform a lot better
in my experience.

 Do you remember how many EC2
 instances you had running, and also how many map tasks did you have to
 operate on the 100GB? Was each map task handling about 1GB each?

I was running 20 nodes, and each map task was handling a HDFS block, 64MB.

Hope this helps,

Tom


Re: Reading and writing Thrift data from MapReduce

2008-09-03 Thread Tom White
Hi Juho,

I think you should be able to use the Thrift serialization stuff that
I've been working on in
https://issues.apache.org/jira/browse/HADOOP-3787 - at least as a
basis. Since you are not using sequence files, you will need to write
an InputFormat (probably one that extends FileInputFormat) and an
associated RecordReader that knows how to break the input into logical
records. See SequenceFileInputFormat for the kind of thing. Also,
since you store the Thrift type's name in the data, you can use a
variant of ThriftDeserializer that first reads the type name and
instantiates an instance of the type before reading its fields from
the stream.
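
A very rough skeleton of that, using the old mapred API (all names are made
up, and the record-boundary logic is the part you would have to supply):

  public class ThriftLogInputFormat extends FileInputFormat<Text, BytesWritable> {

    protected boolean isSplitable(FileSystem fs, Path file) {
      return false;   // simplest option: one map per file, no mid-file splits
    }

    public RecordReader<Text, BytesWritable> getRecordReader(InputSplit split,
        JobConf job, Reporter reporter) throws IOException {
      reporter.setStatus(split.toString());
      // ThriftLogRecordReader is hypothetical: its next() would read the type
      // name, then the raw record bytes, for each logical record in the split
      return new ThriftLogRecordReader((FileSplit) split, job);
    }
  }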

Hope this helps.

Tom

On Wed, Sep 3, 2008 at 8:32 AM, Juho Mäkinen [EMAIL PROTECTED] wrote:
 Thanks Jeff. I believe that you mean the serde module inside hadoop
 (hadoop-core-trunk\src\contrib\hive\serde)?
 I'm currently looking into it, but it seems to lack a lot of useful
 documentation so it'll take me some time to figure it out (all
 additional info is appreciated).

  I've already put some effort into this and designed a partial
  solution for my log analysis which so far seems ok to me. As I don't
  know the details of serde yet, I'm not sure if this is the way I
  should go, or whether I should change my implementation and plans so
  that I could use serde (if it makes my job easier). I'm not yet
  interested in HIVE, but I'd like to keep the option open, so that I
  could easily run hive on my data in the future (so that I would not
  need to transform my data for hive if I choose to use it).

 Currently I've come up with the following design:
  1) Each log event type has its own thrift structure. The structure is
  compiled into php code. The log entry creator creates and populates
  the structure php object with data and sends it to be stored.
  2) The log sender object receives this object ($tbase) and serializes it
  using TBinaryProtocol over a memory transport, adds the structure name
  to the beginning and sends the byte array to the log receiver using UDP.
  The following code does this:

  $this->transport = new TResetableMemoryBuffer(); // a TMemoryBuffer with a reset() method
  $this->protocol = new TBinaryProtocol($this->transport);
  $this->transport->open();

  $this->transport->reset(); // Reset the memory buffer
  $this->protocol->writeByte(1); // version 1: we have the TBase name in string
  $this->protocol->writeString($tbase->getName()); // Name of the structure
  $tbase->write($this->protocol); // Serialize our thrift structure to the memory buffer

  $this->sendBytes($this->transport->getBuffer());

  3) The log receiver reads the structure name and stores the byte array
  (without the version byte and structure name) into an HDFS file
  /events/<insert structure name here>/<week number>/<timestamp>.datafile

  My plan is that I could read the stored entries using MapReduce,
  deserialize them into java objects (the map-reducer would need to have
  the thrift compiled structures available) and use the structures
  directly in Map operations. (How) can serde help me with this part?
  Should I modify my plans so that I could use HIVE directly in the
  future? How does Hive store the thrift serialized log data in HDFS?

  - Juho Mäkinen

 On Wed, Sep 3, 2008 at 7:37 AM, Jeff Hammerbacher
 [EMAIL PROTECTED] wrote:
 Hey Juho,

 You should check out Hive
 (https://issues.apache.org/jira/browse/HADOOP-3601), which was just
 committed to the Hadoop trunk today. It's what we use at Facebook to
 query our collection of Thrift-serialized logfiles. Inside of the Hive
 code, you'll find a pure-Java (using JavaCC) parser for
 Thrift-serialized data structures.

 Regards,
 Jeff

 On Tue, Sep 2, 2008 at 6:57 AM, Stuart Sierra [EMAIL PROTECTED] wrote:
 On Tue, Sep 2, 2008 at 3:53 AM, Juho Mäkinen [EMAIL PROTECTED] wrote:
 What's the current status of Thrift with Hadoop? Is there any
 documentation online or even some code in the SVN which I could look
 into?

 I think you have two choices: 1) wrap your Thrift code in a class that
 implements Writable, or 2) use Thrift to serialize your data to byte
 arrays and store them as BytesWritable.
 -Stuart





Re: Error while uploading large file to S3 via Hadoop 0.18

2008-09-03 Thread Tom White
For the s3:// filesystem, files are split into 64MB blocks which are
sent to S3 individually. Rather than increase the jets3t.properties
retry buffer and retry count, it is better to change the Hadoop
properties fs.s3.maxRetries and fs.s3.sleepTimeSeconds, since the
Hadoop-level retry mechanism retries the whole block transfer, and the
block is stored on disk, so it doesn't consume memory. (The jets3t
mechanism is still useful for metadata operation retries.) See
https://issues.apache.org/jira/browse/HADOOP-997 for background.
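
For example, a sketch (these can equally go in hadoop-site.xml):

  Configuration conf = new Configuration();
  conf.setInt("fs.s3.maxRetries", 10);        // retries per failed block transfer
  conf.setInt("fs.s3.sleepTimeSeconds", 30);  // pause between retries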

Tom

On Tue, Sep 2, 2008 at 4:23 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Actually not if you're using the s3:// as opposed to s3n:// ...

 Thanks,
 Ryan


 On Tue, Sep 2, 2008 at 11:21 AM, James Moore [EMAIL PROTECTED] wrote:
 On Mon, Sep 1, 2008 at 1:32 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello,

 I'm trying to upload a fairly large file (18GB or so) to my AWS S3
 account via bin/hadoop fs -put ... s3://...

 Isn't the maximum size of a file on s3 5GB?

 --
 James Moore | [EMAIL PROTECTED]
 Ruby and Ruby on Rails consulting
 blog.restphone.com




Re: Hadoop EC2

2008-09-03 Thread Tom White
There's a case study with some numbers in it from a presentation I
gave on Hadoop and AWS in London last month, which you may find
interesting: http://skillsmatter.com/custom/presentations/ec2-talk.pdf.

tim robertson [EMAIL PROTECTED] wrote:
 For these small
 datasets, you might find it useful - let me know if I should spend
 time finishing it (Or submit help?) - it is really very simple.

This sounds very useful. Please consider creating a Jira and
submitting the code (even if it's not finished folks might like to
see it). Thanks.

Tom


 Cheers

 Tim



 On Tue, Sep 2, 2008 at 2:22 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hi Tim,

 Are you mostly just processing/parsing textual log files? How many
 maps/reduces did you configure in your hadoop-ec2-env.sh file? How
 many did you configure in your JobConf? Just trying to get an idea of
 what to expect in terms of performance. I'm noticing that it takes
 about 16 minutes to transfer about 15GB of textual uncompressed data
 from S3 into HDFS after the cluster has started with 15 nodes. I was
 expecting this to take a shorter amount of time, but maybe I'm
 incorrect in my assumptions. I am also noticing that it takes about 15
 minutes to parse through the 15GB of data with a 15 node cluster.

 Thanks,
 Ryan


 On Tue, Sep 2, 2008 at 3:29 AM, tim robertson [EMAIL PROTECTED] wrote:
 I have been processing only 100s of GBs on EC2, not 1000's, using 20
 nodes, and really only in the exploration and testing phase right now.


 On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock [EMAIL PROTECTED] wrote:
 Hi Ryan,

 Just a heads up, if you require more than the 20 node limit, Amazon
 provides a form to request a higher limit:

 http://www.amazon.com/gp/html-forms-controller/ec2-request

 Andrew

 On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello all,

 I'm curious to see how many people are using EC2 to execute their
 Hadoop cluster and map/reduce programs, and how many are using
 home-grown datacenters. It seems like the 20 node limit with EC2 is a
 bit crippling when one wants to process many gigabytes of data. Has
 anyone found this to be the case? How much data are people processing
 with their 20 node limit on EC2? Curious what the thoughts are...

 Thanks,
 Ryan







Re: Hadoop Book

2008-09-03 Thread Tom White
Lukáš, Feris, I'll be sure to post a message to the list when the
book's available as a Rough Cut.

Tom

2008/8/28 Feris Thia [EMAIL PROTECTED]:
 Agree...

 I will be glad to be early notified about the release :)

 Regards,

 Feris

 2008/8/29 Lukáš Vlček [EMAIL PROTECTED]

 Tom,

 Do you think you could drop a small note into this list once it is
 available?

 Lukas





Re: Hadoop EC2

2008-09-03 Thread Tom White
On Wed, Sep 3, 2008 at 3:05 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Tom,

 I noticed that you mentioned using Amazon's new elastic block store as
 an alternative to using S3. Right now I'm testing pushing data to S3,
 then moving it from S3 into HDFS once the Hadoop cluster is up and
 running in EC2. It works pretty well -- moving data from S3 to HDFS is
 fast when the data in S3 is broken up into multiple files, since
 bin/hadoop distcp uses a Map/Reduce job to efficiently transfer the
 data.

Yes, this is a good-enough solution for many applications.


 Are there any real advantages to using the new elastic block store? Is
 moving data from the elastic block store into HDFS any faster than
 doing it from S3? Or can HDFS essentially live inside of the elastic
 block store?

Bandwidth between EBS and EC2 is better than between S3 and EC2, so if
you intend to run MapReduce on your data then you might consider
running an elastic Hadoop cluster that stores data on EBS-backed HDFS.
The nice thing is that you can shut down the cluster when you're not
using it and then restart it later. But if you have other applications
that need to access data from S3, then this may not be appropriate.
Also, it may not be as fast as HDFS using local disks for storage.

This is a new area, and I haven't done any measurements, so a lot of
this is conjecture on my part. Hadoop on EBS doesn't exist yet - but
it looks like a natural fit.


 Thanks!

 Ryan


 On Wed, Sep 3, 2008 at 9:54 AM, Tom White [EMAIL PROTECTED] wrote:
 There's a case study with some numbers in it from a presentation I
 gave on Hadoop and AWS in London last month, which you may find
 interesting: http://skillsmatter.com/custom/presentations/ec2-talk.pdf.

 tim robertson [EMAIL PROTECTED] wrote:
 For these small
 datasets, you might find it useful - let me know if I should spend
 time finishing it (Or submit help?) - it is really very simple.

 This sounds very useful. Please consider creating a Jira and
 submitting the code (even if it's not finished folks might like to
 see it). Thanks.

 Tom


 Cheers

 Tim



 On Tue, Sep 2, 2008 at 2:22 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hi Tim,

 Are you mostly just processing/parsing textual log files? How many
 maps/reduces did you configure in your hadoop-ec2-env.sh file? How
 many did you configure in your JobConf? Just trying to get an idea of
 what to expect in terms of performance. I'm noticing that it takes
 about 16 minutes to transfer about 15GB of textual uncompressed data
 from S3 into HDFS after the cluster has started with 15 nodes. I was
 expecting this to take a shorter amount of time, but maybe I'm
 incorrect in my assumptions. I am also noticing that it takes about 15
 minutes to parse through the 15GB of data with a 15 node cluster.

 Thanks,
 Ryan


 On Tue, Sep 2, 2008 at 3:29 AM, tim robertson [EMAIL PROTECTED] wrote:
 I have been processing only 100s GBs on EC2, not 1000's and using 20
 nodes and really only in exploration and testing phase right now.


 On Tue, Sep 2, 2008 at 8:44 AM, Andrew Hitchcock [EMAIL PROTECTED] 
 wrote:
 Hi Ryan,

 Just a heads up, if you require more than the 20 node limit, Amazon
 provides a form to request a higher limit:

 http://www.amazon.com/gp/html-forms-controller/ec2-request

 Andrew

 On Mon, Sep 1, 2008 at 10:43 PM, Ryan LeCompte [EMAIL PROTECTED] wrote:
 Hello all,

 I'm curious to see how many people are using EC2 to execute their
 Hadoop cluster and map/reduce programs, and how many are using
 home-grown datacenters. It seems like the 20 node limit with EC2 is a
 bit crippling when one wants to process many gigabytes of data. Has
 anyone found this to be the case? How much data are people processing
 with their 20 node limit on EC2? Curious what the thoughts are...

 Thanks,
 Ryan









Re: EC2 AMI for Hadoop 0.18.0

2008-09-03 Thread Tom White
I've just created public AMIs for 0.18.0. Note that they are in the
hadoop-images bucket.

Tom

On Fri, Aug 29, 2008 at 9:22 PM, Karl Anderson [EMAIL PROTECTED] wrote:

 On 29-Aug-08, at 6:49 AM, Stuart Sierra wrote:

 Anybody have one?  Any success building it with create-hadoop-image?
 Thanks,
 -Stuart

 I was able to build one following the instructions in the wiki.  You'll need
 to find the Java download url (see wiki) and put it and your own S3 bucket
 name in hadoop-ec2-env.sh.





Re: Hadoop Book

2008-08-28 Thread Tom White
That's right, I'm writing a book on Hadoop for O'Reilly. It will be a
part of the Rough Cuts program (http://oreilly.com/roughcuts/), which
means it'll be available as writing progresses.

Tom

2008/8/28 Lukáš Vlček [EMAIL PROTECTED]:
 BTW: I found (http://skillsmatter.com/custom/presentations/ec2-talk.pdf)
 that Tom White is working on Hadoop book now.

 Lukas

 2008/8/26 Feris Thia [EMAIL PROTECTED]

 Hi Lukas,

 I've check on Youtube.. and yes, there are many explanations on Hadoop.

 Thanks for your guide :)

 Regards,

 Feris

 On Tue, Aug 26, 2008 at 1:39 AM, Lukáš Vlček [EMAIL PROTECTED]
 wrote:

  Hi,
 
   As far as I know, there is no Hadoop-specific book yet. However, you can
   find several interesting video presentations from Google or Yahoo! Hadoop
   meetings. There are good tutorials on the net as well as several
   interesting blog posts (several people involved in Hadoop development do
   regularly blog about Hadoop) and you can read the user and dev mailing
   lists (and you can also ask questions there! - you can not do this with a
   book).
  
   On the other hand Hadoop is under development and as such the API can
   change and new features can be added every day. Hadoop is not settled
   down the same way Oracle is now. But I am *sure* a book about Hadoop is
   coming in the future because there is a demand...
 
  Regards,
  Lukas
 
 
  --
  http://blog.lukas-vlcek.com/
 




 --
 http://blog.lukas-vlcek.com/



Re: Namenode Exceptions with S3

2008-07-17 Thread Tom White
On Thu, Jul 17, 2008 at 6:16 PM, Doug Cutting [EMAIL PROTECTED] wrote:
 Can't one work around this by using a different configuration on the client
 than on the namenodes and datanodes?  The client should be able to set
 fs.default.name to an s3: uri, while the namenode and datanode must have it
 set to an hdfs: uri, no?

Yes, that's a good solution.

 It might be less confusing if the HDFS daemons didn't use
 fs.default.name to define the namenode host and port. Just like
 mapred.job.tracker defines the host and port for the jobtracker,
 dfs.namenode.address (or similar) could define the namenode. Would
 this be a good change to make?

 Probably.  For back-compatibility we could leave it empty by default,
 deferring to fs.default.name, only if folks specify a non-empty
 dfs.namenode.address would it be used.

I've opened https://issues.apache.org/jira/browse/HADOOP-3782 for this.

Tom


Re: Namenode Exceptions with S3

2008-07-11 Thread Tom White
On Thu, Jul 10, 2008 at 10:06 PM, Lincoln Ritter
[EMAIL PROTECTED] wrote:
 Thank you, Tom.

 Forgive me for being dense, but I don't understand your reply:


Sorry! I'll try to explain it better (see below).


 Do you mean that it is possible to use the Hadoop daemons with S3 but
 the default filesystem must be HDFS?

The HDFS daemons use the value of fs.default.name to set the
namenode host and port, so if you set it to an s3 URI, you can't run
the HDFS daemons. So in this case you would use the start-mapred.sh
script instead of start-all.sh.

 If that is the case, can I
 specify the output filesystem on a per-job basis and can that be an S3
 FS?

Yes, that's exactly how you do it.
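
For example, a sketch (the bucket and job class names are placeholders, and
the AWS credentials are assumed to be set in the configuration):

  JobConf conf = new JobConf(MyJob.class);
  // default filesystem stays HDFS; only this job's input and output are on S3
  FileInputFormat.setInputPaths(conf, new Path("s3://my-bucket/input"));
  FileOutputFormat.setOutputPath(conf, new Path("s3://my-bucket/output"));
  JobClient.runJob(conf);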


 Also, is there a particular reason to not allow S3 as the default FS?

You can allow S3 as the default FS, it's just that then you can't run
HDFS at all in this case. You would only do this if you don't want to
use HDFS at all, for example, if you were running a MapReduce job
which read from S3 and wrote to S3.

It might be less confusing if the HDFS daemons didn't use
fs.default.name to define the namenode host and port. Just like
mapred.job.tracker defines the host and port for the jobtracker,
dfs.namenode.address (or similar) could define the namenode. Would
this be a good change to make?

Tom


Re: Namenode Exceptions with S3

2008-07-11 Thread Tom White
On Fri, Jul 11, 2008 at 9:09 PM, slitz [EMAIL PROTECTED] wrote:
 a) Use S3 only, without HDFS and configuring fs.default.name as s3://bucket
  - PROBLEM: we are getting ERROR org.apache.hadoop.dfs.NameNode:
 java.lang.RuntimeException: Not a host:port pair: X

What command are you using to start Hadoop?

 b) Use HDFS as the default FS, specifying S3 only as input for the first Job
 and output for the last(assuming one has multiple jobs on same data)
  - PROBLEM: https://issues.apache.org/jira/browse/HADOOP-3733

Yes, this is a problem. I've added a comment to the Jira description
describing a workaround.

Tom


Re: Namenode Exceptions with S3

2008-07-10 Thread Tom White
 I get (where the all-caps portions are the actual values...):

 2008-07-01 19:05:17,540 ERROR org.apache.hadoop.dfs.NameNode:
 java.lang.NumberFormatException: For input string:
 [EMAIL PROTECTED]
at 
 java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
at java.lang.Integer.parseInt(Integer.java:447)
at java.lang.Integer.parseInt(Integer.java:497)
at org.apache.hadoop.net.NetUtils.createSocketAddr(NetUtils.java:128)
at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:121)
at org.apache.hadoop.dfs.NameNode.init(NameNode.java:178)
at org.apache.hadoop.dfs.NameNode.init(NameNode.java:164)
at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:848)
at org.apache.hadoop.dfs.NameNode.main(NameNode.java:857)

 These exceptions are taken from the namenode log.  The datanode logs
 show the same exceptions.

If you make the default filesystem S3 then you can't run HDFS daemons.
If you want to run HDFS and use an S3 filesystem, you need to make the
default filesystem a hdfs URI, and use s3 URIs to reference S3
filesystems.

Hope this helps.

Tom


Re: Hadoop on EC2 + S3 - best practice?

2008-07-01 Thread Tom White
Hi Tim,

The steps you outline look about right. Because your file is 5GB you
will need to use the S3 block file system, which has a s3 URL. (See
http://wiki.apache.org/hadoop/AmazonS3) You shouldn't have to build
your own AMI unless you have dependencies that can't be submitted as a
part of the MapReduce job.

To read and write to S3 you can just use s3 URLs. Otherwise you can
use distcp to copy between S3 and HDFS before and after running your
job. This article I wrote has some more tips:
http://developer.amazonwebservices.com/connect/entry.jspa?externalID=873

Hope that helps,

Tom

On Sat, Jun 28, 2008 at 10:24 AM, tim robertson
[EMAIL PROTECTED] wrote:
 Hi all,
 I have data in a file (150million lines at 100Gb or so) and have several
 MapReduce classes for my processing (custom index generation).

 Can someone please confirm the following is the best way to run on EC2 and
 S3 (both of which I am new to..)

 1) load my 100Gb file into S3
 2) create a class that will load the file from S3 and use as input to
 mapreduce (S3 not used during processing) and save output back to S3
 3) create an AMI with the Hadoop + dependencies and my Jar file (loading the
 S3 input and the MR code) - I will base this on the public Hadoop AMI I
 guess
 4) run using the standard scripts

 Is this best practice?
 I assume this is pretty common... is there a better way where I can submit
 my Jar at runtime and just pass in the URL for the input and output files in
 S3?

 If not, has anyone an example that takes input from S3 and writes output to
 S3 also?

 Thanks for advice, or suggestions of best way to run.

 Tim



Re: hadoop on Solaris

2008-06-17 Thread Tom White
I've successfully run Hadoop on Solaris 5.10 (on Intel). The path
included /usr/ucb so whoami was picked up correctly.

Satoshi, you say you added /usr/ucb to you path too, so I'm puzzled
why you get a LoginException saying whoami: not found - did you
export your changes to path?

I've also managed to test and build Hadoop on Solaris. From 0.17
there's support for building the native libraries on Solaris, which
are useful for performance (see
https://issues.apache.org/jira/browse/HADOOP-3123).

Tom

On Tue, Jun 17, 2008 at 11:47 AM, Steve Loughran [EMAIL PROTECTED] wrote:
 Satoshi YAMADA wrote:

 From hadoop doc, only Linux and Windows are supported platforms. Is it
 possible to run
 hadoop on Solaris? Is hadoop implemented in pure java? What kinds of
 problems are there in
 order to port to Solaris? Thanks in advance.

 hi,

 no one seems to reply to the previous hadoop on Solaris Thread.

  I just tried running hadoop on Solaris 5.10 and somehow got an error message.
  If you can give some advice, I would appreciate it. (Single operation seems
  to work.)


 You are probably the first person trying this. This means you have more
 work, but it gives you an opportunity to contribute code back into the next
 release.

 I'd recommend you check out the trunk and try building it and running the
 tests on solaris. Then when the tests fail, you can file bug reports (with
 stack traces) against specific tests. Then -possibly- other people might
 pick up and fix the problems, or you can fix them one by one, submitting
 patches to the bugreps as you go.

 I'm sure the Hadoop team would be happy to have Solaris support, its just a
 matter of whoever has the need sitting down to do it.

 -steve



Re: distcp/ls fails on Hadoop-0.17.0 on ec2.

2008-06-01 Thread Tom White
Hi Einar,

How did you put the data onto S3, using Hadoop's S3 FileSystem or
using other S3 tools? If it's the latter then it won't work as the s3
scheme is for Hadoop's block-based S3 storage. Native S3 support is
coming - see https://issues.apache.org/jira/browse/HADOOP-930, but
it's not integrated yet.

Tom

On Thu, May 29, 2008 at 10:15 PM, Einar Vollset
[EMAIL PROTECTED] wrote:
 Hi,

 I'm using the current Hadoop ec2 image (ami-ee53b687), and am having
 some trouble getting hadoop
 to access S3. Specifically, I'm trying to copy files from my bucket,
 into HDFS on the running cluster, so
 (on the master on the booted cluster) I do:

 hadoop-0.17.0 einar$ bin/hadoop distcp
 s3://ID:[EMAIL PROTECTED]/ input
 08/05/29 14:10:44 INFO util.CopyFiles: srcPaths=[
 s3://ID:[EMAIL PROTECTED]/]
 08/05/29 14:10:44 INFO util.CopyFiles: destPath=input
 08/05/29 14:10:46 WARN fs.FileSystem: localhost:9000 is a deprecated
 filesystem name. Use hdfs://localhost:9000/ instead.
 With failures, global counters are inaccurate; consider running with -i
 Copy failed: org.apache.hadoop.mapred.InvalidInputException: Input
 source  s3://ID:[EMAIL PROTECTED]/ does not
 exist.
at org.apache.hadoop.util.CopyFiles.checkSrcPath(CopyFiles.java:578)
at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:594)
at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:743)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:763)

 ..which clearly doesn't work. The ID:SECRET are right - as if I change
 them I get :

 org.jets3t.service.S3ServiceException: S3 HEAD request failed.
 ResponseCode=403, ResponseMessage=Forbidden
 ..etc

 I suspect it might be a generic problem, as if I do:

 bin/hadoop fs -ls  s3://ID:[EMAIL PROTECTED]/

 I get:
 ls: Cannot access s3://ID:[EMAIL PROTECTED]/ :
 No such file or directory.


 ..even though the bucket is there and has a lot of data in it.


 Any thoughts?

 Cheers,

 Einar



Re: Hadoop 0.17 AMI?

2008-05-22 Thread Tom White
Hi Jeff,

I've built two public 0.17.0 AMIs (32-bit and 64-bit), so you should
be able to use the 0.17 scripts to launch them now.

Cheers,
Tom

On Thu, May 22, 2008 at 6:37 AM, Otis Gospodnetic
[EMAIL PROTECTED] wrote:
 Hi Jeff,

 0.17.0 was released yesterday, from what I can tell.


 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


 - Original Message 
 From: Jeff Eastman [EMAIL PROTECTED]
 To: core-user@hadoop.apache.org
 Sent: Wednesday, May 21, 2008 11:18:56 AM
 Subject: Re: Hadoop 0.17 AMI?

 Any word on 0.17? I was able to build an AMI from a trunk checkout and
 deploy a single node cluster but the create-hadoop-image-remote script
 really wants a tarball in the archive. I'd rather not waste time munging
 the scripts if a release is near.

 Jeff

 Nigel Daley wrote:
  Hadoop 0.17 hasn't been released yet.  I (or Mukund) is hoping to call
  a vote this afternoon or tomorrow.
 
  Nige
 
  On May 14, 2008, at 12:36 PM, Jeff Eastman wrote:
  I'm trying to bring up a cluster on EC2 using
  (http://wiki.apache.org/hadoop/AmazonEC2) and it seems that 0.17 is the
  version to use because of the DNS improvements, etc. Unfortunately, I
  cannot find a public AMI with this build. Is there one that I'm not
  finding or do I need to create one?
 
  Jeff
 
 
 
 




Re: Hadoop 0.17 AMI?

2008-05-14 Thread Tom White
Hi Jeff,

There is no public 0.17 AMI yet - we need 0.17 to be released first.
So in the meantime you'll have to build your own.

Tom

On Wed, May 14, 2008 at 8:36 PM, Jeff Eastman
[EMAIL PROTECTED] wrote:
 I'm trying to bring up a cluster on EC2 using
 (http://wiki.apache.org/hadoop/AmazonEC2) and it seems that 0.17 is the
 version to use because of the DNS improvements, etc. Unfortunately, I
 cannot find a public AMI with this build. Is there one that I'm not
 finding or do I need to create one?

 Jeff




Re: Not able to back up to S3

2008-04-23 Thread Tom White
Part of the problem here is that the error message is confusing. It
looks like there's a problem with the AWS credentials, when in fact
the host name is malformed (but URI isn't telling us). I've created a
patch to make the error message more helpful:
https://issues.apache.org/jira/browse/HADOOP-3301.

Tom

On Fri, Apr 18, 2008 at 11:20 AM, Steve Loughran [EMAIL PROTECTED] wrote:
 Chris K Wensel wrote:

  you cannot have underscores in a bucket name. it freaks out java.net.URI.
 

  freaks out DNS, too, which is why the java.net classes whine. minus signs
 should work

  --
  Steve Loughran  http://www.1060.org/blogxter/publish/5
  Author: Ant in Action   http://antbook.org/



Re: distcp fails when copying from s3 to hdfs

2008-04-04 Thread Tom White
Hi Siddhartha,

This is a problem in 0.16.1
(https://issues.apache.org/jira/browse/HADOOP-3027) that is fixed in
0.16.2, which was released yesterday.

Tom
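
[After upgrading to 0.16.2 the same distcp command should work unchanged; a trivial sketch for
confirming which release each node is actually running:

bin/hadoop version   # should report 0.16.2 on every node before re-running the copy
]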

On 04/04/2008, Siddhartha Reddy [EMAIL PROTECTED] wrote:
 I am trying to run a Hadoop cluster on Amazon EC2 and back up all the data to
  Amazon S3 between runs. I am using Hadoop 0.16.1 on a cluster made up of
  CentOS 5 images (ami-08f41161).


  I am able to copy from hdfs to S3 using the following command:

  bin/hadoop distcp file.txt s3://id:[EMAIL PROTECTED]/file.txt


  But copying from S3 to hdfs with the following command fails:

  bin/hadoop distcp s3://id:[EMAIL PROTECTED]/file.txt file2.txt


  with the following error:

  With failures, global counters are inaccurate; consider running with -i
  Copy failed: java.lang.IllegalArgumentException: Hook previously registered
 at
  java.lang.ApplicationShutdownHooks.add(ApplicationShutdownHooks.java:45)
 at java.lang.Runtime.addShutdownHook(Runtime.java:192)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1194)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:148)
 at org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:81)
 at
  org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1180)
 at org.apache.hadoop.fs.FileSystem.access$400(FileSystem.java:53)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1197)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:148)
 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:175)
 at org.apache.hadoop.util.CopyFiles.checkSrcPath(CopyFiles.java:482)
 at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:504)
 at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:580)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
 at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:596)


  Can someone please point out what, if anything, I am doing wrong?

  Thanks,

 Siddhartha Reddy



Re: distcp fails :Input source not found

2008-04-04 Thread Tom White
However, when I try it on 0.15.3, it doesn't allow a folder copy.
I have 100+ files in my S3 bucket, and I had to run distcp on each
one of them to get them onto HDFS on EC2. Not a nice experience!

This sounds like a bug - could you log a Jira issue for this please?

Thanks,
Tom
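
[Until that is fixed, one crude workaround is to drive distcp once per object from a shell
loop - a sketch with hypothetical object names, credentials elided as in the other threads:

for f in part-00000 part-00001 part-00002; do
  bin/hadoop distcp s3://ID:SECRET@mybucket/data/$f data/$f
done
]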


Re: S3 Support in 16.1

2008-03-31 Thread Tom White
Hi Jake,

Yes, this is a known issue that is fixed in 0.16.2 - see
https://issues.apache.org/jira/browse/HADOOP-3027.

Tom

On 31/03/2008, Jake Thompson [EMAIL PROTECTED] wrote:
 So I am new to Hadoop, but everything is working well so far - except for
  the following: I can use the S3 filesystem in 0.15.3 without a problem.

  However, if I try the same in 16.1 I get:
  Exception in thread main java.lang.IllegalArgumentException: Hook
  previously registered
 at
  java.lang.ApplicationShutdownHooks.add(ApplicationShutdownHooks.java:45)
 at java.lang.Runtime.addShutdownHook(Runtime.java:192)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1194)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:148)
 at
  org.apache.hadoop.fs.s3.S3FileSystem.initialize(S3FileSystem.java:81)
 at
  org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1180)
 at org.apache.hadoop.fs.FileSystem.access$400(FileSystem.java:53)
 at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1197)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:148)
 at org.apache.hadoop.fs.FileSystem.getNamed(FileSystem.java:122)
 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:94)
 at org.apache.hadoop.fs.FsShell.init(FsShell.java:79)
 at org.apache.hadoop.fs.FsShell.run(FsShell.java:1567)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
 at org.apache.hadoop.fs.FsShell.main(FsShell.java:1704)


  Same config in both places - any thoughts on this?  I checked Jira and could
   not find a related bug report.


  -Jake



-- 
Blog: http://www.lexemetech.com/


Re: Map/reduce with input files on S3

2008-03-26 Thread Tom White
 I wonder if it is related to:
  https://issues.apache.org/jira/browse/HADOOP-3027


I think it is - the same problem is fixed for me when using the HADOOP-3027 patch.

Tom


Re: Input file globbing

2008-03-21 Thread Tom White
Thanks Hairong,

I've just created https://issues.apache.org/jira/browse/HADOOP-3064 for this.

Tom

On 20/03/2008, Hairong Kuang [EMAIL PROTECTED] wrote:
 Yes, this is a bug. It only occurs when a job's input path contains a
  closure. JobConf.getInputPaths interprets mr/input/glob/2008/02/{02,08} as
  two input paths: mr/input/glob/2008/02/{02 and 08}. Let's see how to fix it.


  Hairong



  On 3/20/08 9:43 AM, Tom White [EMAIL PROTECTED] wrote:

   I'm trying to use file globbing to select various input paths, like so:
  
   conf.setInputPath(new Path(mr/input/glob/2008/02/{02,08}));
  
   But this gives an exception:
  
   Exception in thread main java.io.IOException: Illegal file pattern:
   Expecting set closure character or end of range, or } for glob {02 at
   3
   at org.apache.hadoop.fs.FileSystem$GlobFilter.error(FileSystem.java:1023)
   at 
 org.apache.hadoop.fs.FileSystem$GlobFilter.setRegex(FileSystem.java:1008)
   at org.apache.hadoop.fs.FileSystem$GlobFilter.init(FileSystem.java:926)
   at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:826)
   at org.apache.hadoop.fs.FileSystem.globPaths(FileSystem.java:873)
    at org.apache.hadoop.mapred.FileInputFormat.validateInput(FileInputFormat.java:131)
   at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:541)
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:809)
  
   Looking at the code for JobConf.getInputPaths I see it tokenizes using
   a comma as the delimiter, producing two paths
   mr/input/glob/2008/02/{02 and 08}. This looks like a bug to me.
   I'm surprised as this feature has been around for some time - are
   folks not using it like this?
  
   Tom




-- 
Blog: http://www.lexemetech.com/


Re: Hadoop on EC2 for large cluster

2008-03-20 Thread Tom White
Yes, this isn't ideal for larger clusters. There's a Jira issue to address
this: https://issues.apache.org/jira/browse/HADOOP-2410.

Tom
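
[For reference, the copy step described in the quoted message can at least be scripted over the
instance list reported by the EC2 API tools - a sketch; the awk field for the public DNS name and
the key file paths are assumptions that depend on your tools version and keypair:

for host in $(ec2-describe-instances | awk '/^INSTANCE/ {print $4}'); do
  scp -i ~/.ec2/id_rsa-keypair ~/.ec2/id_rsa-keypair root@$host:/root/.ssh/id_rsa
done
]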

On 20/03/2008, Prasan Ary [EMAIL PROTECTED] wrote:
 Hi All,
   I have been trying to configure Hadoop on EC2 for a cluster with a large number 
 of nodes (100 plus). It seems that I have to copy the EC2 private key to all the machines 
 in the cluster so that they can establish SSH connections.
   For now it seems I have to run a script to copy the key file to each of the 
 EC2 instances. I wanted to know if there is a better way to accomplish this.

   Thanks,

   PA





-- 
Blog: http://www.lexemetech.com/


Re: Issue with cluster over EC2 and different AMI types

2008-03-19 Thread Tom White
Unfortunately there is no way to discover the rack that EC2 instances
are running on so you won't be able to use this optimization.

Tom

On 18/03/2008, Andrey Pankov [EMAIL PROTECTED] wrote:
 Hi,

  I apologize. It was my fault - I forgot to run the tasktracker on the slaves.
  But anyway, can anyone share their experience of how to use racks?
  Thanks.


  Andrey Pankov wrote:
   Hi all,
  
   I'm trying to configure a Hadoop cluster on Amazon EC2, with one m1.small
    instance for the master node and some m1.large instances for slaves. Both
    the master's and the slaves' AMIs have the same version of Hadoop, 0.16.0.
  
   I run the EC2 instances using ec2-run-instances with the same --group
    parameter, but in two steps: one call to run the master, and a second call
    to run the slaves.
  
   It looks like EC2 instances with different AMI types start in different
    networks; for example, the external and internal DNS names are:
  
 * ec2-67-202-59-12.compute-1.amazonaws.com
   ip-10-251-74-181.ec2.internal - for small instance
 * ec2-67-202-3-191.compute-1.amazonaws.com
   domU-12-31-38-00-5C-C1.compute-1.internal - for large
  
   The trouble is that the slaves cannot contact the master. When I set the
    fs.default.name parameter in hadoop-site.xml on a slave box to the full
    DNS name of the master (either external or internal) and try to start a
    datanode on it (bin/hadoop-daemon.sh ... start datanode), Hadoop
    replaces fs.default.name with just 'ip-10-251-74-181' and puts this in the log:
  
   2008-03-18 07:08:16,028 ERROR org.apache.hadoop.dfs.DataNode:
   java.net.UnknownHostException: unknown host: ip-10-251-74-181
   ...
  
   So DataNode could not be started.
  
   I tried specifying the IP address of ip-10-251-74-181 in /etc/hosts on each
    slave instance, and that allowed the DataNode to start on the slaves. It then
    became possible to store something in HDFS. But when I try to run a
    map-reduce job (from a jar file), it doesn't work. I mean the job keeps
    running but there is no progress at all. Hadoop prints Map 0% Reduce
    0% and just freezes.
  
   I cannot find anything in the logs that helps, either on the master or on
    the slave boxes.
  
   I found that dfs.network.script can be used to specify a network location
    for a machine, but I have no idea how to use it. Can racks help me
    with this?
  
   Thanks in advance.
  
   ---
   Andrey Pankov
  
  


 ---

 Andrey Pankov



-- 
Blog: http://www.lexemetech.com/


Re: Amazon S3 questions

2008-03-02 Thread Tom White
  One other note: When you use S3 URIs, you get a port out of range error
  on startup but that doesn't appear to be fatal.  I spent a few hours on that
  one before I realized it didn't seem to matter.  It seems like the S3 URI
  format where ':' is used to separate ID and secret key is confusing someone.

Do you have a stacktrace for this? Sounds like something we could
improve, if only by printing a warning message.

Tom