How to compile HBase code ?
Hello guys, in case any of you are working on HBase: I just wrote a program by reading some tutorials, but nowhere is it mentioned how to run code against HBase. If anyone of you has done some coding on HBase, can you please tell me how to run it? I am able to compile my code by adding hbase-core.jar and hadoop-core.jar to the classpath while compiling, but I am not able to figure out how to run it. Whenever I run java ExampleClient (which is my HBase program), I get the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
        at ExampleClient.main(ExampleClient.java:20)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.HBaseConfiguration
        at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
        ... 1 more

Thanks,
Praveenesh
AW: How to compile HBase code ?
How do you execute the client (command line)? Do you use the java or the hadoop command? It seems that there is an error in your classpath when running the client. The classpath used when compiling the classes that implement the client is different from the classpath when your client is executed, since hadoop and hbase carry their own environment. Maybe the following link helps: http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#classpath

regards
Christian
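The distinction matters because plain "java ExampleClient" only sees the jars you pass yourself, while the hadoop launcher adds its own jars and configuration to the classpath. A rough, untested sketch of the second style (the jar path is an assumption based on a typical install layout; the current directory also has to be on the classpath so the ExampleClient class itself can be found):

    $ javac -classpath /usr/local/hadoop/hbase/hbase/hbase-0.20.6.jar:/usr/local/hadoop/hadoop/hadoop-0.20.2-core.jar ExampleClient.java
    $ HADOOP_CLASSPATH=.:/usr/local/hadoop/hbase/hbase/hbase-0.20.6.jar hadoop ExampleClient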
Re: How to compile HBase code ?
I am simply using the HBase API, not doing any MapReduce work on it. Following is the code I have written, simply creating a table on HBase:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ExampleClient {
  public static void main(String[] args) throws IOException {
    HBaseConfiguration config = new HBaseConfiguration();
    HBaseAdmin admin = new HBaseAdmin(config);
    HTableDescriptor htd = new HTableDescriptor("test");
    HColumnDescriptor hcd = new HColumnDescriptor("data");
    htd.addFamily(hcd);
    admin.createTable(htd);
    byte[] tablename = htd.getName();
    HTableDescriptor[] tables = admin.listTables();
    if (tables.length != 1 && Bytes.equals(tablename, tables[0].getName())) {
      throw new IOException("Failed to create table");
    }
    HTable table = new HTable(config, tablename);
    byte[] row1 = Bytes.toBytes("row1");
    Put p1 = new Put(row1);
    byte[] databytes = Bytes.toBytes("data");
    p1.add(databytes, Bytes.toBytes("1"), Bytes.toBytes("value1"));
    table.put(p1);
    Get g = new Get(row1);
    Result result = table.get(g);
    System.out.println("Get : " + result);
    Scan scan = new Scan();
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result scannerResult : scanner) {
        System.out.println("Scan : " + scannerResult);
      }
    } catch (Exception e) {
      e.printStackTrace();
    } finally {
      scanner.close();
    }
    table.close();
  }
}

Now I have set the classpath variable in /etc/environment as

MYCLASSPATH=/usr/local/hadoop/hadoop/hadoop-0.20.2-core.jar:/usr/local/hadoop/hbase/hbase/hbase-0.20.6.jar:/usr/local/hadoop/hbase/hbase/lib/zookeeper-3.2.2.jar

and I am compiling my code with the javac command:

$ javac -classpath $MYCLASSPATH ExampleClient.java

It is working fine. While running, I am using the java command:

$ java -classpath $MYCLASSPATH ExampleClient

and then I get the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: ExampleClient
Caused by: java.lang.ClassNotFoundException: ExampleClient
        at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
Could not find the main class: ExampleClient. Program will exit.

But I am running the code from the same location, and ExampleClient.class exists at that location.
AW: How to compile HBase code ?
Are you sure that the directory where your ExampleClient.class is located is part of MYCLASSPATH?

regards
Christian
Re: How to compile HBase code ?
Praveenesh,

HBase has its own user mailing lists where such queries ought to go. I am moving the discussion to u...@hbase.apache.org and bcc-ing common-user@ here. Also added you to cc.

Regarding your first error: going forward you can use the handy `hbase classpath` command to generate the HBase-provided classpath list for you automatically. Something like:

$ MYCLASSPATH=`hbase classpath`

Regarding the second, latest one as below: your ExampleClient.class isn't on MYCLASSPATH (nor is the directory it is under, i.e. '.'), so Java can't find it. This is not an HBase issue.

HTH.
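Concretely, the run command just needs the directory holding ExampleClient.class on it as well, e.g. (untested, reusing the MYCLASSPATH from your mail):

    $ javac -classpath $MYCLASSPATH ExampleClient.java
    $ java -classpath .:$MYCLASSPATH ExampleClient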
Re: How to compile HBase code ?
Hey Harsh,

Actually I mailed the HBase mailing list also, but since I wanted to get this done as soon as possible I mailed this group as well. Anyway, I will take care of that in future, although I got more responses on this mailing list :-)

Anyway, the problem is solved. What I did is add the folder containing my .class file to the classpath, along with commons-logging-1.0.4.jar and log4j-1.2.15.jar, so now my MYCLASSPATH variable looks like:

MYCLASSPATH=/usr/local/hadoop/hadoop/hadoop-0.20.2-core.jar:/usr/local/hadoop/hbase/hbase/hbase-0.20.6.jar:/usr/local/hadoop/hbase/hbase/lib/zookeeper-3.2.2.jar:/usr/local/hadoop/hbase/hbase/lib/commons-logging-1.0.4.jar:/usr/local/hadoop/hbase/hbase/lib/log4j-1.2.15.jar:/usr/local/hadoop/hbase/

and then I used

$ java -classpath $MYCLASSPATH ExampleClient

Now it's running. Thanks!!!

Praveenesh
Re: How to compile HBase code ?
Praveenesh,

Good to know your problem is resolved. You can also use the `bin/hbase classpath` utility to generate the HBase parts of the classpath automatically in the future, instead of adding jars manually - it saves you time.

--
Harsh J
Re: How to compile HBase code ?
Hey Harsh,

I tried that, but it's not working. I am using HBase 0.20.6; there is no such command as bin/hbase classpath:

hadoop@ub6:/usr/local/hadoop/hbase$ hbase
Usage: hbase <command>
where <command> is one of:
  shell            run the HBase shell
  master           run an HBase HMaster node
  regionserver     run an HBase HRegionServer node
  rest             run an HBase REST server
  thrift           run an HBase Thrift server
  zookeeper        run a Zookeeper server
  migrate          upgrade an hbase.rootdir
 or
  CLASSNAME        run the class named CLASSNAME

Thanks,
Praveenesh
Re: How to compile HBase code ?
Praveenesh,

Ah yes, it would not work on the older 0.20.x releases; the command exists in the current HBase release.

--
Harsh J
Simple change to WordCount either times out or runs 18+ hrs with little progress
I am attempting to familiarize myself with Hadoop and with utilizing MapReduce in order to process system log files. I tried to start small with a simple map-reduce program similar to the word count example provided. For each line read in, I wanted to grab the 5th word as my output key and the constant 1 as my output value. This seemed simple enough, but it would consistently time out during mapping. I then attempted to run the WordCount example on my data to see if the data was the problem. It was not, as the WordCount example quickly finished with accurate results.

I then took the WordCount example and added a counter to the map so that it would only output the 5th word in the line. When I ran this, it ran for 18+ hrs with little to no progress. I tried a programmatically identical way of getting the 5th word, and it once again timed out. Any help would be appreciated.

I am running the pseudo-distributed layout described by the Quickstart on a Windows XP machine running Cygwin. I am working on hadoop-0.21.0. I have verified that I can run the examples provided and that my nodes and trackers are running properly.

I took the WordCount example code described here:
http://code.google.com/p/hop/source/browse/trunk/src/examples/org/apache/hadoop/examples/WordCount.java?r=1027
and changed the Map function to:

public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    int count = 0;
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      if (count == 5) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
      count++;
    }
  }
}

After 18 hrs 35 min this had the map 0.55% complete. There were no issues in the logs or on the command line. Running this program without the count variable maps in less than a minute on the same data. When I changed it to call itr.nextToken() 4 times before calling it a 5th time to set the word, it timed out. I previously verified that the data always has more than 5 tokens per line. My similar program which timed out regularly used the split function on my delimiter to pull out the 5th word.

Thank you for your help!
- Maryanne DellaSalla
Re: Simple change to WordCount either times out or runs 18+ hrs with little progress
itr.nextToken() is inside the if.

On Tue, May 24, 2011 at 7:29 AM, maryanne.dellasa...@gdc4s.com wrote:

    while (itr.hasMoreTokens()) {
      if (count == 5) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
      count++;
    }
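In other words, the tokenizer is only advanced when count == 5; on every other pass the loop consumes nothing and spins forever on the same line. A minimal sketch of the intended logic, using the same fields as the posted mapper (untested):

    int count = 0;
    while (itr.hasMoreTokens()) {
      String token = itr.nextToken();   // consume a token on every iteration
      if (count == 5) {
        word.set(token);                // emit only the word at position 5
        output.collect(word, one);
        break;                          // nothing else is needed from this line
      }
      count++;
    }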
question about BlockLocation setHosts
Hi all,

I have a question regarding the setHosts method of the BlockLocation class in Hadoop HDFS. Does this cause the block in question to be moved to the specified hosts? Furthermore, where does the getHosts method of BlockLocation get the host names from?

Thanks,
George
RE: Simple change to WordCount either times out or runs 18+ hrs with little progress
Ahh, well that's embarrassing, and it explains the situation where it runs for many hours. I am still baffled as to the split-on-delimiter version timing out, though:

    String line = value.toString();
    String[] splitLine = line.split(",");
    if (splitLine.length >= 5) {
      word.set(splitLine[4]);
      output.collect(word, one);
    }

This runs and times out on map every time. Thanks.

Maryanne DellaSalla
Re: tips and tools to optimize cluster
Worth a look at OpenTSDB ( http://opentsdb.net/ ) as it doesn't lose precision on the historical data. It also has some neat tricks around the collection and display of data.

Another useful tool is 'collectl' ( http://collectl.sourceforge.net/ ), which is a lightweight Perl script that both captures and compresses the metrics, manages its metrics data files, and then filters and presents the metrics as requested. I find collectl lightweight and useful enough that I set it up to capture everything and then leave it running in the background on most systems I build, because when you need the measurement data the event is usually in the past and difficult to reproduce. With collectl running I have a week to recognise the event and analyse/save the relevant data file(s); the data files are approx. 21MB/node/day gzipped.

With a little bit of bash or awk or perl scripting you can convert the collectl output into a form easily loadable into Pig. Pig also has User Defined Functions (UDFs) that can import the Hadoop job history, so with some Pig Latin you can marry your infrastructure metrics with your job metrics; a bit like the cluster eating its own dog food.

BTW, watch out for a little gotcha with Ganglia. It doesn't seem to report the full JVM metrics via gmond, although if you output the JVM metrics to file you get a record for each JVM on the node. I haven't looked into it in detail yet, but it looks like Ganglia only reports the last JVM record in each batch. Anyone else seen this?

Chris

On 24 May 2011 01:48, Tom Melendez t...@supertom.com wrote:

Hi Folks,

I'm looking for tips, tricks and tools to get at node utilization to optimize our cluster. I want to answer questions like:
- what nodes ran a particular job?
- how long did it take for those nodes to run the tasks for that job?
- how/why did Hadoop pick those nodes to begin with?

More detailed questions like:
- how much memory did the task for the job use on that node?
- average CPU load on that node during the task run

And more aggregate questions like:
- are some nodes favored more than others?
- utilization averages (generally, how many cores on that node are in use, etc.)

There are plenty more that I'm not asking, but you get the point.

So, what are you guys using for this? I see some mentions of Ganglia, so I'll definitely look into that. Anything else? Anything you're using to monitor in real-time (like a 'top' across the nodes or something like that)?

Any info or war-stories greatly appreciated.

Thanks,
Tom
Re: get name of file in mapper output directory
Thanks both for the comments. Even though I finally managed to get the output file of the current mapper, I couldn't use it, because apparently mappers write to a _temporary location while in progress. So in Mapper.close, the file (e.g. part-00000) which it wrote to does not exist yet. There has to be another way to get the produced file; I need to sort it immediately within the mappers. Again, your thoughts are really helpful!

Mark

On Mon, May 23, 2011 at 5:51 AM, Luca Pireddu pire...@crs4.it wrote:

The path is defined by the FileOutputFormat in use. In particular, I think this function is responsible:
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html#getDefaultWorkFile(org.apache.hadoop.mapreduce.TaskAttemptContext, java.lang.String)
It should give you the file path before all tasks have completed and the output is committed to the final output path.

Luca

On May 23, 2011 14:42:04 Joey Echeverria wrote:

Hi Mark,

FYI, I'm moving the discussion over to mapreduce-u...@hadoop.apache.org since your question is specific to MapReduce.

You can derive the output name from the TaskAttemptID, which you can get by calling getTaskAttemptID() on the context passed to your cleanup() function. The task attempt id will look like this:

attempt_200707121733_0003_m_000005_0

You're interested in the m_000005 part; this gets translated into the output file name part-m-00005.

-Joey

On Sat, May 21, 2011 at 8:03 PM, Mark question markq2...@gmail.com wrote:

Hi, I'm running a job with maps only, and I want at the end of each map (i.e. in the close() function) to open the file that the current map has written using its output.collector. I know job.getWorkingDirectory() would give me the parent path of the file written, but how do I get the full path or the name (i.e. part-00000 or part-00001)?

Thanks,
Mark
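For reference, Joey's suggestion above boils down to something like the following inside the mapper (new mapreduce API assumed; untested sketch - it only builds the final file name, and as noted the data may still sit under the job's _temporary directory until the task commits):

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
      // e.g. attempt_200707121733_0003_m_000005_0 -> task number 5
      int taskId = context.getTaskAttemptID().getTaskID().getId();
      String outputName = String.format("part-m-%05d", taskId);  // e.g. part-m-00005
    }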
Processing xml files
I just started learning Hadoop and got done with the WordCount MapReduce example. I also briefly looked at Hadoop streaming. Some questions:

1) What should be my first step now? Are there more examples somewhere that I can try out?
2) The second question is around practical usability using XML files. Our XML files are not big, they are around 120k in size, but Hadoop is really meant for big files, so how do I go about processing these XML files?
3) Are there any samples or advice on how to process XML files?

Looking for help and pointers.
EC2 cloudera cc1.4xlarge
Hello,

I want to use a cc1.4xlarge cluster for some data processing; to spin up clusters I am using the Cloudera scripts. hadoop-ec2-init-remote.sh has default configuration up to c1.xlarge but no configuration for cc1.4xlarge. Can someone give the formula for how these values are calculated based on the hardware?

C1.XLARGE
MAX_MAP_TASKS=8        -> mapred.tasktracker.map.tasks.maximum
MAX_REDUCE_TASKS=4     -> mapred.tasktracker.reduce.tasks.maximum
CHILD_OPTS=-Xmx680m    -> mapred.child.java.opts
CHILD_ULIMIT=1392640   -> mapred.child.ulimit

I am guessing, but I think CHILD_OPTS = (total RAM on the box - 1 GB) / (MAX_MAP_TASKS, MAX_REDUCE_TASKS). But I am not sure how to calculate the rest.

Regards,
Aleksandr
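For example, applying that guess to the cc1.4xlarge (23 GB of RAM), and reading the denominator as the total number of child slots (my assumption), with say 16 map and 8 reduce slots I would get roughly:

    (23 GB - 1 GB) / (16 + 8) ~ 0.9 GB per child  ->  something like -Xmx900m

but I am not sure that is the right way to read it.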
Re: Processing xml files
Hello,

We have the same type of data; we currently convert it to a tab-delimited file and use it as input for streaming.

Regards,
Aleksandr
Re: Processing xml files
On Tue, May 24, 2011 at 4:25 PM, Aleksandr Elbakyan ramal...@yahoo.com wrote:
> We have the same type of data, we currently convert it to a tab-delimited file and use it as input for streaming.

Can you please give more info? Do you append the data from multiple XML files as lines into one file, or some other way? If so, how big do you let the files get? How do you create these files, assuming your XML is stored somewhere else in a DB or filesystem - read them one by one? What are your experiences using text files instead of XML? Is there a reason why XML files can't or shouldn't be used directly in Hadoop? Any performance implications? Any readings suggested in this area?

Our XML is something like:

<column id="Name" security="sensitive" xsi:type="Text">
  <value>free a last</value>
</column>
<column id="age" security="no" xsi:type="Text">
  <value>40</value>
</column>

And we would, for example, want to know how many customers are above a certain age, or of a certain age with a certain income, etc.

Sorry for all the questions. I am new and trying to get a grasp, and also learn how I would actually solve our use case.
Re: Sorting ...
Thanks Luca, but what other way is there to sort a directory of sequence files? I don't plan to write a sorting algorithm in the mappers/reducers; I was hoping to use SequenceFile.Sorter instead. Any ideas?

Mark

On Mon, May 23, 2011 at 12:33 AM, Luca Pireddu pire...@crs4.it wrote:

On May 22, 2011 03:21:53 Mark question wrote:
> I'm trying to sort Sequence files using the Hadoop-Example TeraSort. But after taking a couple of minutes, the output is empty. <snip> I'm trying to find what the input format for the TeraSort is, but it is not specified. Thanks for any thought, Mark

TeraSort sorts lines of text. The InputFormat (for version 0.20.2) is in hadoop-0.20.2/src/examples/org/apache/hadoop/examples/terasort/TeraInputFormat.java. The documentation at the top of the class says: "An input format that reads the first 10 characters of each line as the key and the rest of the line as the value."

HTH

Luca
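If SequenceFile.Sorter is the route, a rough sketch (untested; classes from org.apache.hadoop.conf, org.apache.hadoop.fs and org.apache.hadoop.io; the key/value classes here are assumptions - use whatever the files were actually written with):

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Sorter sorter =
        new SequenceFile.Sorter(fs, Text.class, Text.class, conf);
    Path[] inputs = FileUtil.stat2Paths(fs.listStatus(new Path("seq-input-dir")));
    sorter.sort(inputs, new Path("seq-sorted/part-00000"), false);  // false = keep the inputs

Note this runs in a single JVM, so it only makes sense when the data fits comfortably on one machine; for anything larger a MapReduce sort (like TeraSort) is the usual answer.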
Re: tips and tools to optimize cluster
Thanks Chris, these are quite helpful.

Thanks,
Tom
Re: Processing xml files
> Can you please give more info?

We currently have an off-Hadoop process which uses a Java XML parser to convert it to flat files. We have files from a couple of KB to 10s of GB.

> Do you append multiple xml files data as a line into one file? Or some other way? If so then how big do you let files be?

We currently feed our process a folder with the converted files. We don't size it in any way; we let Hadoop handle it.

> How do you create these files assuming your xml is stored somewhere else in the DB or filesystem? Read them one by one? What are your experiences using text files instead of xml?

If you are using a streaming job it is easier to build your logic if you have one file. You can actually try to parse the XML in your mapper and convert it for the reducer, but why don't you just write a small app which converts it?

> Reason why xml files can't be directly used in hadoop or shouldn't be used? Any performance implications?

If you are using Pig there is an XML reader:
http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/XMLLoader.html
If you have a well-defined schema it is easier to work with big data :)

> Any readings suggested in this area?

Try looking into Pig; it has lots of useful stuff which will make your experience with Hadoop nicer.

> And we would for eg want to know how many customers above certain age or certain age with certain income etc.

Hadoop has built-in counters. Did you look into the word count example from Hadoop?

Regards,
Aleksandr
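PS: the "small app which converts it" can be very small. A rough sketch against the sample <column> records earlier in the thread (untested; it assumes the columns are wrapped in a single root element and just emits the <value> texts as one tab-separated line):

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class XmlToTsv {
      public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().parse(new File(args[0]));
        NodeList columns = doc.getElementsByTagName("column");
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < columns.getLength(); i++) {
          Element col = (Element) columns.item(i);
          String value = col.getElementsByTagName("value").item(0).getTextContent();
          if (i > 0) line.append('\t');
          line.append(value);            // e.g. "free a last", then "40"
        }
        System.out.println(line);        // one tab-delimited record per input file
      }
    }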
Re: EC2 cloudera cc1.4xlarge
I looked into different cluster configurations from Cloudera and came up with these numbers; let me know what you think...

Machine: 23 GB of memory, 33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core "Nehalem" architecture), 1690 GB of instance storage, 64-bit platform, I/O performance: very high (10 Gigabit Ethernet). API name: cc1.4xlarge

MAX_MAP_TASKS=16       -> mapred.tasktracker.map.tasks.maximum
MAX_REDUCE_TASKS=8     -> mapred.tasktracker.reduce.tasks.maximum
CHILD_OPTS=-Xmx1024m   -> mapred.child.java.opts
CHILD_ULIMIT=1392640   -> mapred.child.ulimit

Regards,
Aleksandr
Checkpoint vs Backup Node
As far as my understanding goes, I feel that the Backup node is much more efficient than the Checkpoint node, as it also has a current (up-to-date) copy of the file system. I do not understand what the use case would be (in a production environment) in which someone would prefer a Checkpoint node over a Backup node. Or I should ask: what do people generally prefer of the two, and why?
Re: Processing xml files
Thanks, some more questions :)

On Tue, May 24, 2011 at 4:54 PM, Aleksandr Elbakyan ramal...@yahoo.com wrote:

> We currently have an off-Hadoop process which uses a Java XML parser to convert it to flat files. We have files from a couple of KB to 10s of GB.

Do you convert it into a flat file and write it to HDFS? Do you write all the files to the same directory in DFS, or do you group directories based on days, for example? So, say, 2011/01/01 contains 10 files; store the results of those 10 files somewhere, and then on 2011/02/02 store another, say, 20 files. Now analyze the 20 files and use the results from the earlier 10 files to do the aggregation. If so, how do you do it? Or how should I do it, since it would be overhead to process those files again? Please point me to examples so that you don't have to teach me Hadoop or Pig processing :)

> We currently feed our process a folder with converted files. We don't size it in any way; we let Hadoop handle it.

I hadn't thought about that. I was just thinking in terms of using big files. So when using small files, Hadoop will automatically distribute the files across the cluster, I am assuming based on some hashing.

> If you are using a streaming job it is easier to build your logic if you have one file ... If you are using Pig there is an XML reader: http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/XMLLoader.html

Which one is better, converting files to flat files or using XML as-is? How do I make that decision?

> Try looking into Pig; it has lots of useful stuff which will make your experience with Hadoop nicer.

I will download the Pig tutorial and see how that works. Is there any other XML-related example you can point me to? Thanks a lot!
Re: Checkpoint vs Backup Node
Hi Sulabh,

Neither of these nodes has been productionized, so I don't think anyone will have a good answer for you about what works in production. They are only available in 0.21 and haven't had any substantial QA.

One of the potential issues with the BN is that it can delay the logging of edits by the primary NN if the BN were to hang or go offline. The CN would not have such an issue.

-Todd

--
Todd Lipcon
Software Engineer, Cloudera
Re: EC2 cloudera cc1.4xlarge
Try the Cloudera-specific lists with your questions.

--
Take care,
Konstantin (Cos) Boudnik
Re: Processing xml files
Hello,

We currently have a complicated process which has more than 20 jobs piped into each other. We are using a shell script to control the flow; I saw some other company using Spring Batch. We use Pig, streaming and Hive.

Note one thing: if you are using EC2 for your jobs, all local files need to be stored in /mnt.

Currently our cluster is organized this way in HDFS: we process our data hourly and rotate the final result back to the beginning of the pipeline for the next run. Each process's output is the next process's input, so we keep all data for the current execution in the same dated folder. If you run daily it will be e.g. 20111212, if hourly 201112121416, with a subfolder for each subprocess in it. Example:

/user/{domain}/{date}/input
/user/{domain}/{date}/process1
/user/{domain}/{date}/process2
/user/{domain}/{date}/process3
/user/{domain}/{date}/process4

Our process1 takes as input the newly converted files for the current run plus the output from the last run. After we start the job we load the converted files into the input location and move them out of the local space so we will not reprocess them.

I am not sure there are examples for this; it all depends on the architecture of the project you are doing. I bet if you put everything you need to do on a whiteboard you will find the best folder structure for yourself :)

Regards,
Aleksandr
If you have a well-defined schema it is easier to work with big data :)

Any readings suggested in this area? Try looking into Pig; it has lots of useful stuff which will make your experience with Hadoop nicer.

I will download the Pig tutorial and see how that works. Are there any other XML-related examples you can point me to? Thanks a lot! Our XML is something like:

<column id="Name" security="sensitive" xsi:type="Text"><value>free a last</value></column>
<column id="age" security="no" xsi:type="Text"><value>40</value></column>

And we would, for example, want to know how many customers are above a certain age, or of a certain age with a certain income, etc.

Hadoop has built-in counters; did you look into the word count example from Hadoop?

Regards, Aleksandr

--- On Tue, 5/24/11, Mohit Anchlia mohitanch...@gmail.com wrote: From: Mohit Anchlia mohitanch...@gmail.com Subject: Re: Processing xml files To: common-user@hadoop.apache.org Date: Tuesday, May 24, 2011, 4:41 PM

On Tue, May 24, 2011 at 4:25 PM, Aleksandr Elbakyan ramal...@yahoo.com wrote: Hello, We have the same type of data; we currently convert it to a tab-delimited file and use it as input for streaming.

Can you please give more info? Do you append data from multiple XML files as lines into one file, or some other way? If so, how big do you let the files get? How do you create these files, assuming your XML is stored somewhere else in
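Since the thread mentions both "write a small app which converts it" and the <column>/<value> layout quoted above, here is a minimal sketch of such a converter, run outside Hadoop, assuming records shaped like that example (the class name, the wrapping <record> element and the field order are placeholders of mine, not anything from the original mails):

    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;

    public class ColumnXmlToTsv {
        public static void main(String[] args) throws Exception {
            // Hypothetical record wrapped in a root element so it parses as a document;
            // the real files presumably have their own enclosing record element.
            String record =
                "<record xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">"
              + "<column id=\"Name\" security=\"sensitive\" xsi:type=\"Text\"><value>free a last</value></column>"
              + "<column id=\"age\" security=\"no\" xsi:type=\"Text\"><value>40</value></column>"
              + "</record>";

            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(record)));

            // Collect each column's value in document order and join with tabs,
            // which is the flat form a streaming mapper or Pig script can split cheaply.
            NodeList columns = doc.getElementsByTagName("column");
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < columns.getLength(); i++) {
                Element col = (Element) columns.item(i);
                String value = col.getElementsByTagName("value").item(0).getTextContent();
                if (i > 0) line.append('\t');
                line.append(value);
            }
            System.out.println(line); // prints: free a last<TAB>40
        }
    }

Once each record is one tab-delimited line, the "how many customers above a certain age" question becomes a simple streaming or Pig filter, or a mapper that bumps a Hadoop counter whenever the age field exceeds the threshold, much like the word count example.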
Cannot lock storage, directory is already locked
Hi guys, I'm using an NFS cluster consisting of 30 machines, but I have specified only 3 of the nodes to be my Hadoop cluster. So my problem is this: the datanode won't start on one of the nodes because of the following error:

org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage /cs/student/mark/tmp/hodhod/dfs/data. The directory is already locked

I think it's because of NFS: once one node locks the directory, the second node can't lock it. Any ideas on how to solve this error?

Thanks, Mark
I can't see this email ... So to clarify ..
Hi guys, I'm using an NFS cluster consisting of 30 machines, but I have specified only 3 of the nodes to be my Hadoop cluster. So my problem is this: the datanode won't start on one of the nodes because of the following error:

org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage /cs/student/mark/tmp/hodhod/dfs/data. The directory is already locked

I think it's because of NFS: once one node locks the directory, the second node can't lock it. So I had to change the following configuration: dfs.data.dir to be /tmp/hadoop-user/dfs/data. But this configuration is overridden by ${hadoop.tmp.dir}/dfs/data, where my hadoop.tmp.dir = /cs/student/mark/tmp, as you might guess from above.

Where is this configuration overridden? I thought my core-site.xml had the final configuration values.

Thanks, Mark
Re: I can't see this email ... So to clarify ..
Try moving the configuration to hdfs-site.xml.

One word of warning: if you use /tmp to store your HDFS data, you risk data loss. On many operating systems, files and directories in /tmp are automatically deleted.

-Joey

On Tue, May 24, 2011 at 10:22 PM, Mark question markq2...@gmail.com wrote: Hi guys, I'm using an NFS cluster consisting of 30 machines, but I have specified only 3 of the nodes to be my Hadoop cluster. So my problem is this: the datanode won't start on one of the nodes because of the following error: org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage /cs/student/mark/tmp/hodhod/dfs/data. The directory is already locked. I think it's because of NFS: once one node locks the directory, the second node can't lock it. So I had to change the following configuration: dfs.data.dir to be /tmp/hadoop-user/dfs/data. But this configuration is overridden by ${hadoop.tmp.dir}/dfs/data, where my hadoop.tmp.dir = /cs/student/mark/tmp, as you might guess from above. Where is this configuration overridden? I thought my core-site.xml had the final configuration values. Thanks, Mark

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
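If it helps to see where the substitution happens: the shipped default for dfs.data.dir is ${hadoop.tmp.dir}/dfs/data, and Hadoop's Configuration expands that variable at lookup time, so a value set only through hadoop.tmp.dir silently follows it onto the shared NFS path. A small sketch that just makes the resolution visible (the file paths and class name are placeholders; the HDFS daemons load these resources themselves):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class ShowDataDirResolution {
        public static void main(String[] args) {
            // Loads core-default.xml and core-site.xml from the classpath.
            Configuration conf = new Configuration();

            // hdfs-default.xml defines dfs.data.dir as ${hadoop.tmp.dir}/dfs/data,
            // which is why changing hadoop.tmp.dir alone moves the data directory.
            // Paths below are placeholders for wherever your conf files live.
            conf.addResource(new Path("/path/to/conf/hdfs-default.xml"));
            conf.addResource(new Path("/path/to/conf/hdfs-site.xml"));

            System.out.println("hadoop.tmp.dir = " + conf.get("hadoop.tmp.dir"));
            System.out.println("dfs.data.dir   = " + conf.get("dfs.data.dir"));
        }
    }

With dfs.data.dir set explicitly in hdfs-site.xml to a node-local directory, each datanode gets its own storage path and the NFS lock collision described above should go away.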
Re: I can't see this email ... So to clarify ..
Well, you're right ... moving it to hdfs-site.xml had an effect at least. But now I'm hitting the incompatible namespaceID error:

WARN org.apache.hadoop.hdfs.server.common.Util: Path /tmp/hadoop-mark/dfs/data should be specified as a URI in configuration files. Please update hdfs configuration.
java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-maha/dfs/data

My configuration for this part in hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/tmp/hadoop-mark/dfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/tmp/hadoop-mark/dfs/name</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/cs/student/mark/tmp/hodhod</value>
  </property>
</configuration>

The reason why I want to change hadoop.tmp.dir is that the student quota under /tmp is small, so I wanted to mount hadoop.tmp.dir on /cs/student instead.

Thanks, Mark

On Tue, May 24, 2011 at 7:25 PM, Joey Echeverria j...@cloudera.com wrote: Try moving the configuration to hdfs-site.xml. One word of warning: if you use /tmp to store your HDFS data, you risk data loss. On many operating systems, files and directories in /tmp are automatically deleted. -Joey

On Tue, May 24, 2011 at 10:22 PM, Mark question markq2...@gmail.com wrote: Hi guys, I'm using an NFS cluster consisting of 30 machines, but I have specified only 3 of the nodes to be my Hadoop cluster. So my problem is this: the datanode won't start on one of the nodes because of the following error: org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage /cs/student/mark/tmp/hodhod/dfs/data. The directory is already locked. I think it's because of NFS: once one node locks the directory, the second node can't lock it. So I had to change the following configuration: dfs.data.dir to be /tmp/hadoop-user/dfs/data. But this configuration is overridden by ${hadoop.tmp.dir}/dfs/data, where my hadoop.tmp.dir = /cs/student/mark/tmp, as you might guess from above. Where is this configuration overridden? I thought my core-site.xml had the final configuration values. Thanks, Mark

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
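The "Incompatible namespaceIDs" message usually means the ID recorded in the datanode's storage directory no longer matches the one in the (re-)formatted namenode's storage. If it is useful to confirm that before deciding what to wipe: the IDs live in plain key=value VERSION files, so something like the sketch below can print both (the paths and class name are placeholders for the dfs.name.dir and dfs.data.dir actually in use; this only reads, it does not fix anything):

    import java.io.File;
    import java.io.FileInputStream;
    import java.util.Properties;

    public class CompareNamespaceIds {
        public static void main(String[] args) throws Exception {
            // Placeholder paths: point these at your actual dfs.name.dir and dfs.data.dir.
            File nameVersion = new File("/tmp/hadoop-mark/dfs/name/current/VERSION");
            File dataVersion = new File("/tmp/hadoop-mark/dfs/data/current/VERSION");

            System.out.println("namenode namespaceID = " + readId(nameVersion));
            System.out.println("datanode namespaceID = " + readId(dataVersion));
        }

        // The VERSION file is plain key=value text, so java.util.Properties can read it.
        private static String readId(File versionFile) throws Exception {
            Properties props = new Properties();
            FileInputStream in = new FileInputStream(versionFile);
            props.load(in);
            in.close();
            return props.getProperty("namespaceID");
        }
    }

If the two IDs really do differ, the usual cure on a test cluster is to clear the datanode's data directory (or reconcile the IDs) and restart the daemons, but only after making sure nothing in that directory is still needed.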
Re: I can't see this email ... So to clarify ..
Do you have the right permissions on the new dirs? Try stopping and starting the cluster...

-JJ

On May 24, 2011, at 9:13 PM, Mark question markq2...@gmail.com wrote: Well, you're right ... moving it to hdfs-site.xml had an effect at least. But now I'm hitting the incompatible namespaceID error:

WARN org.apache.hadoop.hdfs.server.common.Util: Path /tmp/hadoop-mark/dfs/data should be specified as a URI in configuration files. Please update hdfs configuration.
java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-maha/dfs/data

My configuration for this part in hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/tmp/hadoop-mark/dfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/tmp/hadoop-mark/dfs/name</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/cs/student/mark/tmp/hodhod</value>
  </property>
</configuration>

The reason why I want to change hadoop.tmp.dir is that the student quota under /tmp is small, so I wanted to mount hadoop.tmp.dir on /cs/student instead.

Thanks, Mark

On Tue, May 24, 2011 at 7:25 PM, Joey Echeverria j...@cloudera.com wrote: Try moving the configuration to hdfs-site.xml. One word of warning: if you use /tmp to store your HDFS data, you risk data loss. On many operating systems, files and directories in /tmp are automatically deleted. -Joey

On Tue, May 24, 2011 at 10:22 PM, Mark question markq2...@gmail.com wrote: Hi guys, I'm using an NFS cluster consisting of 30 machines, but I have specified only 3 of the nodes to be my Hadoop cluster. So my problem is this: the datanode won't start on one of the nodes because of the following error: org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage /cs/student/mark/tmp/hodhod/dfs/data. The directory is already locked. I think it's because of NFS: once one node locks the directory, the second node can't lock it. So I had to change the following configuration: dfs.data.dir to be /tmp/hadoop-user/dfs/data. But this configuration is overridden by ${hadoop.tmp.dir}/dfs/data, where my hadoop.tmp.dir = /cs/student/mark/tmp, as you might guess from above. Where is this configuration overridden? I thought my core-site.xml had the final configuration values. Thanks, Mark

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
LeaseExpirationException and 'leaseholder failing to recreate file': Could anything be done at run-time?
Hi All,

I am running a process to extract feature vectors from images and write them as SequenceFiles on HDFS. My dataset of images is very large (~46K images). The writing process worked fine for half of the run, but all of a sudden the following problem occurred:

org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /mnt/tmp/sirs-dataset-k1/feature-repo/features/109817 for DFSClient_148861898 on client 10.118.177.84, because current leaseholder is trying to recreate file.

On investigating, I found that the errors started appearing after a LeaseExpiredException:

org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /mnt/tmp/sirs-dataset-k1/feature-repo/features/109817 File is not open for writing. [Lease. Holder: DFSClient_148861898, pendingcreates: 1]
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /mnt/tmp/sirs-dataset-k1/feature-repo/features/109817 File is not open for writing. [Lease. Holder: DFSClient_148861898, pendingcreates: 1]

The process has already taken me 18-19 hrs and it would be very tough for me to restart the whole thing. Is there anything that can be done to fix it at run-time? (Maybe force-deleting the concerned file '/mnt/tmp/sirs-dataset-k1/feature-repo/features/109817' on HDFS?)

Regards,
Lokendra

*Detailed Log:*

2011-05-25 04:03:32,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 54310, call addBlock(/mnt/tmp/sirs-dataset-k1/feature-repo/features/109817, DFSClient_148861898) from 10.118.177.84:48372: error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /mnt/tmp/sirs-dataset-k1/feature-repo/features/109817 File is not open for writing. [Lease.
Holder: DFSClient_148861898, pendingcreates: 1]
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /mnt/tmp/sirs-dataset-k1/feature-repo/features/109817 File is not open for writing. [Lease. Holder: DFSClient_148861898, pendingcreates: 1]
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1340)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1323)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1251)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
    at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
2011-05-25 04:03:32,175 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_-4965605132591592561 is added to invalidSet of 10.118.177.84:50010
2011-05-25 04:03:32,207 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=johndoe,johndoe ip=/10.118.177.84 cmd=delete src=/mnt/tmp/sirs-dataset-k1/feature-repo/imageList dst=null perm=null
2011-05-25 04:03:32,212 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=johndoe,johndoe ip=/10.118.177.84 cmd=create src=/mnt/tmp/sirs-dataset-k1/feature-repo/imageList dst=null perm=johndoe:supergroup:rw-r--r--
2011-05-25 04:03:32,215 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /mnt/tmp/sirs-dataset-k1/feature-repo/imageList. blk_6557263107434203565_332695
2011-05-25 04:03:32,695 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 10.118.177.84:50010 storage DS-199406591-10.118.177.84-50010-1306165949296
2011-05-25 04:03:32,696 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/10.118.177.84:50010
2011-05-25 04:03:32,696 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/10.118.177.84:50010
2011-05-25 04:03:33,045 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.100.245.5:50010 is added to blk_6557263107434203565_332695 size 11746349
2011-05-25 04:03:33,045 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /mnt/tmp/sirs-dataset-k1/feature-repo/imageList is closed by DFSClient_148861898
2011-05-25 04:03:33,404 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=johndoe,johndoe ip=/10.118.177.84 cmd=delete src=/mnt/tmp/sirs-dataset-k1/feature-repo/features/109817 dst=null perm=null
2011-05-25 04:03:33,405 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=johndoe,johndoe ip=/10.118.177.84 cmd=create
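(The detailed log is cut off here in the archive.) One observation on the first error: "because current leaseholder is trying to recreate file" means the very same DFSClient that still holds the lease on that path issued another create for it, which usually traces back to a SequenceFile writer that was never closed, or a retry loop re-opening the same output path. A minimal sketch of a per-file write pattern that releases the lease promptly, assuming feature vectors are stored as BytesWritable values (the class name, key scheme and literal path are placeholders, not code from the thread):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class FeatureWriterSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Placeholder output path and payload; the real code would loop over images.
            Path out = new Path("/mnt/tmp/feature-repo/features/109817");
            byte[] featureVector = new byte[] {1, 2, 3};

            SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
            try {
                writer.append(new Text("image-109817"), new BytesWritable(featureVector));
            } finally {
                // Closing releases the HDFS lease; a writer left open (or a retry that
                // re-creates the same path from the same client) is what produces
                // "current leaseholder is trying to recreate file".
                writer.close();
            }
        }
    }

Closing each writer before creating the next file means a retry of the same path does not collide with a lease the client itself is still holding.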