How to compile HBase code ?
Hello guys, in case any of you are working on HBase: I just wrote a program by reading some tutorials, but nowhere is it mentioned how to run code against HBase. If anyone of you has done some coding on HBase, can you please tell me how to run it? I am able to compile my code by adding hbase-core.jar and hadoop-core.jar to the classpath while compiling, but I am not able to figure out how to run it. Whenever I run java ExampleClient (which is my HBase program), I get the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration
        at ExampleClient.main(ExampleClient.java:20)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.HBaseConfiguration
        at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
        ... 1 more

Thanks,
Praveenesh
AW: How to compile HBase code ?
How do you execute the client (command line)? Do you use the java or the hadoop command? It seems that there is an error in your classpath when running the client. The classpath used when compiling the classes that implement the client is different from the classpath when your client is executed, since hadoop and hbase carry their own environment. Maybe the following link helps: http://hbase.apache.org/docs/current/api/org/apache/hadoop/hbase/mapreduce/package-summary.html#classpath

regards
Christian
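The distinction matters because plain "java ExampleClient" only sees the jars you pass yourself, while the hadoop launcher adds its own jars and configuration to the classpath. A rough, untested sketch of the second style (the jar path is an assumption based on a typical install layout; the current directory also has to be on the classpath so the ExampleClient class itself can be found):

    $ javac -classpath /usr/local/hadoop/hbase/hbase/hbase-0.20.6.jar:/usr/local/hadoop/hadoop/hadoop-0.20.2-core.jar ExampleClient.java
    $ HADOOP_CLASSPATH=.:/usr/local/hadoop/hbase/hbase/hbase-0.20.6.jar hadoop ExampleClient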
Re: How to compile HBase code ?
I am simply using the HBase API, not doing any MapReduce work on it. Following is the code I have written, simply creating a table on HBase:

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ExampleClient {
  public static void main(String[] args) throws IOException {
    HBaseConfiguration config = new HBaseConfiguration();
    HBaseAdmin admin = new HBaseAdmin(config);
    HTableDescriptor htd = new HTableDescriptor("test");
    HColumnDescriptor hcd = new HColumnDescriptor("data");
    htd.addFamily(hcd);
    admin.createTable(htd);
    byte[] tablename = htd.getName();
    HTableDescriptor[] tables = admin.listTables();
    if (tables.length != 1 && Bytes.equals(tablename, tables[0].getName())) {
      throw new IOException("Failed to create table");
    }
    HTable table = new HTable(config, tablename);
    byte[] row1 = Bytes.toBytes("row1");
    Put p1 = new Put(row1);
    byte[] databytes = Bytes.toBytes("data");
    p1.add(databytes, Bytes.toBytes("1"), Bytes.toBytes("value1"));
    table.put(p1);
    Get g = new Get(row1);
    Result result = table.get(g);
    System.out.println("Get : " + result);
    Scan scan = new Scan();
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result scannerResult : scanner) {
        System.out.println("Scan : " + scannerResult);
      }
    } catch (Exception e) {
      e.printStackTrace();
    } finally {
      scanner.close();
    }
    table.close();
  }
}

Now I have set the classpath variable in /etc/environment as

MYCLASSPATH=/usr/local/hadoop/hadoop/hadoop-0.20.2-core.jar:/usr/local/hadoop/hbase/hbase/hbase-0.20.6.jar:/usr/local/hadoop/hbase/hbase/lib/zookeeper-3.2.2.jar

and I am compiling my code with the javac command:

$ javac -classpath $MYCLASSPATH ExampleClient.java

It is working fine. While running, I am using the java command:

$ java -classpath $MYCLASSPATH ExampleClient

and then I get the following error:

Exception in thread "main" java.lang.NoClassDefFoundError: ExampleClient
Caused by: java.lang.ClassNotFoundException: ExampleClient
        at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:321)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:266)
Could not find the main class: ExampleClient. Program will exit.

But I am running the code from the same location, and ExampleClient.class exists at that location.
AW: How to compile HBase code ?
Are you sure that the directory where your ExampleClient.class is located is part of MYCLASSPATH?

regards
Christian
Re: How to compile HBase code ?
Praveenesh,

HBase has its own user mailing lists where such queries ought to go. I am moving the discussion to u...@hbase.apache.org and bcc-ing common-user@ here. Also added you to cc.

Regarding your first error: going forward you can use the handy `hbase classpath` command to generate the HBase-provided classpath list for you automatically. Something like:

$ MYCLASSPATH=`hbase classpath`

Regarding the second, latest one as below: your ExampleClient.class isn't on MYCLASSPATH (nor is the directory it is under, i.e. '.'), so Java can't find it. This is not an HBase issue.

HTH.
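Concretely, the run command just needs the directory holding ExampleClient.class on it as well, e.g. (untested, reusing the MYCLASSPATH from your mail):

    $ javac -classpath $MYCLASSPATH ExampleClient.java
    $ java -classpath .:$MYCLASSPATH ExampleClient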
Re: How to compile HBase code ?
Hey Harsh,

Actually I mailed the HBase mailing list also, but since I wanted to get this done as soon as possible I mailed this group as well. Anyway, I will take care of that in future, although I got more responses on this mailing list :-)

Anyway, the problem is solved. What I did is add the folder containing my .class file to the classpath, along with commons-logging-1.0.4.jar and log4j-1.2.15.jar, so now my MYCLASSPATH variable looks like:

MYCLASSPATH=/usr/local/hadoop/hadoop/hadoop-0.20.2-core.jar:/usr/local/hadoop/hbase/hbase/hbase-0.20.6.jar:/usr/local/hadoop/hbase/hbase/lib/zookeeper-3.2.2.jar:/usr/local/hadoop/hbase/hbase/lib/commons-logging-1.0.4.jar:/usr/local/hadoop/hbase/hbase/lib/log4j-1.2.15.jar:/usr/local/hadoop/hbase/

and then I used

$ java -classpath $MYCLASSPATH ExampleClient

Now it's running. Thanks!!!

Praveenesh
Re: How to compile HBase code ?
Praveenesh,

Good to know your problem is resolved. You can also use the `bin/hbase classpath` utility to generate the HBase parts of the classpath automatically in the future, instead of adding jars manually - it saves you time.

--
Harsh J
Re: How to compile HBase code ?
Hey Harsh,

I tried that, but it's not working. I am using HBase 0.20.6; there is no such command as bin/hbase classpath:

hadoop@ub6:/usr/local/hadoop/hbase$ hbase
Usage: hbase <command>
where <command> is one of:
  shell            run the HBase shell
  master           run an HBase HMaster node
  regionserver     run an HBase HRegionServer node
  rest             run an HBase REST server
  thrift           run an HBase Thrift server
  zookeeper        run a Zookeeper server
  migrate          upgrade an hbase.rootdir
 or
  CLASSNAME        run the class named CLASSNAME

Thanks,
Praveenesh
Re: How to compile HBase code ?
Praveenesh,

Ah yes, it would not work on the older 0.20.x releases; the command exists in the current HBase release.

--
Harsh J
Simple change to WordCount either times out or runs 18+ hrs with little progress
I am attempting to familiarize myself with Hadoop and with utilizing MapReduce in order to process system log files. I tried to start small with a simple map-reduce program similar to the word count example provided. For each line read in, I wanted to grab the 5th word as my output key and the constant 1 as my output value. This seemed simple enough, but it would consistently time out during mapping. I then attempted to run the WordCount example on my data to see if the data was the problem. It was not, as the WordCount example quickly finished with accurate results.

I then took the WordCount example and added a counter to the map so that it would only output the 5th word in the line. When I ran this, it ran for 18+ hrs with little to no progress. I tried a programmatically identical way of getting the 5th word, and it once again timed out. Any help would be appreciated.

I am running the pseudo-distributed layout described by the Quickstart on a Windows XP machine running Cygwin. I am working on hadoop-0.21.0. I have verified that I can run the examples provided and that my nodes and trackers are running properly.

I took the WordCount example code described here:
http://code.google.com/p/hop/source/browse/trunk/src/examples/org/apache/hadoop/examples/WordCount.java?r=1027
and changed the Map function to:

public static class MapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    int count = 0;
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      if (count == 5) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
      count++;
    }
  }
}

After 18 hrs 35 min this had the map 0.55% complete. There were no issues in the logs or on the command line. Running this program without the count variable maps in less than a minute on the same data. When I changed it to call itr.nextToken() 4 times before calling it a 5th time to set the word, it timed out. I previously verified that the data always has more than 5 tokens per line. My similar program which timed out regularly used the split function on my delimiter to pull out the 5th word.

Thank you for your help!
- Maryanne DellaSalla
Re: Simple change to WordCount either times out or runs 18+ hrs with little progress
itr.nextToken() is inside the if.

On Tue, May 24, 2011 at 7:29 AM, maryanne.dellasa...@gdc4s.com wrote:

    while (itr.hasMoreTokens()) {
      if (count == 5) {
        word.set(itr.nextToken());
        output.collect(word, one);
      }
      count++;
    }
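In other words, the tokenizer is only advanced when count == 5; on every other pass the loop consumes nothing and spins forever on the same line. A minimal sketch of the intended logic, using the same fields as the posted mapper (untested):

    int count = 0;
    while (itr.hasMoreTokens()) {
      String token = itr.nextToken();   // consume a token on every iteration
      if (count == 5) {
        word.set(token);                // emit only the word at position 5
        output.collect(word, one);
        break;                          // nothing else is needed from this line
      }
      count++;
    }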
question about BlockLocation setHosts
Hi all,

I have a question regarding the setHosts method of the BlockLocation class in Hadoop HDFS. Does this cause the block in question to be moved to the specified hosts? Furthermore, where does the getHosts method of BlockLocation get the host names from?

Thanks,
George
RE: Simple change to WordCount either times out or runs 18+ hrs with little progress
Ahh, well that's embarrassing, and it explains the situation where it runs for many hours. I am still baffled as to the split-on-delimiter version timing out, though:

    String line = value.toString();
    String[] splitLine = line.split(",");
    if (splitLine.length >= 5) {
      word.set(splitLine[4]);
      output.collect(word, one);
    }

This runs and times out on map every time. Thanks.

Maryanne DellaSalla
Re: tips and tools to optimize cluster
Worth a look at OpenTSDB ( http://opentsdb.net/ ) as it doesn't lose precision on the historical data. It also has some neat tricks around the collection and display of data.

Another useful tool is 'collectl' ( http://collectl.sourceforge.net/ ), which is a lightweight Perl script that both captures and compresses the metrics, manages its metrics data files, and then filters and presents the metrics as requested. I find collectl lightweight and useful enough that I set it up to capture everything and then leave it running in the background on most systems I build, because when you need the measurement data the event is usually in the past and difficult to reproduce. With collectl running I have a week to recognise the event and analyse/save the relevant data file(s); the data files are approx. 21MB/node/day gzipped.

With a little bit of bash or awk or perl scripting you can convert the collectl output into a form easily loadable into Pig. Pig also has User Defined Functions (UDFs) that can import the Hadoop job history, so with some Pig Latin you can marry your infrastructure metrics with your job metrics; a bit like the cluster eating its own dog food.

BTW, watch out for a little gotcha with Ganglia. It doesn't seem to report the full JVM metrics via gmond, although if you output the JVM metrics to file you get a record for each JVM on the node. I haven't looked into it in detail yet, but it looks like Ganglia only reports the last JVM record in each batch. Anyone else seen this?

Chris

On 24 May 2011 01:48, Tom Melendez t...@supertom.com wrote:

Hi Folks,

I'm looking for tips, tricks and tools to get at node utilization to optimize our cluster. I want to answer questions like:
- what nodes ran a particular job?
- how long did it take for those nodes to run the tasks for that job?
- how/why did Hadoop pick those nodes to begin with?

More detailed questions like:
- how much memory did the task for the job use on that node?
- average CPU load on that node during the task run

And more aggregate questions like:
- are some nodes favored more than others?
- utilization averages (generally, how many cores on that node are in use, etc.)

There are plenty more that I'm not asking, but you get the point.

So, what are you guys using for this? I see some mentions of Ganglia, so I'll definitely look into that. Anything else? Anything you're using to monitor in real-time (like a 'top' across the nodes or something like that)?

Any info or war-stories greatly appreciated.

Thanks,
Tom
Re: get name of file in mapper output directory
Thanks both for the comments. Even though I finally managed to get the output file of the current mapper, I couldn't use it, because apparently mappers write to a _temporary location while in progress. So in Mapper.close, the file (e.g. part-00000) which it wrote to does not exist yet. There has to be another way to get the produced file; I need to sort it immediately within the mappers. Again, your thoughts are really helpful!

Mark

On Mon, May 23, 2011 at 5:51 AM, Luca Pireddu pire...@crs4.it wrote:

The path is defined by the FileOutputFormat in use. In particular, I think this function is responsible:
http://hadoop.apache.org/common/docs/r0.20.2/api/org/apache/hadoop/mapreduce/lib/output/FileOutputFormat.html#getDefaultWorkFile(org.apache.hadoop.mapreduce.TaskAttemptContext, java.lang.String)
It should give you the file path before all tasks have completed and the output is committed to the final output path.

Luca

On May 23, 2011 14:42:04 Joey Echeverria wrote:

Hi Mark,

FYI, I'm moving the discussion over to mapreduce-u...@hadoop.apache.org since your question is specific to MapReduce.

You can derive the output name from the TaskAttemptID, which you can get by calling getTaskAttemptID() on the context passed to your cleanup() function. The task attempt id will look like this:

attempt_200707121733_0003_m_000005_0

You're interested in the m_000005 part; this gets translated into the output file name part-m-00005.

-Joey

On Sat, May 21, 2011 at 8:03 PM, Mark question markq2...@gmail.com wrote:

Hi, I'm running a job with maps only, and I want at the end of each map (i.e. in the close() function) to open the file that the current map has written using its output.collector. I know job.getWorkingDirectory() would give me the parent path of the file written, but how do I get the full path or the name (i.e. part-00000 or part-00001)?

Thanks,
Mark
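For reference, Joey's suggestion above boils down to something like the following inside the mapper (new mapreduce API assumed; untested sketch - it only builds the final file name, and as noted the data may still sit under the job's _temporary directory until the task commits):

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
      // e.g. attempt_200707121733_0003_m_000005_0 -> task number 5
      int taskId = context.getTaskAttemptID().getTaskID().getId();
      String outputName = String.format("part-m-%05d", taskId);  // e.g. part-m-00005
    }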
Processing xml files
I just started learning Hadoop and got done with the WordCount MapReduce example. I also briefly looked at Hadoop streaming. Some questions:

1) What should be my first step now? Are there more examples somewhere that I can try out?
2) The second question is around practical usability using XML files. Our XML files are not big, they are around 120k in size, but Hadoop is really meant for big files, so how do I go about processing these XML files?
3) Are there any samples or advice on how to process XML files?

Looking for help and pointers.
EC2 cloudera cc1.4xlarge
Hello,

I want to use a cc1.4xlarge cluster for some data processing; to spin up clusters I am using the Cloudera scripts. hadoop-ec2-init-remote.sh has default configuration up to c1.xlarge but no configuration for cc1.4xlarge. Can someone give the formula for how these values are calculated based on the hardware?

C1.XLARGE
MAX_MAP_TASKS=8        -> mapred.tasktracker.map.tasks.maximum
MAX_REDUCE_TASKS=4     -> mapred.tasktracker.reduce.tasks.maximum
CHILD_OPTS=-Xmx680m    -> mapred.child.java.opts
CHILD_ULIMIT=1392640   -> mapred.child.ulimit

I am guessing, but I think CHILD_OPTS = (total RAM on the box - 1 GB) / (MAX_MAP_TASKS, MAX_REDUCE_TASKS). But I am not sure how to calculate the rest.

Regards,
Aleksandr
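For example, applying that guess to the cc1.4xlarge (23 GB of RAM), and reading the denominator as the total number of child slots (my assumption), with say 16 map and 8 reduce slots I would get roughly:

    (23 GB - 1 GB) / (16 + 8) ~ 0.9 GB per child  ->  something like -Xmx900m

but I am not sure that is the right way to read it.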
Re: Processing xml files
Hello,

We have the same type of data; we currently convert it to a tab-delimited file and use it as input for streaming.

Regards,
Aleksandr
Re: Processing xml files
On Tue, May 24, 2011 at 4:25 PM, Aleksandr Elbakyan ramal...@yahoo.com wrote:
> We have the same type of data, we currently convert it to a tab-delimited file and use it as input for streaming.

Can you please give more info? Do you append the data from multiple XML files as lines into one file, or some other way? If so, how big do you let the files get? How do you create these files, assuming your XML is stored somewhere else in a DB or filesystem - read them one by one? What are your experiences using text files instead of XML? Is there a reason why XML files can't or shouldn't be used directly in Hadoop? Any performance implications? Any readings suggested in this area?

Our XML is something like:

<column id="Name" security="sensitive" xsi:type="Text">
  <value>free a last</value>
</column>
<column id="age" security="no" xsi:type="Text">
  <value>40</value>
</column>

And we would, for example, want to know how many customers are above a certain age, or of a certain age with a certain income, etc.

Sorry for all the questions. I am new and trying to get a grasp, and also learn how I would actually solve our use case.
Re: Sorting ...
Thanks Luca, but what other way is there to sort a directory of sequence files? I don't plan to write a sorting algorithm in the mappers/reducers; I was hoping to use SequenceFile.Sorter instead. Any ideas?

Mark

On Mon, May 23, 2011 at 12:33 AM, Luca Pireddu pire...@crs4.it wrote:

On May 22, 2011 03:21:53 Mark question wrote:
> I'm trying to sort Sequence files using the Hadoop-Example TeraSort. But after taking a couple of minutes, the output is empty. <snip> I'm trying to find what the input format for the TeraSort is, but it is not specified. Thanks for any thought, Mark

TeraSort sorts lines of text. The InputFormat (for version 0.20.2) is in hadoop-0.20.2/src/examples/org/apache/hadoop/examples/terasort/TeraInputFormat.java. The documentation at the top of the class says: "An input format that reads the first 10 characters of each line as the key and the rest of the line as the value."

HTH

Luca
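If SequenceFile.Sorter is the route, a rough sketch (untested; classes from org.apache.hadoop.conf, org.apache.hadoop.fs and org.apache.hadoop.io; the key/value classes here are assumptions - use whatever the files were actually written with):

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Sorter sorter =
        new SequenceFile.Sorter(fs, Text.class, Text.class, conf);
    Path[] inputs = FileUtil.stat2Paths(fs.listStatus(new Path("seq-input-dir")));
    sorter.sort(inputs, new Path("seq-sorted/part-00000"), false);  // false = keep the inputs

Note this runs in a single JVM, so it only makes sense when the data fits comfortably on one machine; for anything larger a MapReduce sort (like TeraSort) is the usual answer.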
Re: tips and tools to optimize cluster
Thanks Chris, these are quite helpful.

Thanks,
Tom
Re: Processing xml files
> Can you please give more info?

We currently have an off-Hadoop process which uses a Java XML parser to convert it to flat files. We have files from a couple of KB to 10s of GB.

> Do you append multiple xml files data as a line into one file? Or some other way? If so then how big do you let files be?

We currently feed our process a folder with the converted files. We don't size it in any way; we let Hadoop handle it.

> How do you create these files assuming your xml is stored somewhere else in the DB or filesystem? Read them one by one? What are your experiences using text files instead of xml?

If you are using a streaming job it is easier to build your logic if you have one file. You can actually try to parse the XML in your mapper and convert it for the reducer, but why don't you just write a small app which converts it?

> Reason why xml files can't be directly used in hadoop or shouldn't be used? Any performance implications?

If you are using Pig there is an XML reader:
http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/XMLLoader.html
If you have a well-defined schema it is easier to work with big data :)

> Any readings suggested in this area?

Try looking into Pig; it has lots of useful stuff which will make your experience with Hadoop nicer.

> And we would for eg want to know how many customers above certain age or certain age with certain income etc.

Hadoop has built-in counters. Did you look into the word count example from Hadoop?

Regards,
Aleksandr
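PS: the "small app which converts it" can be very small. A rough sketch against the sample <column> records earlier in the thread (untested; it assumes the columns are wrapped in a single root element and just emits the <value> texts as one tab-separated line):

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class XmlToTsv {
      public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder().parse(new File(args[0]));
        NodeList columns = doc.getElementsByTagName("column");
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < columns.getLength(); i++) {
          Element col = (Element) columns.item(i);
          String value = col.getElementsByTagName("value").item(0).getTextContent();
          if (i > 0) line.append('\t');
          line.append(value);            // e.g. "free a last", then "40"
        }
        System.out.println(line);        // one tab-delimited record per input file
      }
    }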
Re: EC2 cloudera cc1.4xlarge
I looked into different cluster configurations from Cloudera and came up with these numbers; let me know what you think...

Machine: 23 GB of memory, 33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core "Nehalem" architecture), 1690 GB of instance storage, 64-bit platform, I/O performance: very high (10 Gigabit Ethernet). API name: cc1.4xlarge

MAX_MAP_TASKS=16       -> mapred.tasktracker.map.tasks.maximum
MAX_REDUCE_TASKS=8     -> mapred.tasktracker.reduce.tasks.maximum
CHILD_OPTS=-Xmx1024m   -> mapred.child.java.opts
CHILD_ULIMIT=1392640   -> mapred.child.ulimit

Regards,
Aleksandr
Checkpoint vs Backup Node
As far as my understanding goes, I feel that the Backup node is much more efficient than the Checkpoint node, as it also has a current (up-to-date) copy of the file system. I do not understand what the use case would be (in a production environment) in which someone would prefer a Checkpoint node over a Backup node. Or I should ask: what do people generally prefer of the two, and why?
Re: Processing xml files
Thanks, some more questions :)

On Tue, May 24, 2011 at 4:54 PM, Aleksandr Elbakyan ramal...@yahoo.com wrote:

> We currently have an off-Hadoop process which uses a Java XML parser to convert it to flat files. We have files from a couple of KB to 10s of GB.

Do you convert it into a flat file and write it to HDFS? Do you write all the files to the same directory in DFS, or do you group directories based on days, for example? So, say, 2011/01/01 contains 10 files; store the results of those 10 files somewhere, and then on 2011/02/02 store another, say, 20 files. Now analyze the 20 files and use the results from the earlier 10 files to do the aggregation. If so, how do you do it? Or how should I do it, since it would be overhead to process those files again? Please point me to examples so that you don't have to teach me Hadoop or Pig processing :)

> We currently feed our process a folder with converted files. We don't size it in any way; we let Hadoop handle it.

I hadn't thought about that. I was just thinking in terms of using big files. So when using small files, Hadoop will automatically distribute the files across the cluster, I am assuming based on some hashing.

> If you are using a streaming job it is easier to build your logic if you have one file ... If you are using Pig there is an XML reader: http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/XMLLoader.html

Which one is better, converting files to flat files or using XML as-is? How do I make that decision?

> Try looking into Pig; it has lots of useful stuff which will make your experience with Hadoop nicer.

I will download the Pig tutorial and see how that works. Is there any other XML-related example you can point me to? Thanks a lot!
Re: Checkpoint vs Backup Node
Hi Sulabh,

Neither of these nodes has been productionized, so I don't think anyone will have a good answer for you about what works in production. They are only available in 0.21 and haven't had any substantial QA.

One of the potential issues with the BN is that it can delay the logging of edits by the primary NN if the BN were to hang or go offline. The CN would not have such an issue.

-Todd

--
Todd Lipcon
Software Engineer, Cloudera
Re: EC2 cloudera cc1.4xlarge
Try the Cloudera-specific lists with your questions.

--
Take care,
Konstantin (Cos) Boudnik
Re: Processing xml files
Hello,

We currently have a complicated process which has more than 20 jobs piped into each other. We are using a shell script to control the flow; I saw some other company using Spring Batch. We use Pig, streaming and Hive.

Note one thing: if you are using EC2 for your jobs, all local files need to be stored in /mnt.

Currently our cluster is organized this way in HDFS: we process our data hourly and rotate the final result back to the beginning of the pipeline for the next run. Each process's output is the next process's input, so we keep all data for the current execution in the same dated folder. If you run daily it will be e.g. 20111212, if hourly 201112121416, with a subfolder for each subprocess in it. Example:

/user/{domain}/{date}/input
/user/{domain}/{date}/process1
/user/{domain}/{date}/process2
/user/{domain}/{date}/process3
/user/{domain}/{date}/process4

Our process1 takes as input the newly converted files for the current run plus the output from the last run. After we start the job we load the converted files into the input location and move them out of the local space so we will not reprocess them.

I am not sure there are examples for this; it all depends on the architecture of the project you are doing. I bet if you put everything you need to do on a whiteboard you will find the best folder structure for yourself :)

Regards,
Aleksandr
If you have a well-defined schema it is easier to work with big data :)

Any readings suggested in this area? Try looking into Pig; it has lots of useful stuff which will make your experience with Hadoop nicer.

I will download the Pig tutorial and see how that works. Are there any other XML-related examples you can point me to? Thanks a lot! Our XML is something like:

<column id="Name" security="sensitive" xsi:type="Text"><value>free a last</value></column>
<column id="age" security="no" xsi:type="Text"><value>40</value></column>

And we would, for example, want to know how many customers are above a certain age, or of a certain age with a certain income, etc.

Hadoop has built-in counters; did you look into the word count example from Hadoop?

Regards, Aleksandr

--- On Tue, 5/24/11, Mohit Anchlia mohitanch...@gmail.com wrote: From: Mohit Anchlia mohitanch...@gmail.com Subject: Re: Processing xml files To: common-user@hadoop.apache.org Date: Tuesday, May 24, 2011, 4:41 PM

On Tue, May 24, 2011 at 4:25 PM, Aleksandr Elbakyan ramal...@yahoo.com wrote: Hello, We have the same type of data; we currently convert it to a tab-delimited file and use it as input for streaming.

Can you please give more info? Do you append data from multiple XML files as lines into one file, or some other way? If so, how big do you let the files get? How do you create these files, assuming your XML is stored somewhere else in
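Since the thread mentions both "write a small app which converts it" and the <column>/<value> layout quoted above, here is a minimal sketch of such a converter, run outside Hadoop, assuming records shaped like that example (the class name, the wrapping <record> element and the field order are placeholders of mine, not anything from the original mails):

    import java.io.StringReader;
    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;
    import org.xml.sax.InputSource;

    public class ColumnXmlToTsv {
        public static void main(String[] args) throws Exception {
            // Hypothetical record wrapped in a root element so it parses as a document;
            // the real files presumably have their own enclosing record element.
            String record =
                "<record xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">"
              + "<column id=\"Name\" security=\"sensitive\" xsi:type=\"Text\"><value>free a last</value></column>"
              + "<column id=\"age\" security=\"no\" xsi:type=\"Text\"><value>40</value></column>"
              + "</record>";

            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(record)));

            // Collect each column's value in document order and join with tabs,
            // which is the flat form a streaming mapper or Pig script can split cheaply.
            NodeList columns = doc.getElementsByTagName("column");
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < columns.getLength(); i++) {
                Element col = (Element) columns.item(i);
                String value = col.getElementsByTagName("value").item(0).getTextContent();
                if (i > 0) line.append('\t');
                line.append(value);
            }
            System.out.println(line); // prints: free a last<TAB>40
        }
    }

Once each record is one tab-delimited line, the "how many customers above a certain age" question becomes a simple streaming or Pig filter, or a mapper that bumps a Hadoop counter whenever the age field exceeds the threshold, much like the word count example.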
Cannot lock storage, directory is already locked
Hi guys, I'm using an NFS cluster consisting of 30 machines, but I have specified only 3 of the nodes to be my Hadoop cluster. So my problem is this: the datanode won't start on one of the nodes because of the following error:

org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage /cs/student/mark/tmp/hodhod/dfs/data. The directory is already locked

I think it's because of NFS: once one node locks the directory, the second node can't lock it. Any ideas on how to solve this error?

Thanks, Mark
I can't see this email ... So to clarify ..
Hi guys, I'm using an NFS cluster consisting of 30 machines, but I have specified only 3 of the nodes to be my Hadoop cluster. So my problem is this: the datanode won't start on one of the nodes because of the following error:

org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage /cs/student/mark/tmp/hodhod/dfs/data. The directory is already locked

I think it's because of NFS: once one node locks the directory, the second node can't lock it. So I had to change the following configuration: dfs.data.dir to be /tmp/hadoop-user/dfs/data. But this configuration is overridden by ${hadoop.tmp.dir}/dfs/data, where my hadoop.tmp.dir = /cs/student/mark/tmp, as you might guess from above.

Where is this configuration overridden? I thought my core-site.xml had the final configuration values.

Thanks, Mark
Re: I can't see this email ... So to clarify ..
Try moving the configuration to hdfs-site.xml.

One word of warning: if you use /tmp to store your HDFS data, you risk data loss. On many operating systems, files and directories in /tmp are automatically deleted.

-Joey

On Tue, May 24, 2011 at 10:22 PM, Mark question markq2...@gmail.com wrote: Hi guys, I'm using an NFS cluster consisting of 30 machines, but I have specified only 3 of the nodes to be my Hadoop cluster. So my problem is this: the datanode won't start on one of the nodes because of the following error: org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage /cs/student/mark/tmp/hodhod/dfs/data. The directory is already locked. I think it's because of NFS: once one node locks the directory, the second node can't lock it. So I had to change the following configuration: dfs.data.dir to be /tmp/hadoop-user/dfs/data. But this configuration is overridden by ${hadoop.tmp.dir}/dfs/data, where my hadoop.tmp.dir = /cs/student/mark/tmp, as you might guess from above. Where is this configuration overridden? I thought my core-site.xml had the final configuration values. Thanks, Mark

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
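If it helps to see where the substitution happens: the shipped default for dfs.data.dir is ${hadoop.tmp.dir}/dfs/data, and Hadoop's Configuration expands that variable at lookup time, so a value set only through hadoop.tmp.dir silently follows it onto the shared NFS path. A small sketch that just makes the resolution visible (the file paths and class name are placeholders; the HDFS daemons load these resources themselves):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;

    public class ShowDataDirResolution {
        public static void main(String[] args) {
            // Loads core-default.xml and core-site.xml from the classpath.
            Configuration conf = new Configuration();

            // hdfs-default.xml defines dfs.data.dir as ${hadoop.tmp.dir}/dfs/data,
            // which is why changing hadoop.tmp.dir alone moves the data directory.
            // Paths below are placeholders for wherever your conf files live.
            conf.addResource(new Path("/path/to/conf/hdfs-default.xml"));
            conf.addResource(new Path("/path/to/conf/hdfs-site.xml"));

            System.out.println("hadoop.tmp.dir = " + conf.get("hadoop.tmp.dir"));
            System.out.println("dfs.data.dir   = " + conf.get("dfs.data.dir"));
        }
    }

With dfs.data.dir set explicitly in hdfs-site.xml to a node-local directory, each datanode gets its own storage path and the NFS lock collision described above should go away.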
Re: I can't see this email ... So to clarify ..
Well, you're right ... moving it to hdfs-site.xml had an effect at least. But now I'm hitting the incompatible namespaceID error:

WARN org.apache.hadoop.hdfs.server.common.Util: Path /tmp/hadoop-mark/dfs/data should be specified as a URI in configuration files. Please update hdfs configuration.
java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-maha/dfs/data

My configuration for this part in hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/tmp/hadoop-mark/dfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/tmp/hadoop-mark/dfs/name</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/cs/student/mark/tmp/hodhod</value>
  </property>
</configuration>

The reason why I want to change hadoop.tmp.dir is that the student quota under /tmp is small, so I wanted to mount hadoop.tmp.dir on /cs/student instead.

Thanks, Mark

On Tue, May 24, 2011 at 7:25 PM, Joey Echeverria j...@cloudera.com wrote: Try moving the configuration to hdfs-site.xml. One word of warning: if you use /tmp to store your HDFS data, you risk data loss. On many operating systems, files and directories in /tmp are automatically deleted. -Joey

On Tue, May 24, 2011 at 10:22 PM, Mark question markq2...@gmail.com wrote: Hi guys, I'm using an NFS cluster consisting of 30 machines, but I have specified only 3 of the nodes to be my Hadoop cluster. So my problem is this: the datanode won't start on one of the nodes because of the following error: org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage /cs/student/mark/tmp/hodhod/dfs/data. The directory is already locked. I think it's because of NFS: once one node locks the directory, the second node can't lock it. So I had to change the following configuration: dfs.data.dir to be /tmp/hadoop-user/dfs/data. But this configuration is overridden by ${hadoop.tmp.dir}/dfs/data, where my hadoop.tmp.dir = /cs/student/mark/tmp, as you might guess from above. Where is this configuration overridden? I thought my core-site.xml had the final configuration values. Thanks, Mark

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
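The "Incompatible namespaceIDs" message usually means the ID recorded in the datanode's storage directory no longer matches the one in the (re-)formatted namenode's storage. If it is useful to confirm that before deciding what to wipe: the IDs live in plain key=value VERSION files, so something like the sketch below can print both (the paths and class name are placeholders for the dfs.name.dir and dfs.data.dir actually in use; this only reads, it does not fix anything):

    import java.io.File;
    import java.io.FileInputStream;
    import java.util.Properties;

    public class CompareNamespaceIds {
        public static void main(String[] args) throws Exception {
            // Placeholder paths: point these at your actual dfs.name.dir and dfs.data.dir.
            File nameVersion = new File("/tmp/hadoop-mark/dfs/name/current/VERSION");
            File dataVersion = new File("/tmp/hadoop-mark/dfs/data/current/VERSION");

            System.out.println("namenode namespaceID = " + readId(nameVersion));
            System.out.println("datanode namespaceID = " + readId(dataVersion));
        }

        // The VERSION file is plain key=value text, so java.util.Properties can read it.
        private static String readId(File versionFile) throws Exception {
            Properties props = new Properties();
            FileInputStream in = new FileInputStream(versionFile);
            props.load(in);
            in.close();
            return props.getProperty("namespaceID");
        }
    }

If the two IDs really do differ, the usual cure on a test cluster is to clear the datanode's data directory (or reconcile the IDs) and restart the daemons, but only after making sure nothing in that directory is still needed.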
Re: I can't see this email ... So to clarify ..
Do you have the right permissions on the new dirs? Try stopping and starting the cluster...

-JJ

On May 24, 2011, at 9:13 PM, Mark question markq2...@gmail.com wrote: Well, you're right ... moving it to hdfs-site.xml had an effect at least. But now I'm hitting the incompatible namespaceID error:

WARN org.apache.hadoop.hdfs.server.common.Util: Path /tmp/hadoop-mark/dfs/data should be specified as a URI in configuration files. Please update hdfs configuration.
java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-maha/dfs/data

My configuration for this part in hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/tmp/hadoop-mark/dfs/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/tmp/hadoop-mark/dfs/name</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/cs/student/mark/tmp/hodhod</value>
  </property>
</configuration>

The reason why I want to change hadoop.tmp.dir is that the student quota under /tmp is small, so I wanted to mount hadoop.tmp.dir on /cs/student instead.

Thanks, Mark

On Tue, May 24, 2011 at 7:25 PM, Joey Echeverria j...@cloudera.com wrote: Try moving the configuration to hdfs-site.xml. One word of warning: if you use /tmp to store your HDFS data, you risk data loss. On many operating systems, files and directories in /tmp are automatically deleted. -Joey

On Tue, May 24, 2011 at 10:22 PM, Mark question markq2...@gmail.com wrote: Hi guys, I'm using an NFS cluster consisting of 30 machines, but I have specified only 3 of the nodes to be my Hadoop cluster. So my problem is this: the datanode won't start on one of the nodes because of the following error: org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage /cs/student/mark/tmp/hodhod/dfs/data. The directory is already locked. I think it's because of NFS: once one node locks the directory, the second node can't lock it. So I had to change the following configuration: dfs.data.dir to be /tmp/hadoop-user/dfs/data. But this configuration is overridden by ${hadoop.tmp.dir}/dfs/data, where my hadoop.tmp.dir = /cs/student/mark/tmp, as you might guess from above. Where is this configuration overridden? I thought my core-site.xml had the final configuration values. Thanks, Mark

--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
LeaseExpirationException and 'leaseholder failing to recreate file': Could anything be done at run-time?
Hi All,

I am running a process to extract feature vectors from images and write them as SequenceFiles on HDFS. My dataset of images is very large (~46K images). The writing process worked fine for half of the run, but all of a sudden the following problem occurred:

org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /mnt/tmp/sirs-dataset-k1/feature-repo/features/109817 for DFSClient_148861898 on client 10.118.177.84, because current leaseholder is trying to recreate file.

On investigating, I found that the errors started appearing after a LeaseExpiredException:

org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /mnt/tmp/sirs-dataset-k1/feature-repo/features/109817 File is not open for writing. [Lease. Holder: DFSClient_148861898, pendingcreates: 1]
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /mnt/tmp/sirs-dataset-k1/feature-repo/features/109817 File is not open for writing. [Lease. Holder: DFSClient_148861898, pendingcreates: 1]

The process has already taken me 18-19 hrs and it would be very tough for me to restart the whole thing. Is there anything that can be done to fix it at run-time? (Maybe force-deleting the concerned file '/mnt/tmp/sirs-dataset-k1/feature-repo/features/109817' on HDFS?)

Regards,
Lokendra

*Detailed Log:*

2011-05-25 04:03:32,160 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 54310, call addBlock(/mnt/tmp/sirs-dataset-k1/feature-repo/features/109817, DFSClient_148861898) from 10.118.177.84:48372: error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /mnt/tmp/sirs-dataset-k1/feature-repo/features/109817 File is not open for writing. [Lease.
Holder: DFSClient_148861898, pendingcreates: 1]
org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /mnt/tmp/sirs-dataset-k1/feature-repo/features/109817 File is not open for writing. [Lease. Holder: DFSClient_148861898, pendingcreates: 1]
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1340)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1323)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1251)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
    at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:416)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
2011-05-25 04:03:32,175 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_-4965605132591592561 is added to invalidSet of 10.118.177.84:50010
2011-05-25 04:03:32,207 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=johndoe,johndoe ip=/10.118.177.84 cmd=delete src=/mnt/tmp/sirs-dataset-k1/feature-repo/imageList dst=null perm=null
2011-05-25 04:03:32,212 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=johndoe,johndoe ip=/10.118.177.84 cmd=create src=/mnt/tmp/sirs-dataset-k1/feature-repo/imageList dst=null perm=johndoe:supergroup:rw-r--r--
2011-05-25 04:03:32,215 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /mnt/tmp/sirs-dataset-k1/feature-repo/imageList. blk_6557263107434203565_332695
2011-05-25 04:03:32,695 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.registerDatanode: node registration from 10.118.177.84:50010 storage DS-199406591-10.118.177.84-50010-1306165949296
2011-05-25 04:03:32,696 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: /default-rack/10.118.177.84:50010
2011-05-25 04:03:32,696 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: /default-rack/10.118.177.84:50010
2011-05-25 04:03:33,045 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.100.245.5:50010 is added to blk_6557263107434203565_332695 size 11746349
2011-05-25 04:03:33,045 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /mnt/tmp/sirs-dataset-k1/feature-repo/imageList is closed by DFSClient_148861898
2011-05-25 04:03:33,404 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=johndoe,johndoe ip=/10.118.177.84 cmd=delete src=/mnt/tmp/sirs-dataset-k1/feature-repo/features/109817 dst=null perm=null
2011-05-25 04:03:33,405 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=johndoe,johndoe ip=/10.118.177.84 cmd=create
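(The detailed log is cut off here in the archive.) One observation on the first error: "because current leaseholder is trying to recreate file" means the very same DFSClient that still holds the lease on that path issued another create for it, which usually traces back to a SequenceFile writer that was never closed, or a retry loop re-opening the same output path. A minimal sketch of a per-file write pattern that releases the lease promptly, assuming feature vectors are stored as BytesWritable values (the class name, key scheme and literal path are placeholders, not code from the thread):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class FeatureWriterSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Placeholder output path and payload; the real code would loop over images.
            Path out = new Path("/mnt/tmp/feature-repo/features/109817");
            byte[] featureVector = new byte[] {1, 2, 3};

            SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
            try {
                writer.append(new Text("image-109817"), new BytesWritable(featureVector));
            } finally {
                // Closing releases the HDFS lease; a writer left open (or a retry that
                // re-creates the same path from the same client) is what produces
                // "current leaseholder is trying to recreate file".
                writer.close();
            }
        }
    }

Closing each writer before creating the next file means a retry of the same path does not collide with a lease the client itself is still holding.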