Re: fs.s3a.endpoint not working
http://www.jets3t.org/toolkit/configuration.html On Jan 14, 2016 10:56 AM, "Alexander Pivovarov" <apivova...@gmail.com> wrote: > Add jets3t.properties file with s3service.s3-endpoint= to > /etc/hadoop/conf folder > > The folder with the file should be in HADOOP_CLASSPATH > > JetS3t library which is used by hadoop is looking for this file. > On Dec 22, 2015 12:39 PM, "Phillips, Caleb" <caleb.phill...@nrel.gov> > wrote: > >> Hi All, >> >> New to this list. Looking for a bit of help: >> >> I'm having trouble connecting Hadoop to a S3-compatable (non AWS) object >> store. >> >> This issue was discussed, but left unresolved, in this thread: >> >> >> https://mail-archives.apache.org/mod_mbox/spark-user/201507.mbox/%3cca+0w_au5es_flugzmgwkkga3jya1asi3u+isjcuymfntvnk...@mail.gmail.com%3E >> >> And here, on Cloudera's forums (the second post is mine): >> >> >> https://community.cloudera.com/t5/Data-Ingestion-Integration/fs-s3a-endpoint-ignored-in-hdfs-site-xml/m-p/33694#M1180 >> >> I'm running Hadoop 2.6.3 with Java 1.8 (65) on a Linux host. Using >> Hadoop, I'm able to connect to S3 on AWS, and e.g., list/put/get files. >> >> However, when I point the fs.s3a.endpoint configuration directive at my >> non-AWS S3-Compatable object storage, it appears to still point at (and >> authenticate against) AWS. >> >> I've checked and double-checked my credentials and configuration using >> both Python's boto library and the s3cmd tool, both of which connect to >> this non-AWS data store just fine. >> >> Any help would be much appreciated. Thanks! >> >> -- >> Caleb Phillips, Ph.D. >> Data Scientist | Computational Science Center >> >> National Renewable Energy Laboratory (NREL) >> 15013 Denver West Parkway | Golden, CO 80401 >> 303-275-4297 | caleb.phill...@nrel.gov >> >> - >> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org >> For additional commands, e-mail: user-h...@hadoop.apache.org >> >>
Re: fs.s3a.endpoint not working
Add jets3t.properties file with s3service.s3-endpoint= to /etc/hadoop/conf folder The folder with the file should be in HADOOP_CLASSPATH JetS3t library which is used by hadoop is looking for this file. On Dec 22, 2015 12:39 PM, "Phillips, Caleb"wrote: > Hi All, > > New to this list. Looking for a bit of help: > > I'm having trouble connecting Hadoop to a S3-compatable (non AWS) object > store. > > This issue was discussed, but left unresolved, in this thread: > > > https://mail-archives.apache.org/mod_mbox/spark-user/201507.mbox/%3cca+0w_au5es_flugzmgwkkga3jya1asi3u+isjcuymfntvnk...@mail.gmail.com%3E > > And here, on Cloudera's forums (the second post is mine): > > > https://community.cloudera.com/t5/Data-Ingestion-Integration/fs-s3a-endpoint-ignored-in-hdfs-site-xml/m-p/33694#M1180 > > I'm running Hadoop 2.6.3 with Java 1.8 (65) on a Linux host. Using Hadoop, > I'm able to connect to S3 on AWS, and e.g., list/put/get files. > > However, when I point the fs.s3a.endpoint configuration directive at my > non-AWS S3-Compatable object storage, it appears to still point at (and > authenticate against) AWS. > > I've checked and double-checked my credentials and configuration using > both Python's boto library and the s3cmd tool, both of which connect to > this non-AWS data store just fine. > > Any help would be much appreciated. Thanks! > > -- > Caleb Phillips, Ph.D. > Data Scientist | Computational Science Center > > National Renewable Energy Laboratory (NREL) > 15013 Denver West Parkway | Golden, CO 80401 > 303-275-4297 | caleb.phill...@nrel.gov > > - > To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org > For additional commands, e-mail: user-h...@hadoop.apache.org > >
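For reference, a hedged sketch of the two configuration pieces discussed in this thread, with a placeholder endpoint host. The JetS3t properties apply to the older s3/s3n connectors; the fs.s3a.endpoint property in core-site.xml is what the s3a connector reads. The property names are taken from the jets3t.org configuration page and hadoop-aws; the exact port and HTTPS settings depend on the object store.

# /etc/hadoop/conf/jets3t.properties  (objectstore.example.com is a placeholder)
s3service.s3-endpoint=objectstore.example.com
s3service.https-only=false

<!-- core-site.xml entry read by the s3a connector (same placeholder host) -->
<property>
  <name>fs.s3a.endpoint</name>
  <value>objectstore.example.com</value>
</property>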
What settings do I need to access a remote HA cluster?
Hi everyone, I have 2 HA clusters, mydev and myqa. I want to be able to access hdfs://myqa/ paths from mydev cluster boxes. What settings should I add to mydev's hdfs-site.xml so that Hadoop can resolve the myqa HA alias to the active NN? Thank you Alex
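For reference, a minimal hdfs-site.xml sketch of the client-side settings this usually involves on the mydev side: the remote nameservice, its NameNodes, and a failover proxy provider. The NameNode hostnames and port are placeholders; the existing mydev entries stay as they are.

<property>
  <name>dfs.nameservices</name>
  <value>mydev,myqa</value>
</property>
<property>
  <name>dfs.ha.namenodes.myqa</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.myqa.nn1</name>
  <value>qa-nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.myqa.nn2</name>
  <value>qa-nn2.example.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.myqa</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>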
Re: How do I integrate Hadoop app development with Eclipse IDE?
1. Create a pom.xml for your project.
2. Add the Hadoop dependencies you need.
3. Run: mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true
4. Import the existing Java project into Eclipse.
On Wed, May 20, 2015 at 5:31 PM, Caesar Samsi caesarsa...@mac.com wrote: Hello, I’m embarking on my first tutorial and would like to have tooltip help as I hover my mouse pointer over Hadoop classes. I’ve found the Hadoop docs and Javadoc URL and configured them, but the tooltips still don’t show up. Thank you, Caesar.
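A minimal pom.xml sketch for steps 1-2; the groupId/artifactId and the Hadoop version are placeholders, and the single hadoop-client dependency pulls in the common client-side Hadoop APIs.

<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>hadoop-tutorial</artifactId>
  <version>1.0-SNAPSHOT</version>
  <dependencies>
    <!-- HDFS/MapReduce client APIs; pick the version matching your cluster -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.6.0</version>
    </dependency>
  </dependencies>
</project>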
Re: query uses WITH blocks and throws exception if run as Oozie hive action (hive-0.13.1)
I also noticed another error message in logs 10848 [main] ERROR org.apache.hadoop.hive.ql.exec.tez.TezJobMonitor - Status: Failed 10849 [main] ERROR org.apache.hadoop.hive.ql.exec.tez.TezJobMonitor - Vertex failed, vertexName=Map 32, vertexId=vertex_1431616132488_6430_1_24, diagnostics=[Vertex Input: dual initializer failed., org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: org.apache.commons.logging.impl.SLF4JLocationAwareLog Serialization trace: LOG (org.apache.hadoop.hive.ql.exec.UDTFOperator) childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator) childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator) aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork)] one of the WITH blocks had explode() UDTF I replaced it with select ... union all select ... union all select ... and query is working fine now. Do you know anything about UDTF and Kryo issues fixed after 0.13.1? On Fri, May 15, 2015 at 3:20 PM, Alexander Pivovarov apivova...@gmail.com wrote: Looks like it was fixed in hive-0.14 https://issues.apache.org/jira/browse/HIVE-7079 On Fri, May 15, 2015 at 2:26 PM, Alexander Pivovarov apivova...@gmail.com wrote: Hi Everyone I'm using hive-0.13.1 (HDP-2.1.5) and getting the following stacktrace if run my query (which has WITH block) via Oozie. (BTW, the query works fine in CLI) I can't put exact query but the structure is similar to create table my_consumer as with sacusaloan as (select distinct e,f,g from E) select A.a, A.b, A.c, if(sacusaloan.id is null, 0, 1) as sacusaloan_status from (select a,b,c from A) A left join sacusaloan on (...) 8799 [main] INFO hive.ql.parse.ParseDriver - Parse Completed 8799 [main] INFO org.apache.hadoop.hive.ql.log.PerfLogger - /PERFLOG method=parse start=1431723485500 end=1431723485602 duration=102 from=org.apache.hadoop.hive.ql.Driver 8799 [main] INFO org.apache.hadoop.hive.ql.log.PerfLogger - PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver 8834 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Starting Semantic Analysis 8837 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Creating table wk_qualified_outsource_loan_consumer position=13 8861 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Completed phase 1 of Semantic Analysis 8861 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Get metadata for source tables 8865 [main] ERROR hive.ql.metadata.Hive - NoSuchObjectException(message:default.sacusaloan table not found) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29338) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29306) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result.read(ThriftHiveMetastore.java:29237) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:1036) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:1022) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:997) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at 
java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy18.getTable(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:976) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:918) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1263) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1232) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:9252) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:427) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:323) at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:980) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1045) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:916) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:906) at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268
Re: query uses WITH blocks and throws exception if run as Oozie hive action (hive-0.13.1)
Looks like I found it https://issues.apache.org/jira/browse/HIVE-9409 public class UDTFOperator ... - protected final Log LOG = LogFactory.getLog(this.getClass().getName()); + protected static final Log LOG = LogFactory.getLog(UDTFOperator.class.getName()); On Fri, May 15, 2015 at 4:17 PM, Alexander Pivovarov apivova...@gmail.com wrote: I also noticed another error message in logs 10848 [main] ERROR org.apache.hadoop.hive.ql.exec.tez.TezJobMonitor - Status: Failed 10849 [main] ERROR org.apache.hadoop.hive.ql.exec.tez.TezJobMonitor - Vertex failed, vertexName=Map 32, vertexId=vertex_1431616132488_6430_1_24, diagnostics=[Vertex Input: dual initializer failed., org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: org.apache.commons.logging.impl.SLF4JLocationAwareLog Serialization trace: LOG (org.apache.hadoop.hive.ql.exec.UDTFOperator) childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator) childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator) aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork)] one of the WITH blocks had explode() UDTF I replaced it with select ... union all select ... union all select ... and query is working fine now. Do you know anything about UDTF and Kryo issues fixed after 0.13.1? On Fri, May 15, 2015 at 3:20 PM, Alexander Pivovarov apivova...@gmail.com wrote: Looks like it was fixed in hive-0.14 https://issues.apache.org/jira/browse/HIVE-7079 On Fri, May 15, 2015 at 2:26 PM, Alexander Pivovarov apivova...@gmail.com wrote: Hi Everyone I'm using hive-0.13.1 (HDP-2.1.5) and getting the following stacktrace if run my query (which has WITH block) via Oozie. (BTW, the query works fine in CLI) I can't put exact query but the structure is similar to create table my_consumer as with sacusaloan as (select distinct e,f,g from E) select A.a, A.b, A.c, if(sacusaloan.id is null, 0, 1) as sacusaloan_status from (select a,b,c from A) A left join sacusaloan on (...) 
8799 [main] INFO hive.ql.parse.ParseDriver - Parse Completed 8799 [main] INFO org.apache.hadoop.hive.ql.log.PerfLogger - /PERFLOG method=parse start=1431723485500 end=1431723485602 duration=102 from=org.apache.hadoop.hive.ql.Driver 8799 [main] INFO org.apache.hadoop.hive.ql.log.PerfLogger - PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver 8834 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Starting Semantic Analysis 8837 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Creating table wk_qualified_outsource_loan_consumer position=13 8861 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Completed phase 1 of Semantic Analysis 8861 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Get metadata for source tables 8865 [main] ERROR hive.ql.metadata.Hive - NoSuchObjectException(message:default.sacusaloan table not found) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29338) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29306) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result.read(ThriftHiveMetastore.java:29237) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:1036) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:1022) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:997) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy18.getTable(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:976) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:918) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1263) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1232) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:9252) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:427) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:323
Re: query uses WITH blocks and throws exception if run as Oozie hive action (hive-0.13.1)
Looks like it was fixed in hive-0.14 https://issues.apache.org/jira/browse/HIVE-7079 On Fri, May 15, 2015 at 2:26 PM, Alexander Pivovarov apivova...@gmail.com wrote: Hi Everyone I'm using hive-0.13.1 (HDP-2.1.5) and getting the following stacktrace if run my query (which has WITH block) via Oozie. (BTW, the query works fine in CLI) I can't put exact query but the structure is similar to create table my_consumer as with sacusaloan as (select distinct e,f,g from E) select A.a, A.b, A.c, if(sacusaloan.id is null, 0, 1) as sacusaloan_status from (select a,b,c from A) A left join sacusaloan on (...) 8799 [main] INFO hive.ql.parse.ParseDriver - Parse Completed 8799 [main] INFO org.apache.hadoop.hive.ql.log.PerfLogger - /PERFLOG method=parse start=1431723485500 end=1431723485602 duration=102 from=org.apache.hadoop.hive.ql.Driver 8799 [main] INFO org.apache.hadoop.hive.ql.log.PerfLogger - PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver 8834 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Starting Semantic Analysis 8837 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Creating table wk_qualified_outsource_loan_consumer position=13 8861 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Completed phase 1 of Semantic Analysis 8861 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Get metadata for source tables 8865 [main] ERROR hive.ql.metadata.Hive - NoSuchObjectException(message:default.sacusaloan table not found) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29338) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29306) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result.read(ThriftHiveMetastore.java:29237) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:1036) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:1022) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:997) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy18.getTable(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:976) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:918) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1263) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1232) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:9252) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:427) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:323) at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:980) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1045) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:916) 
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:906) at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:359) at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:456) at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:466) at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:749) at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625) at org.apache.oozie.action.hadoop.HiveMain.runHive(HiveMain.java:316) at org.apache.oozie.action.hadoop.HiveMain.run(HiveMain.java:277) at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:38) at org.apache.oozie.action.hadoop.HiveMain.main(HiveMain.java:66) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method
query uses WITH blocks and throws exception if run as Oozie hive action (hive-0.13.1)
Hi Everyone I'm using hive-0.13.1 (HDP-2.1.5) and getting the following stacktrace if run my query (which has WITH block) via Oozie. (BTW, the query works fine in CLI) I can't put exact query but the structure is similar to create table my_consumer as with sacusaloan as (select distinct e,f,g from E) select A.a, A.b, A.c, if(sacusaloan.id is null, 0, 1) as sacusaloan_status from (select a,b,c from A) A left join sacusaloan on (...) 8799 [main] INFO hive.ql.parse.ParseDriver - Parse Completed 8799 [main] INFO org.apache.hadoop.hive.ql.log.PerfLogger - /PERFLOG method=parse start=1431723485500 end=1431723485602 duration=102 from=org.apache.hadoop.hive.ql.Driver 8799 [main] INFO org.apache.hadoop.hive.ql.log.PerfLogger - PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver 8834 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Starting Semantic Analysis 8837 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Creating table wk_qualified_outsource_loan_consumer position=13 8861 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Completed phase 1 of Semantic Analysis 8861 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Get metadata for source tables 8865 [main] ERROR hive.ql.metadata.Hive - NoSuchObjectException(message:default.sacusaloan table not found) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29338) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29306) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result.read(ThriftHiveMetastore.java:29237) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:1036) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:1022) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:997) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy18.getTable(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:976) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:918) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1263) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1232) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:9252) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:427) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:323) at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:980) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1045) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:916) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:906) at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268) at 
org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:359) at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:456) at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:466) at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:749) at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625) at org.apache.oozie.action.hadoop.HiveMain.runHive(HiveMain.java:316) at org.apache.oozie.action.hadoop.HiveMain.run(HiveMain.java:277) at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:38) at org.apache.oozie.action.hadoop.HiveMain.main(HiveMain.java:66) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at
Re: how to load data
if your file is csv file then create table statement should specify CSVSerde - look at the examples under the links I sent you On Thu, Apr 30, 2015 at 10:23 PM, Kumar Jayapal kjayapa...@gmail.com wrote: Alex, I followed the same steps as mentioned in the site. Once I load data into table which is create below Created table CREATE TABLE raw (line STRING) PARTITIONED BY (FISCAL_YEAR smallint, FISCAL_PERIOD smallint) STORED AS TEXTFILE; and loaded it with data. LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw; when I say select * from raw it shows all null values. NULLNULLNULLNULLNULLNULLNULLNULL NULLNULLNULLNULLNULLNULLNULLNULL NULLNULLNULLNULLNULLNULLNULLNULL NULLNULLNULLNULLNULLNULLNULLNULL Why is not show showing the actual data in file. will it show once I load it to parque table? Please let me know if I am doing anything wrong. I appreciate your help. Thanks jay Thank you very much for you help Alex, On Wed, Apr 29, 2015 at 3:43 PM, Alexander Pivovarov apivova...@gmail.com wrote: 1. Create external textfile hive table pointing to /extract/DBCLOC and specify CSVSerde if using hive-0.14 and newer use this https://cwiki.apache.org/confluence/display/Hive/CSV+Serde if hive-0.13 and older use https://github.com/ogrodnek/csv-serde You do not even need to unzgip the file. hive automatically unzgip data on select. 2. run simple query to load data insert overwrite table orc_table select * from csv_table On Wed, Apr 29, 2015 at 3:26 PM, Kumar Jayapal kjayapa...@gmail.com wrote: Hello All, I have this table CREATE TABLE DBCLOC( BLwhse int COMMENT 'DECIMAL(5,0) Whse', BLsdat string COMMENT 'DATE Sales Date', BLreg_num smallint COMMENT 'DECIMAL(3,0) Reg#', BLtrn_num int COMMENT 'DECIMAL(5,0) Trn#', BLscnr string COMMENT 'CHAR(1) Scenario', BLareq string COMMENT 'CHAR(1) Act Requested', BLatak string COMMENT 'CHAR(1) Act Taken', BLmsgc string COMMENT 'CHAR(3) Msg Code') PARTITIONED BY (FSCAL_YEAR smallint, FSCAL_PERIOD smallint) STORED AS PARQUET; have to load from hdfs location /extract/DBCLOC/DBCL0301P.csv.gz to the table above Can any one tell me what is the most efficient way of doing it. Thanks Jay
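A hedged sketch of what the advice above looks like for the DBCLOC file, using the built-in OpenCSVSerde from the Hive 0.14+ link (on Hive 0.13 the equivalent class comes from the ogrodnek csv-serde jar). OpenCSVSerde reads every column as a string, so the columns are declared STRING here and are cast when loading the final Parquet table; the staging table name is made up.

-- step 1: external CSV-backed staging table over the existing .gz files
CREATE EXTERNAL TABLE dbcloc_csv (
  blwhse STRING, blsdat STRING, blreg_num STRING, bltrn_num STRING,
  blscnr STRING, blareq STRING, blatak STRING, blmsgc STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION '/extract/DBCLOC';

-- step 2: cast and load into the final Parquet table, e.g.
-- INSERT OVERWRITE TABLE dbcloc PARTITION (fscal_year=2015, fscal_period=4)
--   SELECT cast(blwhse AS int), blsdat, cast(blreg_num AS smallint), ... FROM dbcloc_csv;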
Re: how to load data
Follow the links I sent you already. On Apr 30, 2015 11:52 AM, Kumar Jayapal kjayapa...@gmail.com wrote: Hi Alex, How to create external textfile hive table pointing to /extract/DBCLOC and specify CSVSerde ? Thanks Jay On Wed, Apr 29, 2015 at 3:43 PM, Alexander Pivovarov apivova...@gmail.com wrote: 1. Create external textfile hive table pointing to /extract/DBCLOC and specify CSVSerde if using hive-0.14 and newer use this https://cwiki.apache.org/confluence/display/Hive/CSV+Serde if hive-0.13 and older use https://github.com/ogrodnek/csv-serde You do not even need to unzgip the file. hive automatically unzgip data on select. 2. run simple query to load data insert overwrite table orc_table select * from csv_table On Wed, Apr 29, 2015 at 3:26 PM, Kumar Jayapal kjayapa...@gmail.com wrote: Hello All, I have this table CREATE TABLE DBCLOC( BLwhse int COMMENT 'DECIMAL(5,0) Whse', BLsdat string COMMENT 'DATE Sales Date', BLreg_num smallint COMMENT 'DECIMAL(3,0) Reg#', BLtrn_num int COMMENT 'DECIMAL(5,0) Trn#', BLscnr string COMMENT 'CHAR(1) Scenario', BLareq string COMMENT 'CHAR(1) Act Requested', BLatak string COMMENT 'CHAR(1) Act Taken', BLmsgc string COMMENT 'CHAR(3) Msg Code') PARTITIONED BY (FSCAL_YEAR smallint, FSCAL_PERIOD smallint) STORED AS PARQUET; have to load from hdfs location /extract/DBCLOC/DBCL0301P.csv.gz to the table above Can any one tell me what is the most efficient way of doing it. Thanks Jay
Re: How to move a .gz file back from Hive to HDFS
Try to find the file in the HDFS trash. On Apr 30, 2015 2:14 PM, Kumar Jayapal kjayapa...@gmail.com wrote: Hi, I loaded one file with a .gz extension into a Hive table. The file was moved/deleted from HDFS, and when I execute a select command I get an error. Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2) How can I move the file back to HDFS? Thanks Jay
Re: How to move a .gz file back from Hive to HDFS
try desc formatted table_name; it shows you table location on hdfs On Thu, Apr 30, 2015 at 2:43 PM, Kumar Jayapal kjayapa...@gmail.com wrote: I did not find it in .Trash file is moved to hive table I want to move it back to hdfs. On Thu, Apr 30, 2015 at 2:20 PM, Alexander Pivovarov apivova...@gmail.com wrote: Try to find the file in hdfs trash On Apr 30, 2015 2:14 PM, Kumar Jayapal kjayapa...@gmail.com wrote: Hi, I loaded one file to hive table it is in .gz extension. file is moved/deleted from hdfs. when I execute select command I get an error. Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2) how can I move back the file to HDFS. how can I do it. Thanks Jay
Re: how to load data
1. Create external textfile hive table pointing to /extract/DBCLOC and specify CSVSerde if using hive-0.14 and newer use this https://cwiki.apache.org/confluence/display/Hive/CSV+Serde if hive-0.13 and older use https://github.com/ogrodnek/csv-serde You do not even need to unzgip the file. hive automatically unzgip data on select. 2. run simple query to load data insert overwrite table orc_table select * from csv_table On Wed, Apr 29, 2015 at 3:26 PM, Kumar Jayapal kjayapa...@gmail.com wrote: Hello All, I have this table CREATE TABLE DBCLOC( BLwhse int COMMENT 'DECIMAL(5,0) Whse', BLsdat string COMMENT 'DATE Sales Date', BLreg_num smallint COMMENT 'DECIMAL(3,0) Reg#', BLtrn_num int COMMENT 'DECIMAL(5,0) Trn#', BLscnr string COMMENT 'CHAR(1) Scenario', BLareq string COMMENT 'CHAR(1) Act Requested', BLatak string COMMENT 'CHAR(1) Act Taken', BLmsgc string COMMENT 'CHAR(3) Msg Code') PARTITIONED BY (FSCAL_YEAR smallint, FSCAL_PERIOD smallint) STORED AS PARQUET; have to load from hdfs location /extract/DBCLOC/DBCL0301P.csv.gz to the table above Can any one tell me what is the most efficient way of doing it. Thanks Jay
Re: sorting in hive -- general
1. sort by - key are distributed according to MR partitioner (controlled by distributed by in hive) Lets assume hash partitioned uses the same column as sort by and uses x mod 16 formula to get reducer id reduced 0 will have keys 0 16 32 reducer 1 will have keys 1 17 33 if you merge reducer 0 and reducer 1 output you will have 0 16 32 1 17 33 2. order by will use 1 reducer and hive will send all keys to reducer 0 So order by in hive works different from terasort. In case of terasort you can merge output files and get one file with globally sorted data. On Sun, Mar 8, 2015 at 7:55 AM, max scalf oracle.bl...@gmail.com wrote: Thank you Alexander. So is it fair to assume when sort by is used and multiple files are produced per reducer at the end of it all of then are put togeather/merged to get the results back? And can sort by be used without distributed by and expect same result as order by ? On Sat, Mar 7, 2015 at 7:05 PM, Alexander Pivovarov apivova...@gmail.com wrote: sort by query produces multiple independent files. order by - just one file usually sort by is used with distributed by. In older hive versions (0.7) they might be used to implement local sort within partition similar to RANK() OVER (PARTITION BY A ORDER BY B) On Sat, Mar 7, 2015 at 3:02 PM, max scalf oracle.bl...@gmail.com wrote: Hello all, I am a new to hadoop and hive in general and i am reading hadoop the definitive guide by Tom White and on page 504 for the hive chapter, Tom says below with regards to soritng *Sorting and Aggregating* *Sorting data in Hive can be achieved by using a standard ORDER BY clause. ORDER BY performs a parallel total sort of the input (like that described in “Total Sort” on page 261). When a globally sorted result is not required—and in many cases it isn’t—you can use Hive’s nonstandard extension, SORT BY, instead. SORT BY produces a sorted file per reducer.* My Questions is, what exactly does he mean by globally sorted result?, if the sort by operation produces a sorted file per reducer does that mean at the end of the sort all the reducer are put back together to give the correct results ?
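A small HiveQL illustration of the two cases described above (the table and column names are made up):

-- ORDER BY: one reducer, a single globally sorted output file
SELECT k, v FROM t ORDER BY k;

-- DISTRIBUTE BY + SORT BY: rows with the same key go to the same reducer,
-- and each reducer writes its own independently sorted file
SELECT k, v FROM t DISTRIBUTE BY k SORT BY k;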
Re: sorting in hive -- general
sort by query produces multiple independent files. order by - just one file usually sort by is used with distributed by. In older hive versions (0.7) they might be used to implement local sort within partition similar to RANK() OVER (PARTITION BY A ORDER BY B) On Sat, Mar 7, 2015 at 3:02 PM, max scalf oracle.bl...@gmail.com wrote: Hello all, I am a new to hadoop and hive in general and i am reading hadoop the definitive guide by Tom White and on page 504 for the hive chapter, Tom says below with regards to soritng *Sorting and Aggregating* *Sorting data in Hive can be achieved by using a standard ORDER BY clause. ORDER BY performs a parallel total sort of the input (like that described in “Total Sort” on page 261). When a globally sorted result is not required—and in many cases it isn’t—you can use Hive’s nonstandard extension, SORT BY, instead. SORT BY produces a sorted file per reducer.* My Questions is, what exactly does he mean by globally sorted result?, if the sort by operation produces a sorted file per reducer does that mean at the end of the sort all the reducer are put back together to give the correct results ?
Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?
what about DNS? if you have 2 computers (nn and dn) how nn knows dn ip? The script puts only this computer ip to /etc/hosts On Thu, Mar 5, 2015 at 6:39 PM, max scalf oracle.bl...@gmail.com wrote: Here is a easy way to go about assigning static name to your ec2 instance. When you get the launch an EC2-instance from aws console when you get to the point of selecting VPC, ip address screen there is a screen that says USER DATA...put the below in with appropriate host name(change CHANGE_HOST_NAME_HERE to whatever you want) and that should be able to get you static name. #!/bin/bash HOSTNAME_TAG=CHANGE_HOST_NAME_HERE cat /etc/sysconfig/network EOF NETWORKING=yes NETWORKING_IPV6=no HOSTNAME=${HOSTNAME_TAG} EOF IP=$(curl http://169.254.169.254/latest/meta-data/local-ipv4) echo ${IP} ${HOSTNAME_TAG}.localhost ${HOSTNAME_TAG} /etc/hosts echo ${HOSTNAME_TAG} /proc/sys/kernel/hostname service network restart Also note i was able to do this on couple of spot instance for cheap price, only thing is once you shut it down or someone outbids you, you loose that instance but its easy/cheap to play around with and i have used couple of m3.medium for my NN/SNN and couple of them for data nodes... On Thu, Mar 5, 2015 at 7:19 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: I dont know how you would do that to be honest. With EMR you have destinctions master core and task nodes. If you need to change configuration you just ssh into the EMR master node. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 02:11, Alexander Pivovarov wrote: What is the easiest way to assign names to aws ec2 computers? I guess computer need static hostname and dns name before it can be used in hadoop cluster. On Mar 5, 2015 4:36 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: When I started with EMR it was alot of testing and trial and error. HUE is already supported as something that can be installed from the AWS console. What I need to know is if you need this cluster on all the time or this is goign ot be what amazon call a transient cluster. Meaning you fire it up run the job and tear it back down. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 01:10, Krish Donald wrote: Thanks Jonathan, I will try to explore EMR option also. Can you please let me know the configuration which you have used it? Can you please recommend for me also? I would like to setup Hadoop cluster using cloudera manager and then would like to do below things: setup kerberos setup federation setup monitoring setup hadr backup and recovery authorization using sentry backup and recovery of individual componenets performamce tuning upgrade of cdh upgrade of CM Hue User Administration Spark Solr Thanks Krish On Thu, Mar 5, 2015 at 3:57 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: krish EMR wont cost you much with all the testing and data we ran through the test systems as well as the large amont of data when everythign was read we paid about 15.00 USD. I honestly do not think that the specs there would be enough as java can be pretty ram hungry. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 00:41, Krish Donald wrote: Hi, I am new to AWS and would like to setup Hadoop cluster using cloudera manager for 6-7 nodes. t2.micro on AWS; Is it enough for setting up Hadoop cluster ? I would like to use free service as of now. Please advise. Thanks Krish
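The shell redirection operators in the quoted user-data script appear to have been stripped by the mail archive; a reconstructed sketch of the same idea (CHANGE_HOST_NAME_HERE remains a placeholder) would look roughly like this:

#!/bin/bash
HOSTNAME_TAG=CHANGE_HOST_NAME_HERE
# rewrite /etc/sysconfig/network with the chosen hostname
cat > /etc/sysconfig/network <<EOF
NETWORKING=yes
NETWORKING_IPV6=no
HOSTNAME=${HOSTNAME_TAG}
EOF
# look up this instance's private IP from the EC2 metadata service
IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
# map the name to the IP locally and apply the hostname
echo "${IP} ${HOSTNAME_TAG}.localhost ${HOSTNAME_TAG}" >> /etc/hosts
echo "${HOSTNAME_TAG}" > /proc/sys/kernel/hostname
service network restart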
Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?
I think EMR has its own limitation e.g. I want to setup hadoop 2.6.0 with kerberos + hive-1.2.0 to test my hive patch. How EMR can help me? it supports hadoop up to 2.4.0 (not even 2.4.1) http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-hadoop-version.html On Thu, Mar 5, 2015 at 9:51 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: Hi guys I know you guys want to keep costs down, but why go through all the effort to setup ec2 instances when you deploy EMR it takes the time to provision and setup the ec2 instances for you. All configuration then for the entire cluster is done on the master node of the particular cluster or setting up of additional software that is all done through the EMR console. We were doing some geospatial calculations and we loaded a 3rd party jar file called esri into the EMR cluster. I then had to pass a small bootstrap action (script) to have it distribute esri to the entire cluster. Why are you guys reinventing the wheel? --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 03:35, Alexander Pivovarov wrote: I found the following solution to this problem I registered 2 subdomains (public and local) for each computer on https://freedns.afraid.org/subdomain/ e.g. myhadoop-nn.crabdance.com myhadoop-nn-local.crabdance.com then I added cron job which sends http requests to update public and local ip on freedns server hint: public ip is detected automatically ip address for local name can be set using request parameter address=10.x.x.x (don't forget to escape ) as a result my nn computer has 2 DNS names with currently assigned ip addresses , e.g. myhadoop-nn.crabdance.com 54.203.181.177 myhadoop-nn-local.crabdance.com 10.220.149.103 in hadoop configuration I can use local machine names to access my cluster outside of AWS I can use public names Just curious if AWS provides easier way to name EC2 computers? On Thu, Mar 5, 2015 at 5:19 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: I dont know how you would do that to be honest. With EMR you have destinctions master core and task nodes. If you need to change configuration you just ssh into the EMR master node. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 02:11, Alexander Pivovarov wrote: What is the easiest way to assign names to aws ec2 computers? I guess computer need static hostname and dns name before it can be used in hadoop cluster. On Mar 5, 2015 4:36 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: When I started with EMR it was alot of testing and trial and error. HUE is already supported as something that can be installed from the AWS console. What I need to know is if you need this cluster on all the time or this is goign ot be what amazon call a transient cluster. Meaning you fire it up run the job and tear it back down. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 01:10, Krish Donald wrote: Thanks Jonathan, I will try to explore EMR option also. Can you please let me know the configuration which you have used it? Can you please recommend for me also? 
I would like to setup Hadoop cluster using cloudera manager and then would like to do below things: setup kerberos setup federation setup monitoring setup hadr backup and recovery authorization using sentry backup and recovery of individual componenets performamce tuning upgrade of cdh upgrade of CM Hue User Administration Spark Solr Thanks Krish On Thu, Mar 5, 2015 at 3:57 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: krish EMR wont cost you much with all the testing and data we ran through the test systems as well as the large amont of data when everythign was read we paid about 15.00 USD. I honestly do not think that the specs there would be enough as java can be pretty ram hungry. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 00:41, Krish Donald wrote: Hi, I am new to AWS and would like to setup Hadoop cluster using cloudera manager for 6-7 nodes. t2.micro on AWS; Is it enough for setting up Hadoop cluster ? I would like to use free service as of now. Please advise. Thanks Krish
Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?
What is the easiest way to assign names to aws ec2 computers? I guess computer need static hostname and dns name before it can be used in hadoop cluster. On Mar 5, 2015 4:36 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: When I started with EMR it was alot of testing and trial and error. HUE is already supported as something that can be installed from the AWS console. What I need to know is if you need this cluster on all the time or this is goign ot be what amazon call a transient cluster. Meaning you fire it up run the job and tear it back down. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 01:10, Krish Donald wrote: Thanks Jonathan, I will try to explore EMR option also. Can you please let me know the configuration which you have used it? Can you please recommend for me also? I would like to setup Hadoop cluster using cloudera manager and then would like to do below things: setup kerberos setup federation setup monitoring setup hadr backup and recovery authorization using sentry backup and recovery of individual componenets performamce tuning upgrade of cdh upgrade of CM Hue User Administration Spark Solr Thanks Krish On Thu, Mar 5, 2015 at 3:57 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: krish EMR wont cost you much with all the testing and data we ran through the test systems as well as the large amont of data when everythign was read we paid about 15.00 USD. I honestly do not think that the specs there would be enough as java can be pretty ram hungry. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 00:41, Krish Donald wrote: Hi, I am new to AWS and would like to setup Hadoop cluster using cloudera manager for 6-7 nodes. t2.micro on AWS; Is it enough for setting up Hadoop cluster ? I would like to use free service as of now. Please advise. Thanks Krish
Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?
ok, how we can easily put all hadoop computer names and IPs to /etc/hosts on all computers? Do you have a script? or I need manually go to each computer, get its ip and put it to /etc/hosts and then distribute /etc/hosts to all machines? Don't you think one time effort to configure freedns is easier? freedns solution works with AWS spot-instances as well. You need to create snapshot after you configure freedns, hadoop, etc on particular box. Next time you need computer you can can go to your saved snapshots and create spot-instance from it. On Thu, Mar 5, 2015 at 6:54 PM, max scalf oracle.bl...@gmail.com wrote: unfortunately without DNS you have to rely on /etc/hosts, so put in entry for all your nodes(nn,snn,dn1,dn2 etc..) on all nodes(/etc/hosts file) and i have that tested for hortonworks(using ambari) and cloudera manager and i am certainly sure it will work for MapR On Thu, Mar 5, 2015 at 8:47 PM, Alexander Pivovarov apivova...@gmail.com wrote: what about DNS? if you have 2 computers (nn and dn) how nn knows dn ip? The script puts only this computer ip to /etc/hosts On Thu, Mar 5, 2015 at 6:39 PM, max scalf oracle.bl...@gmail.com wrote: Here is a easy way to go about assigning static name to your ec2 instance. When you get the launch an EC2-instance from aws console when you get to the point of selecting VPC, ip address screen there is a screen that says USER DATA...put the below in with appropriate host name(change CHANGE_HOST_NAME_HERE to whatever you want) and that should be able to get you static name. #!/bin/bash HOSTNAME_TAG=CHANGE_HOST_NAME_HERE cat /etc/sysconfig/network EOF NETWORKING=yes NETWORKING_IPV6=no HOSTNAME=${HOSTNAME_TAG} EOF IP=$(curl http://169.254.169.254/latest/meta-data/local-ipv4) echo ${IP} ${HOSTNAME_TAG}.localhost ${HOSTNAME_TAG} /etc/hosts echo ${HOSTNAME_TAG} /proc/sys/kernel/hostname service network restart Also note i was able to do this on couple of spot instance for cheap price, only thing is once you shut it down or someone outbids you, you loose that instance but its easy/cheap to play around with and i have used couple of m3.medium for my NN/SNN and couple of them for data nodes... On Thu, Mar 5, 2015 at 7:19 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: I dont know how you would do that to be honest. With EMR you have destinctions master core and task nodes. If you need to change configuration you just ssh into the EMR master node. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 02:11, Alexander Pivovarov wrote: What is the easiest way to assign names to aws ec2 computers? I guess computer need static hostname and dns name before it can be used in hadoop cluster. On Mar 5, 2015 4:36 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: When I started with EMR it was alot of testing and trial and error. HUE is already supported as something that can be installed from the AWS console. What I need to know is if you need this cluster on all the time or this is goign ot be what amazon call a transient cluster. Meaning you fire it up run the job and tear it back down. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 01:10, Krish Donald wrote: Thanks Jonathan, I will try to explore EMR option also. Can you please let me know the configuration which you have used it? Can you please recommend for me also? 
I would like to setup Hadoop cluster using cloudera manager and then would like to do below things: setup kerberos setup federation setup monitoring setup hadr backup and recovery authorization using sentry backup and recovery of individual componenets performamce tuning upgrade of cdh upgrade of CM Hue User Administration Spark Solr Thanks Krish On Thu, Mar 5, 2015 at 3:57 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: krish EMR wont cost you much with all the testing and data we ran through the test systems as well as the large amont of data when everythign was read we paid about 15.00 USD. I honestly do not think that the specs there would be enough as java can be pretty ram hungry. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 00:41, Krish Donald wrote: Hi, I am new to AWS and would like to setup Hadoop cluster using cloudera manager for 6-7 nodes. t2.micro on AWS; Is it enough for setting up Hadoop cluster ? I would like to use free service as of now. Please advise. Thanks Krish
Re: Kerberos Security in Hadoop
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SecureMode.html https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Managing_Smart_Cards/Configuring_a_Kerberos_5_Server.html On Wed, Feb 18, 2015 at 4:49 PM, Krish Donald gotomyp...@gmail.com wrote: Hi, Has anybody worked on Kerberos security on Hadoop? Can you please guide me? Any document link will be appreciated. Thanks Krish
Re: Kerberos Security in Hadoop
http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_sg_authentication.html On Wed, Feb 18, 2015 at 5:49 PM, Manoj Samel manojsamelt...@gmail.com wrote: Cloudera also has good documentation on setting up a Kerberos-based cluster - this can be used even if you are not using Cloudera Manager to set up your cluster. On Wed, Feb 18, 2015 at 4:51 PM, Alexander Pivovarov apivova...@gmail.com wrote: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SecureMode.html https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Managing_Smart_Cards/Configuring_a_Kerberos_5_Server.html On Wed, Feb 18, 2015 at 4:49 PM, Krish Donald gotomyp...@gmail.com wrote: Hi, Has anybody worked on Kerberos security on Hadoop? Can you please guide me? Any document link will be appreciated. Thanks Krish
Re: Copying many files to HDFS
Hi Kevin, What is the network throughput between 1. the NFS server and the client machine? 2. the client machine and the datanodes? Alex On Feb 13, 2015 5:29 AM, Kevin kevin.macksa...@gmail.com wrote: Hi, I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a thousand or so files into HDFS, which totals roughly 1 TB. The cluster will be isolated on its own private LAN with a single client machine that is connected to the Hadoop cluster as well as the public network. The data that needs to be copied into HDFS is mounted as an NFS share on the client machine. I can run `hadoop fs -put` concurrently on the client machine to try and increase the throughput. If these files were able to be accessed by each node in the Hadoop cluster, then I could write a MapReduce job to copy a number of files from the network into HDFS. I could not find anything in the documentation saying that `distcp` works with locally hosted files (its code in the tools package doesn't show any sign of it either) - but I wouldn't expect it to. In general, are there any other ways of copying a very large number of client-local files to HDFS? I searched the mail archives for a similar question and didn't come across one. I'm sorry if this is a duplicate question. Thanks for your time, Kevin
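One simple way to run several puts concurrently from the client, as Kevin describes, is to fan the file list out with xargs; a sketch, where the NFS mount point, the HDFS target directory, and the parallelism level are all placeholders:

# copy files from the NFS mount into HDFS, 8 single-file puts at a time
find /mnt/nfs/data -type f | xargs -P 8 -I{} hadoop fs -put {} /user/kevin/incoming/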
Re: Building for Windows
try mvn package -Pdist -Dtar -DskipTests On Wed, Feb 11, 2015 at 2:02 PM, Lucio Crusca lu...@sulweb.org wrote: Hello everybody, I'm absolutely new to hadoop and a customer asked me to build version 2.6 for Windows Server 2012 R2. I'm myself a java programmer, among other things, but I've never used hadoop before. I've downloaded and installed JDK7, Maven, Cygwin (for sh, mv, gzip, ...) and other toys specified in the BUILDING.txt file bundled with hadoop sources. I also set the PATH and other environment variables (JAVA_HOME, ZLIB_HOME, Platform, ...). Running mvn package -Pdist -Dtar it compiles everything till MiniKDC included, but then it fails a test after compiling Auth. Here you can see the full output of the mvn command: http://hastebin.com/aqixebojuv.tex Can you help me understand what I'm doing wrong? Thanks in advance Lucio.
Re: Building for Windows
in addition to skipTests you want to add native-win profile mvn clean package -Pdist,native-win -DskipTests -Dtar this command must be run from a Windows SDK command prompt (not cygwin) as documented in BUILDING.txt. A successful build generates a binary hadoop .tar.gz package in hadoop-dist\target\. https://wiki.apache.org/hadoop/Hadoop2OnWindows https://svn.apache.org/viewvc/hadoop/common/branches/branch-2/BUILDING.txt?view=markup On Wed, Feb 11, 2015 at 3:09 PM, Alexander Pivovarov apivova...@gmail.com wrote: try mvn package -Pdist -Dtar -DskipTests On Wed, Feb 11, 2015 at 2:02 PM, Lucio Crusca lu...@sulweb.org wrote: Hello everybody, I'm absolutely new to hadoop and a customer asked me to build version 2.6 for Windows Server 2012 R2. I'm myself a java programmer, among other things, but I've never used hadoop before. I've downloaded and installed JDK7, Maven, Cygwin (for sh, mv, gzip, ...) and other toys specified in the BUILDING.txt file bundled with hadoop sources. I also set the PATH and other environment variables (JAVA_HOME, ZLIB_HOME, Platform, ...). Running mvn package -Pdist -Dtar it compiles everything till MiniKDC included, but then it fails a test after compiling Auth. Here you can see the full output of the mvn command: http://hastebin.com/aqixebojuv.tex Can you help me understand what I'm doing wrong? Thanks in advance Lucio.
Re: Building for Windows
There are about 3000 tests. You need a particular box configuration to run all of them successfully, and you should have lots of memory. It takes at least 1 hour to run all tests. Look at the Hadoop pre-commit builds on Jenkins: https://builds.apache.org/job/PreCommit-HADOOP-Build/ On Wed, Feb 11, 2015 at 3:55 PM, Lucio Crusca lu...@sulweb.org wrote: On Wednesday 11 February 2015 at 15:17:23, Alexander Pivovarov wrote: in addition to skipTests you want to add the native-win profile: mvn clean package -Pdist,native-win -DskipTests -Dtar Ok thanks but... what's the point of having tests in place if you have to skip them in order to build? this command must be run from a Windows SDK command prompt (not cygwin) Yes, I was already doing that; Cygwin is only installed to provide a few required unix commands (from BUILDING.txt: * Unix command-line tools from GnuWin32 or Cygwin: sh, mkdir, rm, cp, tar, gzip). For some reason I don't remember, I had problems with GnuWin32 and went for Cygwin instead.
Re: Multiple separate Hadoop clusters on same physical machines
Start several VMs and install Hadoop on each VM. Keywords: KVM, QEMU. On Mon, Jan 26, 2015 at 1:18 AM, Harun Reşit Zafer harun.za...@tubitak.gov.tr wrote: Hi everyone, We have set up and been playing with Hadoop 1.2.x and its friends (HBase, Pig, Hive etc.) on 7 physical servers. We want to test Hadoop (maybe different versions) and its ecosystem on physical machines (virtualization is not an option) from different perspectives. As a bunch of developers we would like to work in parallel, with every team member playing with his/her own cluster. However, we have a limited number of servers (strong machines though). So the question is: by changing port numbers, environment variables and other configuration parameters, is it possible to set up several independent clusters on the same physical machines? Are there any constraints? What are the possible difficulties we are likely to face? Thanks in advance -- Harun Reşit Zafer TÜBİTAK BİLGEM BTE Bulut Bilişim ve Büyük Veri Analiz Sistemleri Bölümü T +90 262 675 3268 W http://www.hrzafer.com
Re: way to add custom udf jar in hadoop 2.x version
I found that the easiest way is to put the UDF jar into /usr/lib/hadoop-mapred on all computers in the cluster. Hive CLI, HiveServer2, the Oozie launcher, Oozie hive actions and MR will then see the jar. I'm using HDP-2.1.5. On Dec 30, 2014 10:58 PM, reena upadhyay reena2...@gmail.com wrote: Hi, I am using hadoop 2.4.0. I have created a custom UDF jar. I am trying to execute a simple select UDF query using a Java Hive JDBC client program. When Hive executes the query using a map reduce job, the query execution fails because the mapper is not able to locate the UDF class. So I want to add the UDF jar to the hadoop environment permanently. Please suggest a way to add this external jar for single node and multi node hadoop clusters. PS: I am using hive 0.13.1 and I already have this custom UDF jar added in the HIVE_HOME/lib directory. Thanks
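Two hedged ways to make that concrete. The first is the copy-to-every-node approach from the reply (the host names and jar name are placeholders); the second uses the permanent-function syntax available since Hive 0.13, which references a jar kept on HDFS so every node can fetch it (the class name and paths are also placeholders).

# push the UDF jar to the directory mentioned above, on every node
for h in node1 node2 node3; do scp my-udfs.jar $h:/usr/lib/hadoop-mapred/; done

-- or register a permanent function backed by a jar on HDFS (Hive 0.13+)
CREATE FUNCTION my_udf AS 'com.example.udf.MyUDF' USING JAR 'hdfs:///user/hive/jars/my-udfs.jar';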
Re: Hardware requirements for simple node hadoop cluster
For a balanced configuration you need, per core: 1-1.5 x 2 TB 7200 rpm SATA HDDs for HDFS (in JBOD mode, not RAID) and 3-4 GB of ECC RAM. Reserve 4 GB RAM for the OS; it is better to use a separate HDD or USB stick for the OS. E.g. for 16 cores you can use 16-24 x 2 TB HDDs and 64 GB RAM (if planning to use Apache Spark, put in 128 GB). On Sun, Dec 7, 2014 at 12:08 AM, Amjad Syed amjad...@gmail.com wrote: Hello, We are trying to do a proof of concept at our data center with a two node Hadoop cluster. We have two (dual socket quad core) HP ProLiant DL380 G6 servers we want to utilize for this test. Can anyone please recommend the minimum HDD and RAM requirements for both servers? Thanks
Re: High Availability hadoop cluster.
2 boxes for the 2 NNs (dedicated boxes are better), a minimum of 3 JNs and a minimum of 3 ZKs; the JNs and ZKs can share boxes with other services. On Wed, Nov 5, 2014 at 11:31 PM, Oleg Ruchovets oruchov...@gmail.com wrote: Hello. We are using the Hortonworks distribution and want to evaluate HA capabilities. Can the community please share best practices and potential problems? Is additional hardware required? Thanks Oleg.
Re: High Availability hadoop cluster.
For 17 box cluster it's probably good to run 5 ZKs and 5 JNs So, run 2 ZKs on 2 NNs 3 ZKs on 3 DNs same for JNs you can start additional ZKs and JNs after you are done with initial Enabling HA in Ambari On Thu, Nov 6, 2014 at 3:01 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Great. Thank you for the link. Just to be sure - JN can be installed on data nodes like zookeeper? If we have 2 Name Nodes and 15 Data Nodes - is it correct to install ZK and JN on datanodes machines? Thanks Oleg. On Thu, Nov 6, 2014 at 5:06 PM, Alexander Pivovarov apivova...@gmail.com wrote: To Enable HA open Ambari, go to Admin, select HA, click enable HA http://dev.hortonworks.com.s3.amazonaws.com/HDPDocuments/HDP2/HDP-2.0.9.0/bk_using_Ambari_book/content/install-ha_2x.html On Thu, Nov 6, 2014 at 12:45 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Our deployment is 15 data nodes and 2 NN. ZKs installed also as part of hortonworks distributions. Sorry for dummy question - what is JN is? Can you please please point me on some manual wiki for installation / configuration. Thanks Oleg. On Thu, Nov 6, 2014 at 4:04 PM, Alexander Pivovarov apivova...@gmail.com wrote: 2 boxes for 2 NNs (better dedicated boxes) min 3 JNs min 3 ZKs JNs and ZKs can share boxes with other services On Wed, Nov 5, 2014 at 11:31 PM, Oleg Ruchovets oruchov...@gmail.com wrote: Hello. We are using hortonwork distribution and want to evaluate HA capabilities. Can community please share the best practices and potential problems? Is it required additional hardware? Thanks Oleg.
Re: problems with Hadoop instalation
Are RHEL7 based OSs supported? On Wed, Oct 29, 2014 at 3:59 PM, David Novogrodsky david.novogrod...@gmail.com wrote: All, I am new to Hadoop so any help would be appreciated. I have a question for the mailing list regarding Hadoop. I have installed the most recent stable version (2.4.1) on a virtual machine running CentOS 7. I have tried to run this command %Hadoop -fs ls but without success. The question is, what does Hadoop consider a valid JAVA_HOME directory? And where should the JAVA_HOME directory variable be defined? I installed Java using the package manager yum. I installed the most recent version, detailed below. This is in my .bashrc file: # The java implementation to use. export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64 [david@localhost ~]$ hadoop fs -ls /usr/local/hadoop/bin/hadoop: line 133: /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/bin/java: No such file or directory Then I tried this value for JAVA_HOME in my .bashrc file: /usr/bin/java. [david@localhost ~]$ which java /usr/bin/java [david@localhost ~]$ java -version java version 1.7.0_71 OpenJDK Runtime Environment (rhel-2.5.3.1.el7_0-x86_64 u71-b14) OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode) Here is the result: [david@localhost ~]$ hadoop fs -ls /usr/local/hadoop/bin/hadoop: line 133: /usr/bin/java/bin/java: Not a directory /usr/local/hadoop/bin/hadoop: line 133: exec: /usr/bin/java/bin/java: cannot execute: Not a directory David Novogrodsky
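For what it's worth, Hadoop expects JAVA_HOME to point at the JDK/JRE install directory (the one that contains bin/java), not at the java binary itself, and the launcher scripts read it from hadoop-env.sh rather than from .bashrc. A sketch, where the exact path is an assumption and should be whatever your yum install actually created:

# etc/hadoop/hadoop-env.sh
# JAVA_HOME must be the directory that contains bin/java
export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64

# or derive it from whichever java is on the PATH (resolves the alternatives symlink)
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))

The "/usr/lib/jvm/.../bin/java: No such file or directory" message means that exact path does not exist on the box; a quick check is ls $JAVA_HOME/bin/java after setting the variable.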
Re: Spark vs Tez
AMPLab (the Berkeley lab where Spark was created) did some benchmarks: https://amplab.cs.berkeley.edu/benchmark/ On Fri, Oct 17, 2014 at 11:06 AM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: Does anybody have any performance figures on how Spark stacks up against Tez? If you don’t have figures, does anybody have an opinion? Spark seems so popular but I’m not really seeing why. B.
Re: Spark vs Tez
There is going to be a Spark engine for Hive (in addition to mr and tez). The Spark API is available for Java and Python as well. The Tez engine is available now and it's quite stable. As for speed: for complex queries it shows a 10x-20x improvement over the mr engine, e.g. one of my queries runs 30 min using mr (about 100 mr jobs); if I switch to tez it's done in 100 sec. I'm using HDP-2.1.5 (hive-0.13.1, tez 0.4.1) On Fri, Oct 17, 2014 at 11:23 AM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: It was my understanding that Spark is faster batch processing. Tez is the new execution engine that replaces MapReduce and is also supposed to speed up batch processing. Is that not correct? B. *From:* Shahab Yunus shahab.yu...@gmail.com *Sent:* Friday, October 17, 2014 1:12 PM *To:* user@hadoop.apache.org *Subject:* Re: Spark vs Tez What aspects of Tez and Spark are you comparing? They have different purposes and thus are not directly comparable, as far as I understand. Regards, Shahab On Fri, Oct 17, 2014 at 2:06 PM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: Does anybody have any performance figures on how Spark stacks up against Tez? If you don’t have figures, does anybody have an opinion? Spark seems so popular but I’m not really seeing why. B.
Re: No space when running a hadoop job
It can read/write in parallel to all drives; more HDDs, more IO speed. On Sep 27, 2014 7:28 AM, Susheel Kumar Gadalay skgada...@gmail.com wrote: Correct me if I am wrong. Adding multiple directories will not balance the file distribution across these locations. Hadoop will exhaust the first directory and then start using the next, next... How can I tell Hadoop to evenly balance across these directories? On 9/26/14, Matt Narrell matt.narr...@gmail.com wrote: You can add a comma separated list of paths to the “dfs.datanode.data.dir” property in your hdfs-site.xml mn On Sep 26, 2014, at 8:37 AM, Abdul Navaz navaz@gmail.com wrote: Hi, I am facing a space issue when I save files into HDFS and/or run a map reduce job.
root@nn:~# df -h
Filesystem                                        Size  Used  Avail  Use%  Mounted on
/dev/xvda2                                        5.9G  5.9G      0  100%  /
udev                                               98M  4.0K    98M    1%  /dev
tmpfs                                              48M  192K    48M    1%  /run
none                                              5.0M     0   5.0M    0%  /run/lock
none                                              120M     0   120M    0%  /run/shm
overflow                                          1.0M  4.0K  1020K    1%  /tmp
/dev/xvda4                                        7.9G  147M   7.4G    2%  /mnt
172.17.253.254:/q/groups/ch-geni-net/Hadoop-NET   198G  108G    75G   59%  /groups/ch-geni-net/Hadoop-NET
172.17.253.254:/q/proj/ch-geni-net                198G  108G    75G   59%  /proj/ch-geni-net
root@nn:~#
I can see there is no space left on /dev/xvda2. How can I make hadoop see the newly mounted /dev/xvda4? Or do I need to move the files manually from /dev/xvda2 to xvda4? Thanks Regards, Abdul Navaz Research Assistant University of Houston Main Campus, Houston TX Ph: 281-685-0388
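To make the DataNode actually use the new mount, list both directories in dfs.datanode.data.dir; a sketch, where the concrete paths are placeholders for wherever the old data dir lives and a directory you create on the /mnt disk:

<!-- hdfs-site.xml on the datanode -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/var/hadoop/dfs/data,/mnt/hadoop/dfs/data</value>
</property>
<!-- optional: prefer volumes with more free space instead of pure round-robin -->
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>

New blocks are spread across the listed directories (round-robin by default), but existing blocks are not rebalanced automatically, so the full disk stays full until data is rewritten.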
Re: Tez and MapReduce
Yes. E.g. in Hive, to switch engines: set hive.execution.engine=mr; or set hive.execution.engine=tez; Tez is faster, especially on complex queries. On Aug 31, 2014 10:33 PM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: Can Tez and MapReduce live together and get along in the same cluster? B.
Re: How to serialize very large object in Hadoop Writable?
Max array size is max integer. So, byte array can not be bigger than 2GB On Aug 22, 2014 1:41 PM, Yuriy yuriythe...@gmail.com wrote: Hadoop Writable interface relies on public void write(DataOutput out) method. It looks like behind DataOutput interface, Hadoop uses DataOutputStream, which uses a simple array under the cover. When I try to write a lot of data in DataOutput in my reducer, I get: Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:3230) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.DataOutputStream.write(DataOutputStream.java:107) at java.io.FilterOutputStream.write(FilterOutputStream.java:97) Looks like the system is unable to allocate the continuous array of the requested size. Apparently, increasing the heap size available to the reducer does not help - it is already at 84GB (-Xmx84G) If I cannot reduce the size of the object that I need to serialize (as the reducer constructs this object by combining the object data), what should I try to work around this problem? Thanks, Yuriy
Re: How to serialize very large object in Hadoop Writable?
Usually Hadoop MapReduce deals with row-based data (ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>). If you need to write a lot of data to an HDFS file you can get an OutputStream to the HDFS file and write the bytes yourself. On Fri, Aug 22, 2014 at 3:30 PM, Yuriy yuriythe...@gmail.com wrote: Thank you, Alexander. That, at least, explains the problem. And what should be the workaround if the combined set of data is larger than 2 GB? On Fri, Aug 22, 2014 at 1:50 PM, Alexander Pivovarov apivova...@gmail.com wrote: Max array size is max integer. So, byte array can not be bigger than 2GB On Aug 22, 2014 1:41 PM, Yuriy yuriythe...@gmail.com wrote: Hadoop Writable interface relies on public void write(DataOutput out) method. It looks like behind DataOutput interface, Hadoop uses DataOutputStream, which uses a simple array under the cover. When I try to write a lot of data in DataOutput in my reducer, I get: Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:3230) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.DataOutputStream.write(DataOutputStream.java:107) at java.io.FilterOutputStream.write(FilterOutputStream.java:97) Looks like the system is unable to allocate the continuous array of the requested size. Apparently, increasing the heap size available to the reducer does not help - it is already at 84GB (-Xmx84G) If I cannot reduce the size of the object that I need to serialize (as the reducer constructs this object by combining the object data), what should I try to work around this problem? Thanks, Yuriy
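A sketch of that workaround: stream the reducer's output straight into an HDFS file instead of building one giant byte array first. The output path, value type and class names below are made up for illustration:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class StreamingReducer extends Reducer<Text, BytesWritable, NullWritable, NullWritable> {
  private FSDataOutputStream out;

  @Override
  protected void setup(Context context) throws IOException {
    FileSystem fs = FileSystem.get(context.getConfiguration());
    // one side file per reduce task; the path is just an example
    Path path = new Path("/tmp/large-output/part-" + context.getTaskAttemptID().getTaskID().getId());
    out = fs.create(path, true);
  }

  @Override
  protected void reduce(Text key, Iterable<BytesWritable> values, Context context) throws IOException {
    // write each chunk as it arrives instead of concatenating everything into one byte[]
    for (BytesWritable chunk : values) {
      out.write(chunk.getBytes(), 0, chunk.getLength());
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    out.close();
  }
}

The trade-off is that the file is written outside the normal OutputFormat/committer path, so speculative execution should be disabled or the file name made attempt-specific to avoid two attempts writing the same path.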
Re: Setting Up First Hadoop / Yarn Cluster
Probably permission issue. On Thu, Jul 31, 2014 at 11:32 AM, Houston King houston.k...@gmail.com wrote: Hey Everyone, I'm a noob working to setup my first 13 node Hadoop 2.4.0 cluster, and I've run into some problems that I'm having a heck of a time debugging. I've been following the guide posted at http://www.implementhadoop.com/install-hadoop-2-4-0-multi-node-cluster/ to setup the cluster. I've gotten through the guide, but, when I attempt to run either the wordcount, pi, or randomwriter examples most / all my tasks end up failing: 14/07/31 12:23:14 INFO mapreduce.Job: map 0% reduce 0% 14/07/31 12:23:28 INFO mapreduce.Job: Task Id : attempt_1406829336833_0002_m_00_0, Status : FAILED 14/07/31 12:23:42 INFO mapreduce.Job: Task Id : attempt_1406829336833_0002_m_00_1, Status : FAILED 14/07/31 12:23:56 INFO mapreduce.Job: Task Id : attempt_1406829336833_0002_m_00_2, Status : FAILED 14/07/31 12:24:12 INFO mapreduce.Job: map 100% reduce 100% 14/07/31 12:24:13 INFO mapreduce.Job: Job job_1406829336833_0002 failed with state FAILED due to: Task failed task_1406829336833_0002_m_00 I've been trying to figure out if I have a configuration problem or where in the logfiles the problem is described, but without much luck. At this point, I'm looking for any help I can get to get this cluster going. I appreciate any and all suggestions! Thanks ~Houston King
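The task attempt logs usually say exactly why the containers died; a couple of commands worth trying, where the application id is whatever the job client printed (here it would be application_1406829336833_0002):

# aggregated logs for the failed application (needs yarn.log-aggregation-enable=true)
yarn logs -applicationId application_1406829336833_0002

# otherwise look at the NodeManager's local container logs on the worker that ran the attempt,
# by default under the Hadoop log directory, e.g.
ls $HADOOP_HOME/logs/userlogs/application_1406829336833_0002/

Permission problems typically show up there as AccessControlException or "Permission denied" when the container tries to create its working or staging directories.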
Re: doubt
It's enough; Hadoop daemons use only 1 GB of heap each by default. On Sat, Jan 18, 2014 at 10:11 PM, sri harsha rsharsh...@gmail.com wrote: Hi, I want to install a 4 node cluster on 64-bit Linux. Is 4 GB RAM and a 500 GB HDD enough for this, or do I need to expand? Please suggest about my query. thanx -- amiable harsha
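The per-daemon heap is controlled in hadoop-env.sh if you ever need to trim or grow it; a sketch, with placeholder values:

# etc/hadoop/hadoop-env.sh
# heap for HDFS daemons (NameNode/DataNode), in MB; defaults to 1000
export HADOOP_HEAPSIZE=1024

# etc/hadoop/yarn-env.sh (Hadoop 2.x)
# heap for YARN daemons (ResourceManager/NodeManager)
export YARN_HEAPSIZE=1024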
Re: HDFS disk space requirement
Finish elementary school first (plus and minus operations at least). The arithmetic: 115 GB at replication 5 needs 575 GB of raw HDFS space, and even at replication 3 it needs 345 GB, both far more than the 130 GB you have. On Thu, Jan 10, 2013 at 7:23 PM, Panshul Whisper ouchwhis...@gmail.com wrote: Thank you for the response. Actually it is not a single file; I have JSON files that amount to 115 GB, and these JSON files need to be processed and loaded into HBase tables on the same cluster for later processing. Not considering the disk space required for the HBase storage, if I reduce the replication to 3, how much more HDFS space will I require? Thank you, On Fri, Jan 11, 2013 at 4:16 AM, Ravi Mutyala r...@hortonworks.com wrote: If the file is a txt file, you could get a good compression ratio. Change the replication to 3 and the file will fit. But not sure what your use case is and what you want to achieve by putting this data there. Any transformation on this data and you would need more space to save the transformed data. If you have 5 nodes and they are not virtual machines, you should consider adding more hard disks to your cluster. On Thu, Jan 10, 2013 at 9:02 PM, Panshul Whisper ouchwhis...@gmail.com wrote: Hello, I have a hadoop cluster of 5 nodes with a total available HDFS space of 130 GB with replication set to 5. I have a file of 115 GB, which needs to be copied to HDFS and processed. Do I need any more HDFS space to perform all processing without running into problems, or is this space sufficient? -- Regards, Ouch Whisper 010101010101 -- Regards, Ouch Whisper 010101010101
Re: Which hardware to choose
Not sure; the following options are available for that box: Integrated ICH10R on the motherboard, LSI® 6Gb SAS2008 daughtercard, Dell PERC H200, Dell PERC H700, LSI MegaRAID® SAS 9260-8i. http://www.dell.com/us/enterprise/p/poweredge-c2100/pd On Tue, Oct 2, 2012 at 10:59 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Great, thank you for such detailed information. By the way, what type of disk controller do you use? Thanks Oleg. On Tue, Oct 2, 2012 at 6:34 AM, Alexander Pivovarov apivova...@gmail.com wrote: Hi Oleg, Cloudera and Dell set up the following cluster for my company. The company receives 1.5 TB of raw data per day. 38 data nodes + 2 Name Nodes Data Node: Dell PowerEdge C2100 series 2 x XEON x5670 48 GB RAM ECC (12x4GB 1333MHz) 12 x 2 TB 7200 RPM SATA HDD (with hot swap) JBOD Intel Gigabit ET Dual port PCIe x4 Redundant Power Supply Hadoop CDH3 max map tasks 24 max reduce tasks 8 Name Node and Secondary Name Node are similar but with 96GB RAM (not sure why) and 6 x 600GB 15K RPM Serial SCSI in RAID10. Another config is here, page 298 http://books.google.com/books?id=Wu_xeGdU4G8Cpg=PA298lpg=PA298dq=hadoop+jbodsource=blots=i7xVQBPb_wsig=8mhq-MtpkRcTiRB1ioKciMxIasghl=ensa=Xei=AGtqUMK6D8T10gHD4ICQAQved=0CEMQ6AEwAg#v=onepageq=hadoop%20jbodf=false You probably need just 1 computer with 10 x 2 TB SATA HDD. On Mon, Oct 1, 2012 at 6:02 PM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi, We are at a very early stage of our hadoop project and want to do a POC. We have ~ 5-6 terabytes of raw data and we are going to execute some aggregations. We plan to use 8 - 10 machines. Questions: 1) Which hardware should we use: a) How many discs, and which discs are better to use? b) How much RAM? c) How many CPUs? 2) Please share best practices and tips / tricks related to utilising hardware for hadoop projects. Thanks in advance Oleg.
Re: Which hardware to choose
All configs are per node. No HBase; only Hive and Pig are installed. On Tue, Oct 2, 2012 at 9:40 PM, Michael Segel michael_se...@hotmail.com wrote: I think he's saying that it's 24 maps and 8 reducers per node, and at 48GB that could be too many mappers. Especially if they want to run HBase. On Oct 2, 2012, at 8:14 PM, hadoopman hadoop...@gmail.com wrote: Only 24 map and 8 reduce tasks for 38 data nodes? Are you sure that's right? Sounds VERY low for a cluster that size. We have only 10 C2100's and are running I believe 140 map and 70 reduce slots so far with pretty decent performance. On 10/02/2012 12:55 PM, Alexander Pivovarov wrote: 38 data nodes + 2 Name Nodes Data Node: Dell PowerEdge C2100 series 2 x XEON x5670 48 GB RAM ECC (12x4GB 1333MHz) 12 x 2 TB 7200 RPM SATA HDD (with hot swap) JBOD Intel Gigabit ET Dual port PCIe x4 Redundant Power Supply Hadoop CDH3 max map tasks 24 max reduce tasks 8
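For anyone reproducing this on CDH3 / MRv1, the per-node slot counts live in mapred-site.xml on each TaskTracker; a sketch using the numbers discussed above:

<!-- mapred-site.xml on each worker: slots are per TaskTracker, not per cluster -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>24</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
</property>

Roughly, map slots plus reduce slots should not exceed what the cores and RAM of the box can carry; here 24 + 8 = 32 task slots on a 12-core (24-thread), 48 GB node.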
Re: Which hardware to choose
Hi Oleg, Cloudera and Dell set up the following cluster for my company. The company receives 1.5 TB of raw data per day: 38 data nodes + 2 Name Nodes.
Data Node: Dell PowerEdge C2100 series
2 x XEON x5670
48 GB RAM ECC (12x4GB 1333MHz)
12 x 2 TB 7200 RPM SATA HDD (with hot swap), JBOD
Intel Gigabit ET Dual port PCIe x4
Redundant Power Supply
Hadoop CDH3, max map tasks 24, max reduce tasks 8
Name Node and Secondary Name Node are similar but with 96GB RAM (not sure why) and 6 x 600GB 15K RPM Serial SCSI in RAID10.
Another config is here, page 298: http://books.google.com/books?id=Wu_xeGdU4G8Cpg=PA298lpg=PA298dq=hadoop+jbodsource=blots=i7xVQBPb_wsig=8mhq-MtpkRcTiRB1ioKciMxIasghl=ensa=Xei=AGtqUMK6D8T10gHD4ICQAQved=0CEMQ6AEwAg#v=onepageq=hadoop%20jbodf=false
You probably need just 1 computer with 10 x 2 TB SATA HDD. On Mon, Oct 1, 2012 at 6:02 PM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi, We are at a very early stage of our hadoop project and want to do a POC. We have ~ 5-6 terabytes of raw data and we are going to execute some aggregations. We plan to use 8 - 10 machines. Questions: 1) Which hardware should we use: a) How many discs, and which discs are better to use? b) How much RAM? c) How many CPUs? 2) Please share best practices and tips / tricks related to utilising hardware for hadoop projects. Thanks in advance Oleg.
Re: More cores Vs More Nodes ?
more nodes means more IO on read on mapper step If you use combiners you might need to send only small amount of data over network to reducers Alexander On Tue, Dec 13, 2011 at 12:45 PM, real great.. greatness.hardn...@gmail.com wrote: more cores might help in hadoop environments as there would be more data locality. your thoughts? On Tue, Dec 13, 2011 at 11:11 PM, Brad Sarsfield b...@bing.com wrote: Praveenesh, Your question is not naïve; in fact, optimal hardware design can ultimately be a very difficult question to answer on what would be better. If you made me pick one without much information I'd go for more machines. But... It all depends; and there is no right answer :) More machines +May run your workload faster +Will give you a higher degree of reliability protection from node / hardware / hard drive failure. +More aggregate IO capabilities - capex / opex may be higher than allocating more cores More cores +May run your workload faster +More cores may allow for more tasks to run on the same machine +More cores/tasks may reduce network contention and increase increasing task to task data flow performance. Notice May run your workload faster is in both; as it can be very workload dependant. My Experience: I did a recent experiment and found that given the same number of cores (64) with the exact same network / machine configuration; A: I had 8 machines with 8 cores B: I had 28 machines with 2 cores (and 1x8 core head node) B was able to outperform A by 2x using teragen and terasort. These machines were running in a virtualized environment; where some of the IO capabilities behind the scenes were being regulated to 400Mbps per node when running in the 2 core configuration vs 1Gbps on the 8 core. So I would expect the non-throttled scenario to work even better. ~Brad -Original Message- From: praveenesh kumar [mailto:praveen...@gmail.com] Sent: Monday, December 12, 2011 8:51 PM To: common-user@hadoop.apache.org Subject: More cores Vs More Nodes ? Hey Guys, So I have a very naive question in my mind regarding Hadoop cluster nodes ? more cores or more nodes - Shall I spend money on going from 2-4 core machines, or spend money on buying more nodes less core eg. say 2 machines of 2 cores for example? Thanks, Praveenesh -- Regards, R.V.
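To make the combiner point concrete: a combiner is just a reducer class run on each mapper's output before the shuffle. A driver excerpt using the newer mapreduce API; the class names are the stock WordCount example's and only work for associative aggregations like sums:

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "wordcount");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);   // pre-aggregates per mapper, shrinking shuffle traffic
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

How much this helps the cores-vs-nodes trade-off depends on how well the map output collapses; for word-count-style aggregations it can cut shuffle volume by orders of magnitude.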
What's the diff btw setOutputKeyComparatorClass and setOutputValueGroupingComparator?
I tried to use one or the other for secondary sort -- both options work fine -- I get a combined sorted result in the reduce() iterator. Also I noticed that if I set both of them at the same time then KeyComparatorClass.compare(O1, O2) is never called; hadoop calls only ValueGroupingComparator.compare(). I run my tests on a single node installation. Please help me understand the diff btw these two comparators.
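A sketch of how the two are usually wired up for secondary sort with the old mapred API; the comparator and partitioner class names here are hypothetical:

// key is a composite of (naturalKey, secondaryField)
JobConf conf = new JobConf(MyJob.class);
// sort comparator: orders keys within each partition by naturalKey, then secondaryField,
// so values arrive at the reducer already sorted
conf.setOutputKeyComparatorClass(CompositeKeySortComparator.class);
// grouping comparator: compares only the naturalKey, so all records that share it
// are fed to a single reduce() call through one iterator
conf.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class);
// partition on the naturalKey too, so all its records land on the same reducer
conf.setPartitionerClass(NaturalKeyPartitioner.class);

So the sort comparator controls the order in which records are fed to the reducer, and the grouping comparator controls where one reduce() call ends and the next begins; if both classes end up doing the same full-key comparison, the visible output is identical, which would explain why either option alone looked the same in a small single-node test.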