Re: fs.s3a.endpoint not working
http://www.jets3t.org/toolkit/configuration.html On Jan 14, 2016 10:56 AM, "Alexander Pivovarov" <apivova...@gmail.com> wrote: > Add jets3t.properties file with s3service.s3-endpoint= to > /etc/hadoop/conf folder > > The folder with the file should be in HADOOP_CLASSPATH > > JetS3t library which is used by hadoop is looking for this file. > On Dec 22, 2015 12:39 PM, "Phillips, Caleb" <caleb.phill...@nrel.gov> > wrote: > >> Hi All, >> >> New to this list. Looking for a bit of help: >> >> I'm having trouble connecting Hadoop to a S3-compatable (non AWS) object >> store. >> >> This issue was discussed, but left unresolved, in this thread: >> >> >> https://mail-archives.apache.org/mod_mbox/spark-user/201507.mbox/%3cca+0w_au5es_flugzmgwkkga3jya1asi3u+isjcuymfntvnk...@mail.gmail.com%3E >> >> And here, on Cloudera's forums (the second post is mine): >> >> >> https://community.cloudera.com/t5/Data-Ingestion-Integration/fs-s3a-endpoint-ignored-in-hdfs-site-xml/m-p/33694#M1180 >> >> I'm running Hadoop 2.6.3 with Java 1.8 (65) on a Linux host. Using >> Hadoop, I'm able to connect to S3 on AWS, and e.g., list/put/get files. >> >> However, when I point the fs.s3a.endpoint configuration directive at my >> non-AWS S3-Compatable object storage, it appears to still point at (and >> authenticate against) AWS. >> >> I've checked and double-checked my credentials and configuration using >> both Python's boto library and the s3cmd tool, both of which connect to >> this non-AWS data store just fine. >> >> Any help would be much appreciated. Thanks! >> >> -- >> Caleb Phillips, Ph.D. >> Data Scientist | Computational Science Center >> >> National Renewable Energy Laboratory (NREL) >> 15013 Denver West Parkway | Golden, CO 80401 >> 303-275-4297 | caleb.phill...@nrel.gov >> >> - >> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org >> For additional commands, e-mail: user-h...@hadoop.apache.org >> >>
Re: fs.s3a.endpoint not working
Add jets3t.properties file with s3service.s3-endpoint= to /etc/hadoop/conf folder The folder with the file should be in HADOOP_CLASSPATH JetS3t library which is used by hadoop is looking for this file. On Dec 22, 2015 12:39 PM, "Phillips, Caleb"wrote: > Hi All, > > New to this list. Looking for a bit of help: > > I'm having trouble connecting Hadoop to a S3-compatable (non AWS) object > store. > > This issue was discussed, but left unresolved, in this thread: > > > https://mail-archives.apache.org/mod_mbox/spark-user/201507.mbox/%3cca+0w_au5es_flugzmgwkkga3jya1asi3u+isjcuymfntvnk...@mail.gmail.com%3E > > And here, on Cloudera's forums (the second post is mine): > > > https://community.cloudera.com/t5/Data-Ingestion-Integration/fs-s3a-endpoint-ignored-in-hdfs-site-xml/m-p/33694#M1180 > > I'm running Hadoop 2.6.3 with Java 1.8 (65) on a Linux host. Using Hadoop, > I'm able to connect to S3 on AWS, and e.g., list/put/get files. > > However, when I point the fs.s3a.endpoint configuration directive at my > non-AWS S3-Compatable object storage, it appears to still point at (and > authenticate against) AWS. > > I've checked and double-checked my credentials and configuration using > both Python's boto library and the s3cmd tool, both of which connect to > this non-AWS data store just fine. > > Any help would be much appreciated. Thanks! > > -- > Caleb Phillips, Ph.D. > Data Scientist | Computational Science Center > > National Renewable Energy Laboratory (NREL) > 15013 Denver West Parkway | Golden, CO 80401 > 303-275-4297 | caleb.phill...@nrel.gov > > - > To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org > For additional commands, e-mail: user-h...@hadoop.apache.org > >
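For reference, a hedged sketch of the two configuration pieces discussed in this thread, with a placeholder endpoint host. The JetS3t properties apply to the older s3/s3n connectors; the fs.s3a.endpoint property in core-site.xml is what the s3a connector reads. The property names are taken from the jets3t.org configuration page and hadoop-aws; the exact port and HTTPS settings depend on the object store.

# /etc/hadoop/conf/jets3t.properties  (objectstore.example.com is a placeholder)
s3service.s3-endpoint=objectstore.example.com
s3service.https-only=false

<!-- core-site.xml entry read by the s3a connector (same placeholder host) -->
<property>
  <name>fs.s3a.endpoint</name>
  <value>objectstore.example.com</value>
</property>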
What settings do I need to access a remote HA cluster?
Hi everyone, I have 2 HA clusters, mydev and myqa. I want to be able to access hdfs://myqa/ paths from mydev cluster boxes. What settings should I add to mydev's hdfs-site.xml so that Hadoop can resolve the myqa HA alias to the active NN? Thank you Alex
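For reference, a minimal hdfs-site.xml sketch of the client-side settings this usually involves on the mydev side: the remote nameservice, its NameNodes, and a failover proxy provider. The NameNode hostnames and port are placeholders; the existing mydev entries stay as they are.

<property>
  <name>dfs.nameservices</name>
  <value>mydev,myqa</value>
</property>
<property>
  <name>dfs.ha.namenodes.myqa</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.myqa.nn1</name>
  <value>qa-nn1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.myqa.nn2</name>
  <value>qa-nn2.example.com:8020</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.myqa</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>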
Re: How do I integrate Hadoop app development with Eclipse IDE?
1. Create a pom.xml for your project.
2. Add the Hadoop dependencies you need.
3. Run: mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true
4. Import the existing Java project into Eclipse.
On Wed, May 20, 2015 at 5:31 PM, Caesar Samsi caesarsa...@mac.com wrote: Hello, I’m embarking on my first tutorial and would like to have tooltip help as I hover my mouse pointer over Hadoop classes. I’ve found the Hadoop docs and Javadoc URL and configured them, but the tooltips still don’t show up. Thank you, Caesar.
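A minimal pom.xml sketch for steps 1-2; the groupId/artifactId and the Hadoop version are placeholders, and the single hadoop-client dependency pulls in the common client-side Hadoop APIs.

<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>hadoop-tutorial</artifactId>
  <version>1.0-SNAPSHOT</version>
  <dependencies>
    <!-- HDFS/MapReduce client APIs; pick the version matching your cluster -->
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.6.0</version>
    </dependency>
  </dependencies>
</project>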
Re: query uses WITH blocks and throws exception if run as Oozie hive action (hive-0.13.1)
I also noticed another error message in logs 10848 [main] ERROR org.apache.hadoop.hive.ql.exec.tez.TezJobMonitor - Status: Failed 10849 [main] ERROR org.apache.hadoop.hive.ql.exec.tez.TezJobMonitor - Vertex failed, vertexName=Map 32, vertexId=vertex_1431616132488_6430_1_24, diagnostics=[Vertex Input: dual initializer failed., org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: org.apache.commons.logging.impl.SLF4JLocationAwareLog Serialization trace: LOG (org.apache.hadoop.hive.ql.exec.UDTFOperator) childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator) childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator) aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork)] one of the WITH blocks had explode() UDTF I replaced it with select ... union all select ... union all select ... and query is working fine now. Do you know anything about UDTF and Kryo issues fixed after 0.13.1? On Fri, May 15, 2015 at 3:20 PM, Alexander Pivovarov apivova...@gmail.com wrote: Looks like it was fixed in hive-0.14 https://issues.apache.org/jira/browse/HIVE-7079 On Fri, May 15, 2015 at 2:26 PM, Alexander Pivovarov apivova...@gmail.com wrote: Hi Everyone I'm using hive-0.13.1 (HDP-2.1.5) and getting the following stacktrace if run my query (which has WITH block) via Oozie. (BTW, the query works fine in CLI) I can't put exact query but the structure is similar to create table my_consumer as with sacusaloan as (select distinct e,f,g from E) select A.a, A.b, A.c, if(sacusaloan.id is null, 0, 1) as sacusaloan_status from (select a,b,c from A) A left join sacusaloan on (...) 8799 [main] INFO hive.ql.parse.ParseDriver - Parse Completed 8799 [main] INFO org.apache.hadoop.hive.ql.log.PerfLogger - /PERFLOG method=parse start=1431723485500 end=1431723485602 duration=102 from=org.apache.hadoop.hive.ql.Driver 8799 [main] INFO org.apache.hadoop.hive.ql.log.PerfLogger - PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver 8834 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Starting Semantic Analysis 8837 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Creating table wk_qualified_outsource_loan_consumer position=13 8861 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Completed phase 1 of Semantic Analysis 8861 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Get metadata for source tables 8865 [main] ERROR hive.ql.metadata.Hive - NoSuchObjectException(message:default.sacusaloan table not found) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29338) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29306) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result.read(ThriftHiveMetastore.java:29237) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:1036) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:1022) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:997) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at 
java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy18.getTable(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:976) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:918) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1263) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1232) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:9252) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:427) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:323) at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:980) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1045) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:916) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:906) at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268
Re: query uses WITH blocks and throws exception if run as Oozie hive action (hive-0.13.1)
Looks like I found it https://issues.apache.org/jira/browse/HIVE-9409 public class UDTFOperator ... - protected final Log LOG = LogFactory.getLog(this.getClass().getName()); + protected static final Log LOG = LogFactory.getLog(UDTFOperator.class.getName()); On Fri, May 15, 2015 at 4:17 PM, Alexander Pivovarov apivova...@gmail.com wrote: I also noticed another error message in logs 10848 [main] ERROR org.apache.hadoop.hive.ql.exec.tez.TezJobMonitor - Status: Failed 10849 [main] ERROR org.apache.hadoop.hive.ql.exec.tez.TezJobMonitor - Vertex failed, vertexName=Map 32, vertexId=vertex_1431616132488_6430_1_24, diagnostics=[Vertex Input: dual initializer failed., org.apache.hive.com.esotericsoftware.kryo.KryoException: Unable to find class: org.apache.commons.logging.impl.SLF4JLocationAwareLog Serialization trace: LOG (org.apache.hadoop.hive.ql.exec.UDTFOperator) childOperators (org.apache.hadoop.hive.ql.exec.SelectOperator) childOperators (org.apache.hadoop.hive.ql.exec.TableScanOperator) aliasToWork (org.apache.hadoop.hive.ql.plan.MapWork)] one of the WITH blocks had explode() UDTF I replaced it with select ... union all select ... union all select ... and query is working fine now. Do you know anything about UDTF and Kryo issues fixed after 0.13.1? On Fri, May 15, 2015 at 3:20 PM, Alexander Pivovarov apivova...@gmail.com wrote: Looks like it was fixed in hive-0.14 https://issues.apache.org/jira/browse/HIVE-7079 On Fri, May 15, 2015 at 2:26 PM, Alexander Pivovarov apivova...@gmail.com wrote: Hi Everyone I'm using hive-0.13.1 (HDP-2.1.5) and getting the following stacktrace if run my query (which has WITH block) via Oozie. (BTW, the query works fine in CLI) I can't put exact query but the structure is similar to create table my_consumer as with sacusaloan as (select distinct e,f,g from E) select A.a, A.b, A.c, if(sacusaloan.id is null, 0, 1) as sacusaloan_status from (select a,b,c from A) A left join sacusaloan on (...) 
8799 [main] INFO hive.ql.parse.ParseDriver - Parse Completed 8799 [main] INFO org.apache.hadoop.hive.ql.log.PerfLogger - /PERFLOG method=parse start=1431723485500 end=1431723485602 duration=102 from=org.apache.hadoop.hive.ql.Driver 8799 [main] INFO org.apache.hadoop.hive.ql.log.PerfLogger - PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver 8834 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Starting Semantic Analysis 8837 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Creating table wk_qualified_outsource_loan_consumer position=13 8861 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Completed phase 1 of Semantic Analysis 8861 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Get metadata for source tables 8865 [main] ERROR hive.ql.metadata.Hive - NoSuchObjectException(message:default.sacusaloan table not found) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29338) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29306) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result.read(ThriftHiveMetastore.java:29237) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:1036) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:1022) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:997) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy18.getTable(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:976) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:918) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1263) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1232) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:9252) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:427) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:323
Re: query uses WITH blocks and throws exception if run as Oozie hive action (hive-0.13.1)
Looks like it was fixed in hive-0.14 https://issues.apache.org/jira/browse/HIVE-7079 On Fri, May 15, 2015 at 2:26 PM, Alexander Pivovarov apivova...@gmail.com wrote: Hi Everyone I'm using hive-0.13.1 (HDP-2.1.5) and getting the following stacktrace if run my query (which has WITH block) via Oozie. (BTW, the query works fine in CLI) I can't put exact query but the structure is similar to create table my_consumer as with sacusaloan as (select distinct e,f,g from E) select A.a, A.b, A.c, if(sacusaloan.id is null, 0, 1) as sacusaloan_status from (select a,b,c from A) A left join sacusaloan on (...) 8799 [main] INFO hive.ql.parse.ParseDriver - Parse Completed 8799 [main] INFO org.apache.hadoop.hive.ql.log.PerfLogger - /PERFLOG method=parse start=1431723485500 end=1431723485602 duration=102 from=org.apache.hadoop.hive.ql.Driver 8799 [main] INFO org.apache.hadoop.hive.ql.log.PerfLogger - PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver 8834 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Starting Semantic Analysis 8837 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Creating table wk_qualified_outsource_loan_consumer position=13 8861 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Completed phase 1 of Semantic Analysis 8861 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Get metadata for source tables 8865 [main] ERROR hive.ql.metadata.Hive - NoSuchObjectException(message:default.sacusaloan table not found) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29338) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29306) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result.read(ThriftHiveMetastore.java:29237) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:1036) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:1022) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:997) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy18.getTable(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:976) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:918) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1263) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1232) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:9252) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:427) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:323) at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:980) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1045) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:916) 
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:906) at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268) at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:359) at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:456) at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:466) at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:749) at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625) at org.apache.oozie.action.hadoop.HiveMain.runHive(HiveMain.java:316) at org.apache.oozie.action.hadoop.HiveMain.run(HiveMain.java:277) at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:38) at org.apache.oozie.action.hadoop.HiveMain.main(HiveMain.java:66) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method
query uses WITH blocks and throws exception if run as Oozie hive action (hive-0.13.1)
Hi Everyone I'm using hive-0.13.1 (HDP-2.1.5) and getting the following stacktrace if run my query (which has WITH block) via Oozie. (BTW, the query works fine in CLI) I can't put exact query but the structure is similar to create table my_consumer as with sacusaloan as (select distinct e,f,g from E) select A.a, A.b, A.c, if(sacusaloan.id is null, 0, 1) as sacusaloan_status from (select a,b,c from A) A left join sacusaloan on (...) 8799 [main] INFO hive.ql.parse.ParseDriver - Parse Completed 8799 [main] INFO org.apache.hadoop.hive.ql.log.PerfLogger - /PERFLOG method=parse start=1431723485500 end=1431723485602 duration=102 from=org.apache.hadoop.hive.ql.Driver 8799 [main] INFO org.apache.hadoop.hive.ql.log.PerfLogger - PERFLOG method=semanticAnalyze from=org.apache.hadoop.hive.ql.Driver 8834 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Starting Semantic Analysis 8837 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Creating table wk_qualified_outsource_loan_consumer position=13 8861 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Completed phase 1 of Semantic Analysis 8861 [main] INFO org.apache.hadoop.hive.ql.parse.SemanticAnalyzer - Get metadata for source tables 8865 [main] ERROR hive.ql.metadata.Hive - NoSuchObjectException(message:default.sacusaloan table not found) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29338) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result$get_table_resultStandardScheme.read(ThriftHiveMetastore.java:29306) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$get_table_result.read(ThriftHiveMetastore.java:29237) at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:1036) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table(ThriftHiveMetastore.java:1022) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:997) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:89) at com.sun.proxy.$Proxy18.getTable(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:976) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:918) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1263) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1232) at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:9252) at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:327) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:427) at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:323) at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:980) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1045) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:916) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:906) at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:268) at 
org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:220) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:423) at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:359) at org.apache.hadoop.hive.cli.CliDriver.processReader(CliDriver.java:456) at org.apache.hadoop.hive.cli.CliDriver.processFile(CliDriver.java:466) at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:749) at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:686) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:625) at org.apache.oozie.action.hadoop.HiveMain.runHive(HiveMain.java:316) at org.apache.oozie.action.hadoop.HiveMain.run(HiveMain.java:277) at org.apache.oozie.action.hadoop.LauncherMain.run(LauncherMain.java:38) at org.apache.oozie.action.hadoop.HiveMain.main(HiveMain.java:66) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at
Re: how to load data
if your file is csv file then create table statement should specify CSVSerde - look at the examples under the links I sent you On Thu, Apr 30, 2015 at 10:23 PM, Kumar Jayapal kjayapa...@gmail.com wrote: Alex, I followed the same steps as mentioned in the site. Once I load data into table which is create below Created table CREATE TABLE raw (line STRING) PARTITIONED BY (FISCAL_YEAR smallint, FISCAL_PERIOD smallint) STORED AS TEXTFILE; and loaded it with data. LOAD DATA LOCAL INPATH '/tmp/weblogs/20090603-access.log.gz' INTO TABLE raw; when I say select * from raw it shows all null values. NULLNULLNULLNULLNULLNULLNULLNULL NULLNULLNULLNULLNULLNULLNULLNULL NULLNULLNULLNULLNULLNULLNULLNULL NULLNULLNULLNULLNULLNULLNULLNULL Why is not show showing the actual data in file. will it show once I load it to parque table? Please let me know if I am doing anything wrong. I appreciate your help. Thanks jay Thank you very much for you help Alex, On Wed, Apr 29, 2015 at 3:43 PM, Alexander Pivovarov apivova...@gmail.com wrote: 1. Create external textfile hive table pointing to /extract/DBCLOC and specify CSVSerde if using hive-0.14 and newer use this https://cwiki.apache.org/confluence/display/Hive/CSV+Serde if hive-0.13 and older use https://github.com/ogrodnek/csv-serde You do not even need to unzgip the file. hive automatically unzgip data on select. 2. run simple query to load data insert overwrite table orc_table select * from csv_table On Wed, Apr 29, 2015 at 3:26 PM, Kumar Jayapal kjayapa...@gmail.com wrote: Hello All, I have this table CREATE TABLE DBCLOC( BLwhse int COMMENT 'DECIMAL(5,0) Whse', BLsdat string COMMENT 'DATE Sales Date', BLreg_num smallint COMMENT 'DECIMAL(3,0) Reg#', BLtrn_num int COMMENT 'DECIMAL(5,0) Trn#', BLscnr string COMMENT 'CHAR(1) Scenario', BLareq string COMMENT 'CHAR(1) Act Requested', BLatak string COMMENT 'CHAR(1) Act Taken', BLmsgc string COMMENT 'CHAR(3) Msg Code') PARTITIONED BY (FSCAL_YEAR smallint, FSCAL_PERIOD smallint) STORED AS PARQUET; have to load from hdfs location /extract/DBCLOC/DBCL0301P.csv.gz to the table above Can any one tell me what is the most efficient way of doing it. Thanks Jay
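A hedged sketch of what the advice above looks like for the DBCLOC file, using the built-in OpenCSVSerde from the Hive 0.14+ link (on Hive 0.13 the equivalent class comes from the ogrodnek csv-serde jar). OpenCSVSerde reads every column as a string, so the columns are declared STRING here and are cast when loading the final Parquet table; the staging table name is made up.

-- step 1: external CSV-backed staging table over the existing .gz files
CREATE EXTERNAL TABLE dbcloc_csv (
  blwhse STRING, blsdat STRING, blreg_num STRING, bltrn_num STRING,
  blscnr STRING, blareq STRING, blatak STRING, blmsgc STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION '/extract/DBCLOC';

-- step 2: cast and load into the final Parquet table, e.g.
-- INSERT OVERWRITE TABLE dbcloc PARTITION (fscal_year=2015, fscal_period=4)
--   SELECT cast(blwhse AS int), blsdat, cast(blreg_num AS smallint), ... FROM dbcloc_csv;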
Re: how to load data
Follow the links I sent you already. On Apr 30, 2015 11:52 AM, Kumar Jayapal kjayapa...@gmail.com wrote: Hi Alex, How to create external textfile hive table pointing to /extract/DBCLOC and specify CSVSerde ? Thanks Jay On Wed, Apr 29, 2015 at 3:43 PM, Alexander Pivovarov apivova...@gmail.com wrote: 1. Create external textfile hive table pointing to /extract/DBCLOC and specify CSVSerde if using hive-0.14 and newer use this https://cwiki.apache.org/confluence/display/Hive/CSV+Serde if hive-0.13 and older use https://github.com/ogrodnek/csv-serde You do not even need to unzgip the file. hive automatically unzgip data on select. 2. run simple query to load data insert overwrite table orc_table select * from csv_table On Wed, Apr 29, 2015 at 3:26 PM, Kumar Jayapal kjayapa...@gmail.com wrote: Hello All, I have this table CREATE TABLE DBCLOC( BLwhse int COMMENT 'DECIMAL(5,0) Whse', BLsdat string COMMENT 'DATE Sales Date', BLreg_num smallint COMMENT 'DECIMAL(3,0) Reg#', BLtrn_num int COMMENT 'DECIMAL(5,0) Trn#', BLscnr string COMMENT 'CHAR(1) Scenario', BLareq string COMMENT 'CHAR(1) Act Requested', BLatak string COMMENT 'CHAR(1) Act Taken', BLmsgc string COMMENT 'CHAR(3) Msg Code') PARTITIONED BY (FSCAL_YEAR smallint, FSCAL_PERIOD smallint) STORED AS PARQUET; have to load from hdfs location /extract/DBCLOC/DBCL0301P.csv.gz to the table above Can any one tell me what is the most efficient way of doing it. Thanks Jay
Re: How to move a .gz file back from Hive to HDFS
Try to find the file in the HDFS trash. On Apr 30, 2015 2:14 PM, Kumar Jayapal kjayapa...@gmail.com wrote: Hi, I loaded one file with a .gz extension into a Hive table. The file was moved/deleted from HDFS, and when I execute a select command I get an error. Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2) How can I move the file back to HDFS? Thanks Jay
Re: How to move a .gz file back from Hive to HDFS
try desc formatted table_name; it shows you table location on hdfs On Thu, Apr 30, 2015 at 2:43 PM, Kumar Jayapal kjayapa...@gmail.com wrote: I did not find it in .Trash file is moved to hive table I want to move it back to hdfs. On Thu, Apr 30, 2015 at 2:20 PM, Alexander Pivovarov apivova...@gmail.com wrote: Try to find the file in hdfs trash On Apr 30, 2015 2:14 PM, Kumar Jayapal kjayapa...@gmail.com wrote: Hi, I loaded one file to hive table it is in .gz extension. file is moved/deleted from hdfs. when I execute select command I get an error. Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask (state=08S01,code=2) how can I move back the file to HDFS. how can I do it. Thanks Jay
Re: how to load data
1. Create external textfile hive table pointing to /extract/DBCLOC and specify CSVSerde if using hive-0.14 and newer use this https://cwiki.apache.org/confluence/display/Hive/CSV+Serde if hive-0.13 and older use https://github.com/ogrodnek/csv-serde You do not even need to unzgip the file. hive automatically unzgip data on select. 2. run simple query to load data insert overwrite table orc_table select * from csv_table On Wed, Apr 29, 2015 at 3:26 PM, Kumar Jayapal kjayapa...@gmail.com wrote: Hello All, I have this table CREATE TABLE DBCLOC( BLwhse int COMMENT 'DECIMAL(5,0) Whse', BLsdat string COMMENT 'DATE Sales Date', BLreg_num smallint COMMENT 'DECIMAL(3,0) Reg#', BLtrn_num int COMMENT 'DECIMAL(5,0) Trn#', BLscnr string COMMENT 'CHAR(1) Scenario', BLareq string COMMENT 'CHAR(1) Act Requested', BLatak string COMMENT 'CHAR(1) Act Taken', BLmsgc string COMMENT 'CHAR(3) Msg Code') PARTITIONED BY (FSCAL_YEAR smallint, FSCAL_PERIOD smallint) STORED AS PARQUET; have to load from hdfs location /extract/DBCLOC/DBCL0301P.csv.gz to the table above Can any one tell me what is the most efficient way of doing it. Thanks Jay
Re: sorting in hive -- general
1. sort by - key are distributed according to MR partitioner (controlled by distributed by in hive) Lets assume hash partitioned uses the same column as sort by and uses x mod 16 formula to get reducer id reduced 0 will have keys 0 16 32 reducer 1 will have keys 1 17 33 if you merge reducer 0 and reducer 1 output you will have 0 16 32 1 17 33 2. order by will use 1 reducer and hive will send all keys to reducer 0 So order by in hive works different from terasort. In case of terasort you can merge output files and get one file with globally sorted data. On Sun, Mar 8, 2015 at 7:55 AM, max scalf oracle.bl...@gmail.com wrote: Thank you Alexander. So is it fair to assume when sort by is used and multiple files are produced per reducer at the end of it all of then are put togeather/merged to get the results back? And can sort by be used without distributed by and expect same result as order by ? On Sat, Mar 7, 2015 at 7:05 PM, Alexander Pivovarov apivova...@gmail.com wrote: sort by query produces multiple independent files. order by - just one file usually sort by is used with distributed by. In older hive versions (0.7) they might be used to implement local sort within partition similar to RANK() OVER (PARTITION BY A ORDER BY B) On Sat, Mar 7, 2015 at 3:02 PM, max scalf oracle.bl...@gmail.com wrote: Hello all, I am a new to hadoop and hive in general and i am reading hadoop the definitive guide by Tom White and on page 504 for the hive chapter, Tom says below with regards to soritng *Sorting and Aggregating* *Sorting data in Hive can be achieved by using a standard ORDER BY clause. ORDER BY performs a parallel total sort of the input (like that described in “Total Sort” on page 261). When a globally sorted result is not required—and in many cases it isn’t—you can use Hive’s nonstandard extension, SORT BY, instead. SORT BY produces a sorted file per reducer.* My Questions is, what exactly does he mean by globally sorted result?, if the sort by operation produces a sorted file per reducer does that mean at the end of the sort all the reducer are put back together to give the correct results ?
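A small HiveQL illustration of the two cases described above (the table and column names are made up):

-- ORDER BY: one reducer, a single globally sorted output file
SELECT k, v FROM t ORDER BY k;

-- DISTRIBUTE BY + SORT BY: rows with the same key go to the same reducer,
-- and each reducer writes its own independently sorted file
SELECT k, v FROM t DISTRIBUTE BY k SORT BY k;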
Re: sorting in hive -- general
sort by query produces multiple independent files. order by - just one file usually sort by is used with distributed by. In older hive versions (0.7) they might be used to implement local sort within partition similar to RANK() OVER (PARTITION BY A ORDER BY B) On Sat, Mar 7, 2015 at 3:02 PM, max scalf oracle.bl...@gmail.com wrote: Hello all, I am a new to hadoop and hive in general and i am reading hadoop the definitive guide by Tom White and on page 504 for the hive chapter, Tom says below with regards to soritng *Sorting and Aggregating* *Sorting data in Hive can be achieved by using a standard ORDER BY clause. ORDER BY performs a parallel total sort of the input (like that described in “Total Sort” on page 261). When a globally sorted result is not required—and in many cases it isn’t—you can use Hive’s nonstandard extension, SORT BY, instead. SORT BY produces a sorted file per reducer.* My Questions is, what exactly does he mean by globally sorted result?, if the sort by operation produces a sorted file per reducer does that mean at the end of the sort all the reducer are put back together to give the correct results ?
Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?
what about DNS? if you have 2 computers (nn and dn) how nn knows dn ip? The script puts only this computer ip to /etc/hosts On Thu, Mar 5, 2015 at 6:39 PM, max scalf oracle.bl...@gmail.com wrote: Here is a easy way to go about assigning static name to your ec2 instance. When you get the launch an EC2-instance from aws console when you get to the point of selecting VPC, ip address screen there is a screen that says USER DATA...put the below in with appropriate host name(change CHANGE_HOST_NAME_HERE to whatever you want) and that should be able to get you static name. #!/bin/bash HOSTNAME_TAG=CHANGE_HOST_NAME_HERE cat /etc/sysconfig/network EOF NETWORKING=yes NETWORKING_IPV6=no HOSTNAME=${HOSTNAME_TAG} EOF IP=$(curl http://169.254.169.254/latest/meta-data/local-ipv4) echo ${IP} ${HOSTNAME_TAG}.localhost ${HOSTNAME_TAG} /etc/hosts echo ${HOSTNAME_TAG} /proc/sys/kernel/hostname service network restart Also note i was able to do this on couple of spot instance for cheap price, only thing is once you shut it down or someone outbids you, you loose that instance but its easy/cheap to play around with and i have used couple of m3.medium for my NN/SNN and couple of them for data nodes... On Thu, Mar 5, 2015 at 7:19 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: I dont know how you would do that to be honest. With EMR you have destinctions master core and task nodes. If you need to change configuration you just ssh into the EMR master node. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 02:11, Alexander Pivovarov wrote: What is the easiest way to assign names to aws ec2 computers? I guess computer need static hostname and dns name before it can be used in hadoop cluster. On Mar 5, 2015 4:36 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: When I started with EMR it was alot of testing and trial and error. HUE is already supported as something that can be installed from the AWS console. What I need to know is if you need this cluster on all the time or this is goign ot be what amazon call a transient cluster. Meaning you fire it up run the job and tear it back down. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 01:10, Krish Donald wrote: Thanks Jonathan, I will try to explore EMR option also. Can you please let me know the configuration which you have used it? Can you please recommend for me also? I would like to setup Hadoop cluster using cloudera manager and then would like to do below things: setup kerberos setup federation setup monitoring setup hadr backup and recovery authorization using sentry backup and recovery of individual componenets performamce tuning upgrade of cdh upgrade of CM Hue User Administration Spark Solr Thanks Krish On Thu, Mar 5, 2015 at 3:57 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: krish EMR wont cost you much with all the testing and data we ran through the test systems as well as the large amont of data when everythign was read we paid about 15.00 USD. I honestly do not think that the specs there would be enough as java can be pretty ram hungry. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 00:41, Krish Donald wrote: Hi, I am new to AWS and would like to setup Hadoop cluster using cloudera manager for 6-7 nodes. t2.micro on AWS; Is it enough for setting up Hadoop cluster ? I would like to use free service as of now. Please advise. Thanks Krish
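The shell redirection operators in the quoted user-data script appear to have been stripped by the mail archive; a reconstructed sketch of the same idea (CHANGE_HOST_NAME_HERE remains a placeholder) would look roughly like this:

#!/bin/bash
HOSTNAME_TAG=CHANGE_HOST_NAME_HERE
# rewrite /etc/sysconfig/network with the chosen hostname
cat > /etc/sysconfig/network <<EOF
NETWORKING=yes
NETWORKING_IPV6=no
HOSTNAME=${HOSTNAME_TAG}
EOF
# look up this instance's private IP from the EC2 metadata service
IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
# map the name to the IP locally and apply the hostname
echo "${IP} ${HOSTNAME_TAG}.localhost ${HOSTNAME_TAG}" >> /etc/hosts
echo "${HOSTNAME_TAG}" > /proc/sys/kernel/hostname
service network restart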
Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?
I think EMR has its own limitation e.g. I want to setup hadoop 2.6.0 with kerberos + hive-1.2.0 to test my hive patch. How EMR can help me? it supports hadoop up to 2.4.0 (not even 2.4.1) http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-hadoop-version.html On Thu, Mar 5, 2015 at 9:51 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: Hi guys I know you guys want to keep costs down, but why go through all the effort to setup ec2 instances when you deploy EMR it takes the time to provision and setup the ec2 instances for you. All configuration then for the entire cluster is done on the master node of the particular cluster or setting up of additional software that is all done through the EMR console. We were doing some geospatial calculations and we loaded a 3rd party jar file called esri into the EMR cluster. I then had to pass a small bootstrap action (script) to have it distribute esri to the entire cluster. Why are you guys reinventing the wheel? --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 03:35, Alexander Pivovarov wrote: I found the following solution to this problem I registered 2 subdomains (public and local) for each computer on https://freedns.afraid.org/subdomain/ e.g. myhadoop-nn.crabdance.com myhadoop-nn-local.crabdance.com then I added cron job which sends http requests to update public and local ip on freedns server hint: public ip is detected automatically ip address for local name can be set using request parameter address=10.x.x.x (don't forget to escape ) as a result my nn computer has 2 DNS names with currently assigned ip addresses , e.g. myhadoop-nn.crabdance.com 54.203.181.177 myhadoop-nn-local.crabdance.com 10.220.149.103 in hadoop configuration I can use local machine names to access my cluster outside of AWS I can use public names Just curious if AWS provides easier way to name EC2 computers? On Thu, Mar 5, 2015 at 5:19 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: I dont know how you would do that to be honest. With EMR you have destinctions master core and task nodes. If you need to change configuration you just ssh into the EMR master node. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 02:11, Alexander Pivovarov wrote: What is the easiest way to assign names to aws ec2 computers? I guess computer need static hostname and dns name before it can be used in hadoop cluster. On Mar 5, 2015 4:36 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: When I started with EMR it was alot of testing and trial and error. HUE is already supported as something that can be installed from the AWS console. What I need to know is if you need this cluster on all the time or this is goign ot be what amazon call a transient cluster. Meaning you fire it up run the job and tear it back down. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 01:10, Krish Donald wrote: Thanks Jonathan, I will try to explore EMR option also. Can you please let me know the configuration which you have used it? Can you please recommend for me also? 
I would like to setup Hadoop cluster using cloudera manager and then would like to do below things: setup kerberos setup federation setup monitoring setup hadr backup and recovery authorization using sentry backup and recovery of individual componenets performamce tuning upgrade of cdh upgrade of CM Hue User Administration Spark Solr Thanks Krish On Thu, Mar 5, 2015 at 3:57 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: krish EMR wont cost you much with all the testing and data we ran through the test systems as well as the large amont of data when everythign was read we paid about 15.00 USD. I honestly do not think that the specs there would be enough as java can be pretty ram hungry. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 00:41, Krish Donald wrote: Hi, I am new to AWS and would like to setup Hadoop cluster using cloudera manager for 6-7 nodes. t2.micro on AWS; Is it enough for setting up Hadoop cluster ? I would like to use free service as of now. Please advise. Thanks Krish
Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?
What is the easiest way to assign names to aws ec2 computers? I guess computer need static hostname and dns name before it can be used in hadoop cluster. On Mar 5, 2015 4:36 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: When I started with EMR it was alot of testing and trial and error. HUE is already supported as something that can be installed from the AWS console. What I need to know is if you need this cluster on all the time or this is goign ot be what amazon call a transient cluster. Meaning you fire it up run the job and tear it back down. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 01:10, Krish Donald wrote: Thanks Jonathan, I will try to explore EMR option also. Can you please let me know the configuration which you have used it? Can you please recommend for me also? I would like to setup Hadoop cluster using cloudera manager and then would like to do below things: setup kerberos setup federation setup monitoring setup hadr backup and recovery authorization using sentry backup and recovery of individual componenets performamce tuning upgrade of cdh upgrade of CM Hue User Administration Spark Solr Thanks Krish On Thu, Mar 5, 2015 at 3:57 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: krish EMR wont cost you much with all the testing and data we ran through the test systems as well as the large amont of data when everythign was read we paid about 15.00 USD. I honestly do not think that the specs there would be enough as java can be pretty ram hungry. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 00:41, Krish Donald wrote: Hi, I am new to AWS and would like to setup Hadoop cluster using cloudera manager for 6-7 nodes. t2.micro on AWS; Is it enough for setting up Hadoop cluster ? I would like to use free service as of now. Please advise. Thanks Krish
Re: t2.micro on AWS; Is it enough for setting up Hadoop cluster ?
ok, how we can easily put all hadoop computer names and IPs to /etc/hosts on all computers? Do you have a script? or I need manually go to each computer, get its ip and put it to /etc/hosts and then distribute /etc/hosts to all machines? Don't you think one time effort to configure freedns is easier? freedns solution works with AWS spot-instances as well. You need to create snapshot after you configure freedns, hadoop, etc on particular box. Next time you need computer you can can go to your saved snapshots and create spot-instance from it. On Thu, Mar 5, 2015 at 6:54 PM, max scalf oracle.bl...@gmail.com wrote: unfortunately without DNS you have to rely on /etc/hosts, so put in entry for all your nodes(nn,snn,dn1,dn2 etc..) on all nodes(/etc/hosts file) and i have that tested for hortonworks(using ambari) and cloudera manager and i am certainly sure it will work for MapR On Thu, Mar 5, 2015 at 8:47 PM, Alexander Pivovarov apivova...@gmail.com wrote: what about DNS? if you have 2 computers (nn and dn) how nn knows dn ip? The script puts only this computer ip to /etc/hosts On Thu, Mar 5, 2015 at 6:39 PM, max scalf oracle.bl...@gmail.com wrote: Here is a easy way to go about assigning static name to your ec2 instance. When you get the launch an EC2-instance from aws console when you get to the point of selecting VPC, ip address screen there is a screen that says USER DATA...put the below in with appropriate host name(change CHANGE_HOST_NAME_HERE to whatever you want) and that should be able to get you static name. #!/bin/bash HOSTNAME_TAG=CHANGE_HOST_NAME_HERE cat /etc/sysconfig/network EOF NETWORKING=yes NETWORKING_IPV6=no HOSTNAME=${HOSTNAME_TAG} EOF IP=$(curl http://169.254.169.254/latest/meta-data/local-ipv4) echo ${IP} ${HOSTNAME_TAG}.localhost ${HOSTNAME_TAG} /etc/hosts echo ${HOSTNAME_TAG} /proc/sys/kernel/hostname service network restart Also note i was able to do this on couple of spot instance for cheap price, only thing is once you shut it down or someone outbids you, you loose that instance but its easy/cheap to play around with and i have used couple of m3.medium for my NN/SNN and couple of them for data nodes... On Thu, Mar 5, 2015 at 7:19 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: I dont know how you would do that to be honest. With EMR you have destinctions master core and task nodes. If you need to change configuration you just ssh into the EMR master node. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 02:11, Alexander Pivovarov wrote: What is the easiest way to assign names to aws ec2 computers? I guess computer need static hostname and dns name before it can be used in hadoop cluster. On Mar 5, 2015 4:36 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: When I started with EMR it was alot of testing and trial and error. HUE is already supported as something that can be installed from the AWS console. What I need to know is if you need this cluster on all the time or this is goign ot be what amazon call a transient cluster. Meaning you fire it up run the job and tear it back down. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 01:10, Krish Donald wrote: Thanks Jonathan, I will try to explore EMR option also. Can you please let me know the configuration which you have used it? Can you please recommend for me also? 
I would like to setup Hadoop cluster using cloudera manager and then would like to do below things: setup kerberos setup federation setup monitoring setup hadr backup and recovery authorization using sentry backup and recovery of individual componenets performamce tuning upgrade of cdh upgrade of CM Hue User Administration Spark Solr Thanks Krish On Thu, Mar 5, 2015 at 3:57 PM, Jonathan Aquilina jaquil...@eagleeyet.net wrote: krish EMR wont cost you much with all the testing and data we ran through the test systems as well as the large amont of data when everythign was read we paid about 15.00 USD. I honestly do not think that the specs there would be enough as java can be pretty ram hungry. --- Regards, Jonathan Aquilina Founder Eagle Eye T On 2015-03-06 00:41, Krish Donald wrote: Hi, I am new to AWS and would like to setup Hadoop cluster using cloudera manager for 6-7 nodes. t2.micro on AWS; Is it enough for setting up Hadoop cluster ? I would like to use free service as of now. Please advise. Thanks Krish
Re: Kerberos Security in Hadoop
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SecureMode.html https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Managing_Smart_Cards/Configuring_a_Kerberos_5_Server.html On Wed, Feb 18, 2015 at 4:49 PM, Krish Donald gotomyp...@gmail.com wrote: Hi, Has anybody worked on Kerberos security on Hadoop? Can you please guide me? Any document link will be appreciated. Thanks Krish
Re: Kerberos Security in Hadoop
http://www.cloudera.com/content/cloudera/en/documentation/core/v5-2-x/topics/cdh_sg_authentication.html On Wed, Feb 18, 2015 at 5:49 PM, Manoj Samel manojsamelt...@gmail.com wrote: Cloudera also has good documentation on setting up a Kerberos-based cluster - this can be used even if you are not using Cloudera Manager to set up your cluster. On Wed, Feb 18, 2015 at 4:51 PM, Alexander Pivovarov apivova...@gmail.com wrote: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SecureMode.html https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Managing_Smart_Cards/Configuring_a_Kerberos_5_Server.html On Wed, Feb 18, 2015 at 4:49 PM, Krish Donald gotomyp...@gmail.com wrote: Hi, Has anybody worked on Kerberos security on Hadoop? Can you please guide me? Any document link will be appreciated. Thanks Krish
Re: Copying many files to HDFS
Hi Kevin, What is the network throughput between 1. the NFS server and the client machine? 2. the client machine and the datanodes? Alex On Feb 13, 2015 5:29 AM, Kevin kevin.macksa...@gmail.com wrote: Hi, I am setting up a Hadoop cluster (CDH5.1.3) and I need to copy a thousand or so files into HDFS, which totals roughly 1 TB. The cluster will be isolated on its own private LAN with a single client machine that is connected to the Hadoop cluster as well as the public network. The data that needs to be copied into HDFS is mounted as an NFS share on the client machine. I can run `hadoop fs -put` concurrently on the client machine to try and increase the throughput. If these files were able to be accessed by each node in the Hadoop cluster, then I could write a MapReduce job to copy a number of files from the network into HDFS. I could not find anything in the documentation saying that `distcp` works with locally hosted files (its code in the tools package doesn't show any sign of it either) - but I wouldn't expect it to. In general, are there any other ways of copying a very large number of client-local files to HDFS? I searched the mail archives for a similar question and didn't come across one. I'm sorry if this is a duplicate question. Thanks for your time, Kevin
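One simple way to run several puts concurrently from the client, as Kevin describes, is to fan the file list out with xargs; a sketch, where the NFS mount point, the HDFS target directory, and the parallelism level are all placeholders:

# copy files from the NFS mount into HDFS, 8 single-file puts at a time
find /mnt/nfs/data -type f | xargs -P 8 -I{} hadoop fs -put {} /user/kevin/incoming/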
Re: Building for Windows
try mvn package -Pdist -Dtar -DskipTests On Wed, Feb 11, 2015 at 2:02 PM, Lucio Crusca lu...@sulweb.org wrote: Hello everybody, I'm absolutely new to hadoop and a customer asked me to build version 2.6 for Windows Server 2012 R2. I'm myself a java programmer, among other things, but I've never used hadoop before. I've downloaded and installed JDK7, Maven, Cygwin (for sh, mv, gzip, ...) and other toys specified in the BUILDING.txt file bundled with hadoop sources. I also set the PATH and other environment variables (JAVA_HOME, ZLIB_HOME, Platform, ...). Running mvn package -Pdist -Dtar it compiles everything till MiniKDC included, but then it fails a test after compiling Auth. Here you can see the full output of the mvn command: http://hastebin.com/aqixebojuv.tex Can you help me understand what I'm doing wrong? Thanks in advance Lucio.
Re: Building for Windows
in addition to skipTests you want to add native-win profile mvn clean package -Pdist,native-win -DskipTests -Dtar this command must be run from a Windows SDK command prompt (not cygwin) as documented in BUILDING.txt. A successful build generates a binary hadoop .tar.gz package in hadoop-dist\target\. https://wiki.apache.org/hadoop/Hadoop2OnWindows https://svn.apache.org/viewvc/hadoop/common/branches/branch-2/BUILDING.txt?view=markup On Wed, Feb 11, 2015 at 3:09 PM, Alexander Pivovarov apivova...@gmail.com wrote: try mvn package -Pdist -Dtar -DskipTests On Wed, Feb 11, 2015 at 2:02 PM, Lucio Crusca lu...@sulweb.org wrote: Hello everybody, I'm absolutely new to hadoop and a customer asked me to build version 2.6 for Windows Server 2012 R2. I'm myself a java programmer, among other things, but I've never used hadoop before. I've downloaded and installed JDK7, Maven, Cygwin (for sh, mv, gzip, ...) and other toys specified in the BUILDING.txt file bundled with hadoop sources. I also set the PATH and other environment variables (JAVA_HOME, ZLIB_HOME, Platform, ...). Running mvn package -Pdist -Dtar it compiles everything till MiniKDC included, but then it fails a test after compiling Auth. Here you can see the full output of the mvn command: http://hastebin.com/aqixebojuv.tex Can you help me understand what I'm doing wrong? Thanks in advance Lucio.
Re: Building for Windows
There are about 3000 tests. You need a particular box configuration to run all of them successfully, and you should have lots of memory. It takes at least 1 hour to run all tests. Look at the Hadoop pre-commit builds on Jenkins: https://builds.apache.org/job/PreCommit-HADOOP-Build/ On Wed, Feb 11, 2015 at 3:55 PM, Lucio Crusca lu...@sulweb.org wrote: On Wednesday 11 February 2015 at 15:17:23, Alexander Pivovarov wrote: in addition to skipTests you want to add the native-win profile: mvn clean package -Pdist,native-win -DskipTests -Dtar Ok thanks but... what's the point of having tests in place if you have to skip them in order to build? this command must be run from a Windows SDK command prompt (not cygwin) Yes, I was already doing that; Cygwin is only installed to provide a few required unix commands (from BUILDING.txt: * Unix command-line tools from GnuWin32 or Cygwin: sh, mkdir, rm, cp, tar, gzip). For some reason I don't remember, I had problems with GnuWin32 and went for Cygwin instead.
Re: Multiple separate Hadoop clusters on same physical machines
Start several VMs and install Hadoop on each VM. Keywords: KVM, QEMU. On Mon, Jan 26, 2015 at 1:18 AM, Harun Reşit Zafer harun.za...@tubitak.gov.tr wrote: Hi everyone, We have set up and been playing with Hadoop 1.2.x and its friends (HBase, Pig, Hive etc.) on 7 physical servers. We want to test Hadoop (maybe different versions) and its ecosystem on physical machines (virtualization is not an option) from different perspectives. As a bunch of developers we would like to work in parallel, with every team member playing with his/her own cluster. However, we have a limited number of servers (strong machines though). So the question is: by changing port numbers, environment variables and other configuration parameters, is it possible to set up several independent clusters on the same physical machines? Are there any constraints? What are the possible difficulties we are likely to face? Thanks in advance -- Harun Reşit Zafer TÜBİTAK BİLGEM BTE Bulut Bilişim ve Büyük Veri Analiz Sistemleri Bölümü T +90 262 675 3268 W http://www.hrzafer.com
Re: way to add custom udf jar in hadoop 2.x version
I found that the easiest way is to put the UDF jar into /usr/lib/hadoop-mapred on all computers in the cluster. Hive CLI, HiveServer2, the Oozie launcher, Oozie hive actions and MR will then see the jar. I'm using HDP-2.1.5. On Dec 30, 2014 10:58 PM, reena upadhyay reena2...@gmail.com wrote: Hi, I am using hadoop 2.4.0. I have created a custom UDF jar. I am trying to execute a simple select UDF query using a Java Hive JDBC client program. When Hive executes the query using a map reduce job, the query execution fails because the mapper is not able to locate the UDF class. So I want to add the UDF jar to the hadoop environment permanently. Please suggest a way to add this external jar for single node and multi node hadoop clusters. PS: I am using hive 0.13.1 and I already have this custom UDF jar added in the HIVE_HOME/lib directory. Thanks
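Two hedged ways to make that concrete. The first is the copy-to-every-node approach from the reply (the host names and jar name are placeholders); the second uses the permanent-function syntax available since Hive 0.13, which references a jar kept on HDFS so every node can fetch it (the class name and paths are also placeholders).

# push the UDF jar to the directory mentioned above, on every node
for h in node1 node2 node3; do scp my-udfs.jar $h:/usr/lib/hadoop-mapred/; done

-- or register a permanent function backed by a jar on HDFS (Hive 0.13+)
CREATE FUNCTION my_udf AS 'com.example.udf.MyUDF' USING JAR 'hdfs:///user/hive/jars/my-udfs.jar';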
Re: Hardware requirements for simple node hadoop cluster
For a balanced configuration you need, per core: 1-1.5 x 2 TB 7200 rpm SATA HDDs for HDFS (in JBOD mode, not RAID) and 3-4 GB of ECC RAM. Reserve 4 GB RAM for the OS; it is better to use a separate HDD or USB stick for the OS. E.g. for 16 cores you can use 16-24 x 2 TB HDDs and 64 GB RAM (if planning to use Apache Spark, put in 128 GB). On Sun, Dec 7, 2014 at 12:08 AM, Amjad Syed amjad...@gmail.com wrote: Hello, We are trying to do a proof of concept at our data center with a two node Hadoop cluster. We have two (dual socket quad core) HP ProLiant DL380 G6 servers we want to utilize for this test. Can anyone please recommend the minimum HDD and RAM requirements for both servers? Thanks
Re: High Availability hadoop cluster.
2 boxes for the 2 NNs (dedicated boxes are better), a minimum of 3 JNs and a minimum of 3 ZKs; the JNs and ZKs can share boxes with other services. On Wed, Nov 5, 2014 at 11:31 PM, Oleg Ruchovets oruchov...@gmail.com wrote: Hello. We are using the Hortonworks distribution and want to evaluate HA capabilities. Can the community please share best practices and potential problems? Is additional hardware required? Thanks Oleg.
Re: High Availability hadoop cluster.
For 17 box cluster it's probably good to run 5 ZKs and 5 JNs So, run 2 ZKs on 2 NNs 3 ZKs on 3 DNs same for JNs you can start additional ZKs and JNs after you are done with initial Enabling HA in Ambari On Thu, Nov 6, 2014 at 3:01 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Great. Thank you for the link. Just to be sure - JN can be installed on data nodes like zookeeper? If we have 2 Name Nodes and 15 Data Nodes - is it correct to install ZK and JN on datanodes machines? Thanks Oleg. On Thu, Nov 6, 2014 at 5:06 PM, Alexander Pivovarov apivova...@gmail.com wrote: To Enable HA open Ambari, go to Admin, select HA, click enable HA http://dev.hortonworks.com.s3.amazonaws.com/HDPDocuments/HDP2/HDP-2.0.9.0/bk_using_Ambari_book/content/install-ha_2x.html On Thu, Nov 6, 2014 at 12:45 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Our deployment is 15 data nodes and 2 NN. ZKs installed also as part of hortonworks distributions. Sorry for dummy question - what is JN is? Can you please please point me on some manual wiki for installation / configuration. Thanks Oleg. On Thu, Nov 6, 2014 at 4:04 PM, Alexander Pivovarov apivova...@gmail.com wrote: 2 boxes for 2 NNs (better dedicated boxes) min 3 JNs min 3 ZKs JNs and ZKs can share boxes with other services On Wed, Nov 5, 2014 at 11:31 PM, Oleg Ruchovets oruchov...@gmail.com wrote: Hello. We are using hortonwork distribution and want to evaluate HA capabilities. Can community please share the best practices and potential problems? Is it required additional hardware? Thanks Oleg.
Re: problems with Hadoop instalation
Are RHEL7 based OSs supported? On Wed, Oct 29, 2014 at 3:59 PM, David Novogrodsky david.novogrod...@gmail.com wrote: All, I am new to Hadoop so any help would be appreciated. I have a question for the mailing list regarding Hadoop. I have installed the most recent stable version (2.4.1) on a virtual machine running CentOS 7. I have tried to run this command %Hadoop -fs ls but without success. The question is, what does Hadoop consider a valid JAVA_HOME directory? And where should the JAVA_HOME directory variable be defined? I installed Java using the package manager yum. I installed the most recent version, detailed below. This is in my .bashrc file: # The java implementation to use. export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64 [david@localhost ~]$ hadoop fs -ls /usr/local/hadoop/bin/hadoop: line 133: /usr/lib/jvm/jre-1.7.0-openjdk.x86_64/bin/java: No such file or directory Then I tried this value for JAVA_HOME in my .bashrc file: /usr/bin/java. [david@localhost ~]$ which java /usr/bin/java [david@localhost ~]$ java -version java version 1.7.0_71 OpenJDK Runtime Environment (rhel-2.5.3.1.el7_0-x86_64 u71-b14) OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode) Here is the result: [david@localhost ~]$ hadoop fs -ls /usr/local/hadoop/bin/hadoop: line 133: /usr/bin/java/bin/java: Not a directory /usr/local/hadoop/bin/hadoop: line 133: exec: /usr/bin/java/bin/java: cannot execute: Not a directory David Novogrodsky
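For what it's worth, Hadoop expects JAVA_HOME to point at the JDK/JRE install directory (the one that contains bin/java), not at the java binary itself, and the launcher scripts read it from hadoop-env.sh rather than from .bashrc. A sketch, where the exact path is an assumption and should be whatever your yum install actually created:

# etc/hadoop/hadoop-env.sh
# JAVA_HOME must be the directory that contains bin/java
export JAVA_HOME=/usr/lib/jvm/jre-1.7.0-openjdk.x86_64

# or derive it from whichever java is on the PATH (resolves the alternatives symlink)
export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))

The "/usr/lib/jvm/.../bin/java: No such file or directory" message means that exact path does not exist on the box; a quick check is ls $JAVA_HOME/bin/java after setting the variable.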
Re: Spark vs Tez
AMPLab (the Berkeley lab where Spark was created) did some benchmarks: https://amplab.cs.berkeley.edu/benchmark/ On Fri, Oct 17, 2014 at 11:06 AM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: Does anybody have any performance figures on how Spark stacks up against Tez? If you don’t have figures, does anybody have an opinion? Spark seems so popular but I’m not really seeing why. B.
Re: Spark vs Tez
There is going to be a Spark engine for Hive (in addition to mr and tez). The Spark API is available for Java and Python as well. The Tez engine is available now and it's quite stable. As for speed: for complex queries it shows a 10x-20x improvement over the mr engine, e.g. one of my queries runs 30 min using mr (about 100 mr jobs); if I switch to tez it's done in 100 sec. I'm using HDP-2.1.5 (hive-0.13.1, tez 0.4.1) On Fri, Oct 17, 2014 at 11:23 AM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: It was my understanding that Spark is faster batch processing. Tez is the new execution engine that replaces MapReduce and is also supposed to speed up batch processing. Is that not correct? B. *From:* Shahab Yunus shahab.yu...@gmail.com *Sent:* Friday, October 17, 2014 1:12 PM *To:* user@hadoop.apache.org *Subject:* Re: Spark vs Tez What aspects of Tez and Spark are you comparing? They have different purposes and thus are not directly comparable, as far as I understand. Regards, Shahab On Fri, Oct 17, 2014 at 2:06 PM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: Does anybody have any performance figures on how Spark stacks up against Tez? If you don’t have figures, does anybody have an opinion? Spark seems so popular but I’m not really seeing why. B.
Re: No space when running a hadoop job
It can read/write in parallel to all drives; more HDDs, more IO speed. On Sep 27, 2014 7:28 AM, Susheel Kumar Gadalay skgada...@gmail.com wrote: Correct me if I am wrong. Adding multiple directories will not balance the file distribution across these locations. Hadoop will exhaust the first directory and then start using the next, next... How can I tell Hadoop to evenly balance across these directories? On 9/26/14, Matt Narrell matt.narr...@gmail.com wrote: You can add a comma separated list of paths to the “dfs.datanode.data.dir” property in your hdfs-site.xml mn On Sep 26, 2014, at 8:37 AM, Abdul Navaz navaz@gmail.com wrote: Hi, I am facing a space issue when I save files into HDFS and/or run a map reduce job.
root@nn:~# df -h
Filesystem                                        Size  Used  Avail  Use%  Mounted on
/dev/xvda2                                        5.9G  5.9G      0  100%  /
udev                                               98M  4.0K    98M    1%  /dev
tmpfs                                              48M  192K    48M    1%  /run
none                                              5.0M     0   5.0M    0%  /run/lock
none                                              120M     0   120M    0%  /run/shm
overflow                                          1.0M  4.0K  1020K    1%  /tmp
/dev/xvda4                                        7.9G  147M   7.4G    2%  /mnt
172.17.253.254:/q/groups/ch-geni-net/Hadoop-NET   198G  108G    75G   59%  /groups/ch-geni-net/Hadoop-NET
172.17.253.254:/q/proj/ch-geni-net                198G  108G    75G   59%  /proj/ch-geni-net
root@nn:~#
I can see there is no space left on /dev/xvda2. How can I make hadoop see the newly mounted /dev/xvda4? Or do I need to move the files manually from /dev/xvda2 to xvda4? Thanks Regards, Abdul Navaz Research Assistant University of Houston Main Campus, Houston TX Ph: 281-685-0388
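To make the DataNode actually use the new mount, list both directories in dfs.datanode.data.dir; a sketch, where the concrete paths are placeholders for wherever the old data dir lives and a directory you create on the /mnt disk:

<!-- hdfs-site.xml on the datanode -->
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/var/hadoop/dfs/data,/mnt/hadoop/dfs/data</value>
</property>
<!-- optional: prefer volumes with more free space instead of pure round-robin -->
<property>
  <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
  <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
</property>

New blocks are spread across the listed directories (round-robin by default), but existing blocks are not rebalanced automatically, so the full disk stays full until data is rewritten.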
Re: Tez and MapReduce
Yes. E.g. in Hive, to switch engines: set hive.execution.engine=mr; or set hive.execution.engine=tez; Tez is faster, especially on complex queries. On Aug 31, 2014 10:33 PM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: Can Tez and MapReduce live together and get along in the same cluster? B.
Re: How to serialize very large object in Hadoop Writable?
Max array size is max integer. So, byte array can not be bigger than 2GB On Aug 22, 2014 1:41 PM, Yuriy yuriythe...@gmail.com wrote: Hadoop Writable interface relies on public void write(DataOutput out) method. It looks like behind DataOutput interface, Hadoop uses DataOutputStream, which uses a simple array under the cover. When I try to write a lot of data in DataOutput in my reducer, I get: Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:3230) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.DataOutputStream.write(DataOutputStream.java:107) at java.io.FilterOutputStream.write(FilterOutputStream.java:97) Looks like the system is unable to allocate the continuous array of the requested size. Apparently, increasing the heap size available to the reducer does not help - it is already at 84GB (-Xmx84G) If I cannot reduce the size of the object that I need to serialize (as the reducer constructs this object by combining the object data), what should I try to work around this problem? Thanks, Yuriy
Re: How to serialize very large object in Hadoop Writable?
Usually Hadoop MapReduce deals with row-based data (ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>). If you need to write a lot of data to an HDFS file you can get an OutputStream to the HDFS file and write the bytes yourself. On Fri, Aug 22, 2014 at 3:30 PM, Yuriy yuriythe...@gmail.com wrote: Thank you, Alexander. That, at least, explains the problem. And what should be the workaround if the combined set of data is larger than 2 GB? On Fri, Aug 22, 2014 at 1:50 PM, Alexander Pivovarov apivova...@gmail.com wrote: Max array size is max integer. So, byte array can not be bigger than 2GB On Aug 22, 2014 1:41 PM, Yuriy yuriythe...@gmail.com wrote: Hadoop Writable interface relies on public void write(DataOutput out) method. It looks like behind DataOutput interface, Hadoop uses DataOutputStream, which uses a simple array under the cover. When I try to write a lot of data in DataOutput in my reducer, I get: Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit at java.util.Arrays.copyOf(Arrays.java:3230) at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at java.io.DataOutputStream.write(DataOutputStream.java:107) at java.io.FilterOutputStream.write(FilterOutputStream.java:97) Looks like the system is unable to allocate the continuous array of the requested size. Apparently, increasing the heap size available to the reducer does not help - it is already at 84GB (-Xmx84G) If I cannot reduce the size of the object that I need to serialize (as the reducer constructs this object by combining the object data), what should I try to work around this problem? Thanks, Yuriy
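A sketch of that workaround: stream the reducer's output straight into an HDFS file instead of building one giant byte array first. The output path, value type and class names below are made up for illustration:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class StreamingReducer extends Reducer<Text, BytesWritable, NullWritable, NullWritable> {
  private FSDataOutputStream out;

  @Override
  protected void setup(Context context) throws IOException {
    FileSystem fs = FileSystem.get(context.getConfiguration());
    // one side file per reduce task; the path is just an example
    Path path = new Path("/tmp/large-output/part-" + context.getTaskAttemptID().getTaskID().getId());
    out = fs.create(path, true);
  }

  @Override
  protected void reduce(Text key, Iterable<BytesWritable> values, Context context) throws IOException {
    // write each chunk as it arrives instead of concatenating everything into one byte[]
    for (BytesWritable chunk : values) {
      out.write(chunk.getBytes(), 0, chunk.getLength());
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    out.close();
  }
}

The trade-off is that the file is written outside the normal OutputFormat/committer path, so speculative execution should be disabled or the file name made attempt-specific to avoid two attempts writing the same path.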
Re: Setting Up First Hadoop / Yarn Cluster
Probably permission issue. On Thu, Jul 31, 2014 at 11:32 AM, Houston King houston.k...@gmail.com wrote: Hey Everyone, I'm a noob working to setup my first 13 node Hadoop 2.4.0 cluster, and I've run into some problems that I'm having a heck of a time debugging. I've been following the guide posted at http://www.implementhadoop.com/install-hadoop-2-4-0-multi-node-cluster/ to setup the cluster. I've gotten through the guide, but, when I attempt to run either the wordcount, pi, or randomwriter examples most / all my tasks end up failing: 14/07/31 12:23:14 INFO mapreduce.Job: map 0% reduce 0% 14/07/31 12:23:28 INFO mapreduce.Job: Task Id : attempt_1406829336833_0002_m_00_0, Status : FAILED 14/07/31 12:23:42 INFO mapreduce.Job: Task Id : attempt_1406829336833_0002_m_00_1, Status : FAILED 14/07/31 12:23:56 INFO mapreduce.Job: Task Id : attempt_1406829336833_0002_m_00_2, Status : FAILED 14/07/31 12:24:12 INFO mapreduce.Job: map 100% reduce 100% 14/07/31 12:24:13 INFO mapreduce.Job: Job job_1406829336833_0002 failed with state FAILED due to: Task failed task_1406829336833_0002_m_00 I've been trying to figure out if I have a configuration problem or where in the logfiles the problem is described, but without much luck. At this point, I'm looking for any help I can get to get this cluster going. I appreciate any and all suggestions! Thanks ~Houston King
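The task attempt logs usually say exactly why the containers died; a couple of commands worth trying, where the application id is whatever the job client printed (here it would be application_1406829336833_0002):

# aggregated logs for the failed application (needs yarn.log-aggregation-enable=true)
yarn logs -applicationId application_1406829336833_0002

# otherwise look at the NodeManager's local container logs on the worker that ran the attempt,
# by default under the Hadoop log directory, e.g.
ls $HADOOP_HOME/logs/userlogs/application_1406829336833_0002/

Permission problems typically show up there as AccessControlException or "Permission denied" when the container tries to create its working or staging directories.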
Re: doubt
It's enough; Hadoop daemons use only 1 GB of heap each by default. On Sat, Jan 18, 2014 at 10:11 PM, sri harsha rsharsh...@gmail.com wrote: Hi, I want to install a 4 node cluster on 64-bit Linux. Is 4 GB RAM and a 500 GB HDD enough for this, or do I need to expand? Please suggest about my query. thanx -- amiable harsha
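The per-daemon heap is controlled in hadoop-env.sh if you ever need to trim or grow it; a sketch, with placeholder values:

# etc/hadoop/hadoop-env.sh
# heap for HDFS daemons (NameNode/DataNode), in MB; defaults to 1000
export HADOOP_HEAPSIZE=1024

# etc/hadoop/yarn-env.sh (Hadoop 2.x)
# heap for YARN daemons (ResourceManager/NodeManager)
export YARN_HEAPSIZE=1024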
Re: HDFS disk space requirement
Finish elementary school first (plus and minus operations at least). The arithmetic: 115 GB at replication 5 needs 575 GB of raw HDFS space, and even at replication 3 it needs 345 GB, both far more than the 130 GB you have. On Thu, Jan 10, 2013 at 7:23 PM, Panshul Whisper ouchwhis...@gmail.com wrote: Thank you for the response. Actually it is not a single file; I have JSON files that amount to 115 GB, and these JSON files need to be processed and loaded into HBase tables on the same cluster for later processing. Not considering the disk space required for the HBase storage, if I reduce the replication to 3, how much more HDFS space will I require? Thank you, On Fri, Jan 11, 2013 at 4:16 AM, Ravi Mutyala r...@hortonworks.com wrote: If the file is a txt file, you could get a good compression ratio. Change the replication to 3 and the file will fit. But not sure what your use case is and what you want to achieve by putting this data there. Any transformation on this data and you would need more space to save the transformed data. If you have 5 nodes and they are not virtual machines, you should consider adding more hard disks to your cluster. On Thu, Jan 10, 2013 at 9:02 PM, Panshul Whisper ouchwhis...@gmail.com wrote: Hello, I have a hadoop cluster of 5 nodes with a total available HDFS space of 130 GB with replication set to 5. I have a file of 115 GB, which needs to be copied to HDFS and processed. Do I need any more HDFS space to perform all processing without running into problems, or is this space sufficient? -- Regards, Ouch Whisper 010101010101 -- Regards, Ouch Whisper 010101010101
Re: Which hardware to choose
Not sure; the following options are available for that box: Integrated ICH10R on the motherboard, LSI® 6Gb SAS2008 daughtercard, Dell PERC H200, Dell PERC H700, LSI MegaRAID® SAS 9260-8i. http://www.dell.com/us/enterprise/p/poweredge-c2100/pd On Tue, Oct 2, 2012 at 10:59 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Great, thank you for such detailed information. By the way, what type of disk controller do you use? Thanks Oleg. On Tue, Oct 2, 2012 at 6:34 AM, Alexander Pivovarov apivova...@gmail.com wrote: Hi Oleg, Cloudera and Dell set up the following cluster for my company. The company receives 1.5 TB of raw data per day. 38 data nodes + 2 Name Nodes Data Node: Dell PowerEdge C2100 series 2 x XEON x5670 48 GB RAM ECC (12x4GB 1333MHz) 12 x 2 TB 7200 RPM SATA HDD (with hot swap) JBOD Intel Gigabit ET Dual port PCIe x4 Redundant Power Supply Hadoop CDH3 max map tasks 24 max reduce tasks 8 Name Node and Secondary Name Node are similar but with 96GB RAM (not sure why) and 6 x 600GB 15K RPM Serial SCSI in RAID10. Another config is here, page 298 http://books.google.com/books?id=Wu_xeGdU4G8Cpg=PA298lpg=PA298dq=hadoop+jbodsource=blots=i7xVQBPb_wsig=8mhq-MtpkRcTiRB1ioKciMxIasghl=ensa=Xei=AGtqUMK6D8T10gHD4ICQAQved=0CEMQ6AEwAg#v=onepageq=hadoop%20jbodf=false You probably need just 1 computer with 10 x 2 TB SATA HDD. On Mon, Oct 1, 2012 at 6:02 PM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi, We are at a very early stage of our hadoop project and want to do a POC. We have ~ 5-6 terabytes of raw data and we are going to execute some aggregations. We plan to use 8 - 10 machines. Questions: 1) Which hardware should we use: a) How many discs, and which discs are better to use? b) How much RAM? c) How many CPUs? 2) Please share best practices and tips / tricks related to utilising hardware for hadoop projects. Thanks in advance Oleg.
Re: Which hardware to choose
All configs are per node. No HBase; only Hive and Pig are installed. On Tue, Oct 2, 2012 at 9:40 PM, Michael Segel michael_se...@hotmail.com wrote: I think he's saying that it's 24 maps and 8 reducers per node, and at 48GB that could be too many mappers. Especially if they want to run HBase. On Oct 2, 2012, at 8:14 PM, hadoopman hadoop...@gmail.com wrote: Only 24 map and 8 reduce tasks for 38 data nodes? Are you sure that's right? Sounds VERY low for a cluster that size. We have only 10 C2100's and are running I believe 140 map and 70 reduce slots so far with pretty decent performance. On 10/02/2012 12:55 PM, Alexander Pivovarov wrote: 38 data nodes + 2 Name Nodes Data Node: Dell PowerEdge C2100 series 2 x XEON x5670 48 GB RAM ECC (12x4GB 1333MHz) 12 x 2 TB 7200 RPM SATA HDD (with hot swap) JBOD Intel Gigabit ET Dual port PCIe x4 Redundant Power Supply Hadoop CDH3 max map tasks 24 max reduce tasks 8
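For anyone reproducing this on CDH3 / MRv1, the per-node slot counts live in mapred-site.xml on each TaskTracker; a sketch using the numbers discussed above:

<!-- mapred-site.xml on each worker: slots are per TaskTracker, not per cluster -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>24</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
</property>

Roughly, map slots plus reduce slots should not exceed what the cores and RAM of the box can carry; here 24 + 8 = 32 task slots on a 12-core (24-thread), 48 GB node.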
Re: Which hardware to choose
Hi Oleg, Cloudera and Dell set up the following cluster for my company. The company receives 1.5 TB of raw data per day: 38 data nodes + 2 Name Nodes.
Data Node: Dell PowerEdge C2100 series
2 x XEON x5670
48 GB RAM ECC (12x4GB 1333MHz)
12 x 2 TB 7200 RPM SATA HDD (with hot swap), JBOD
Intel Gigabit ET Dual port PCIe x4
Redundant Power Supply
Hadoop CDH3, max map tasks 24, max reduce tasks 8
Name Node and Secondary Name Node are similar but with 96GB RAM (not sure why) and 6 x 600GB 15K RPM Serial SCSI in RAID10.
Another config is here, page 298: http://books.google.com/books?id=Wu_xeGdU4G8Cpg=PA298lpg=PA298dq=hadoop+jbodsource=blots=i7xVQBPb_wsig=8mhq-MtpkRcTiRB1ioKciMxIasghl=ensa=Xei=AGtqUMK6D8T10gHD4ICQAQved=0CEMQ6AEwAg#v=onepageq=hadoop%20jbodf=false
You probably need just 1 computer with 10 x 2 TB SATA HDD. On Mon, Oct 1, 2012 at 6:02 PM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi, We are at a very early stage of our hadoop project and want to do a POC. We have ~ 5-6 terabytes of raw data and we are going to execute some aggregations. We plan to use 8 - 10 machines. Questions: 1) Which hardware should we use: a) How many discs, and which discs are better to use? b) How much RAM? c) How many CPUs? 2) Please share best practices and tips / tricks related to utilising hardware for hadoop projects. Thanks in advance Oleg.
Re: More cores Vs More Nodes ?
more nodes means more IO on read on mapper step If you use combiners you might need to send only small amount of data over network to reducers Alexander On Tue, Dec 13, 2011 at 12:45 PM, real great.. greatness.hardn...@gmail.com wrote: more cores might help in hadoop environments as there would be more data locality. your thoughts? On Tue, Dec 13, 2011 at 11:11 PM, Brad Sarsfield b...@bing.com wrote: Praveenesh, Your question is not naïve; in fact, optimal hardware design can ultimately be a very difficult question to answer on what would be better. If you made me pick one without much information I'd go for more machines. But... It all depends; and there is no right answer :) More machines +May run your workload faster +Will give you a higher degree of reliability protection from node / hardware / hard drive failure. +More aggregate IO capabilities - capex / opex may be higher than allocating more cores More cores +May run your workload faster +More cores may allow for more tasks to run on the same machine +More cores/tasks may reduce network contention and increase increasing task to task data flow performance. Notice May run your workload faster is in both; as it can be very workload dependant. My Experience: I did a recent experiment and found that given the same number of cores (64) with the exact same network / machine configuration; A: I had 8 machines with 8 cores B: I had 28 machines with 2 cores (and 1x8 core head node) B was able to outperform A by 2x using teragen and terasort. These machines were running in a virtualized environment; where some of the IO capabilities behind the scenes were being regulated to 400Mbps per node when running in the 2 core configuration vs 1Gbps on the 8 core. So I would expect the non-throttled scenario to work even better. ~Brad -Original Message- From: praveenesh kumar [mailto:praveen...@gmail.com] Sent: Monday, December 12, 2011 8:51 PM To: common-user@hadoop.apache.org Subject: More cores Vs More Nodes ? Hey Guys, So I have a very naive question in my mind regarding Hadoop cluster nodes ? more cores or more nodes - Shall I spend money on going from 2-4 core machines, or spend money on buying more nodes less core eg. say 2 machines of 2 cores for example? Thanks, Praveenesh -- Regards, R.V.
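To make the combiner point concrete: a combiner is just a reducer class run on each mapper's output before the shuffle. A driver excerpt using the newer mapreduce API; the class names are the stock WordCount example's and only work for associative aggregations like sums:

Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "wordcount");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);   // pre-aggregates per mapper, shrinking shuffle traffic
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

How much this helps the cores-vs-nodes trade-off depends on how well the map output collapses; for word-count-style aggregations it can cut shuffle volume by orders of magnitude.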
What's the diff btw setOutputKeyComparatorClass and setOutputValueGroupingComparator?
I tried to use one or the other for secondary sort -- both options work fine -- I get a combined sorted result in the reduce() iterator. Also I noticed that if I set both of them at the same time then KeyComparatorClass.compare(O1, O2) is never called; hadoop calls only ValueGroupingComparator.compare(). I run my tests on a single node installation. Please help me understand the diff btw these two comparators.
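A sketch of how the two are usually wired up for secondary sort with the old mapred API; the comparator and partitioner class names here are hypothetical:

// key is a composite of (naturalKey, secondaryField)
JobConf conf = new JobConf(MyJob.class);
// sort comparator: orders keys within each partition by naturalKey, then secondaryField,
// so values arrive at the reducer already sorted
conf.setOutputKeyComparatorClass(CompositeKeySortComparator.class);
// grouping comparator: compares only the naturalKey, so all records that share it
// are fed to a single reduce() call through one iterator
conf.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class);
// partition on the naturalKey too, so all its records land on the same reducer
conf.setPartitionerClass(NaturalKeyPartitioner.class);

So the sort comparator controls the order in which records are fed to the reducer, and the grouping comparator controls where one reduce() call ends and the next begins; if both classes end up doing the same full-key comparison, the visible output is identical, which would explain why either option alone looked the same in a small single-node test.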