Re: hdfs dfsclient, possible to force storage datanode ?

2014-08-22 Thread Tirupati Reddy
Hadoop 2.5

On Thursday, August 21, 2014, norbi no...@rocknob.de wrote:

 hadoop 2.0 (cloudera cdh 4.7)

 On 21.08.2014 at 16:23, Liu, Yi A wrote:

 Which version are you using?

 Regards,
 Yi Liu


 -Original Message-
 From: norbi [mailto:no...@rocknob.de]
 Sent: Wednesday, August 20, 2014 10:14 PM
 To: user@hadoop.apache.org
 Subject: hdfs dfsclient, possible to force storage datanode ?

 hi list,

 we have 52 DNs and several hundred clients that store and read data from
 HDFS. One rack has 3 DNs and about 15 clients.

 Is it possible to force (if space is available) these 15 clients to
 prefer the 3 DNs located in their own rack when storing and reading data?

 A racklocation.conf with org.apache.hadoop.net.NetworkTopology is
 already in use, but it does not help in this case.
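
 For reference, a table-based mapping of this kind is wired up roughly as
 follows. The property and class names are the standard rack-awareness ones;
 the host names and file path are illustrative, not this cluster's. One thing
 worth noting is that read-side replica ordering is based on the client's own
 position in the topology, so the client hosts generally have to resolve to a
 rack as well, and the default placement policy only favors the writer's own
 machine when the client runs on a datanode.

    <!-- core-site.xml -->
    <property>
      <name>net.topology.node.switch.mapping.impl</name>
      <value>org.apache.hadoop.net.TableMapping</value>
    </property>
    <property>
      <name>net.topology.table.file.name</name>
      <value>/etc/hadoop/conf/racklocation.conf</value>
    </property>

    # racklocation.conf: host -> rack, including the client machines
    dn01.example.com       /rack1
    dn02.example.com       /rack1
    dn03.example.com       /rack1
    client01.example.com   /rack1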

 thanks




-- 
Tirru


WebHdfs config problem

2014-08-22 Thread Charles Robertson
Hi all,

I've installed HDP 2.1 on CentOS 6.5, but I'm having a problem with
WebHDFS. When I try to use the file browser or design an Oozie workflow in
Hue, I get a WebHdfs error. Attached is the error for the file browser.

It appears to be some kind of permissions error, but I have HDFS security
turned off and WebHDFS is enabled.

I've followed all the Hue setup instructions I can find and made sure all
the properties are set correctly.

Can anyone shed some light?

Thanks,
Charles
WebHdfsException at /filebrowser/
HTTPConnectionPool(host='localhost', port=50070): Max retries exceeded with 
url: /webhdfs/v1/user/admin?op=GETFILESTATUS&user.name=hue&doas=admin (Caused 
by class 'socket.error': [Errno 111] Connection refused)
Request Method: GET
Request URL:http://[MyIP]:8000/filebrowser/
Django Version: 1.2.3
Exception Type: WebHdfsException
Exception Value:
HTTPConnectionPool(host='localhost', port=50070): Max retries exceeded with 
url: /webhdfs/v1/user/admin?op=GETFILESTATUS&user.name=hue&doas=admin (Caused 
by class 'socket.error': [Errno 111] Connection refused)
Exception Location: 
/usr/lib/hue/desktop/libs/hadoop/src/hadoop/fs/webhdfs.py in _stats, line 209
Python Executable:  /usr/bin/python2.6
Python Version: 2.6.6
Python Path:
['/usr/lib/hue/build/env/lib/python2.6/site-packages/...' - long listing of Hue's bundled Python eggs (Django 1.2.3, requests 2.2.1, thrift 0.9.0, etc.), truncated in the original message]
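
The "Connection refused" against localhost:50070 usually means Hue's WebHDFS
URL still points at localhost rather than at the NameNode host. A sketch of
the settings to check (host name is a placeholder; on HDP these live in Hue's
hue.ini and the cluster's core-site.xml):

    # hue.ini
    [hadoop]
      [[hdfs_clusters]]
        [[[default]]]
          fs_defaultfs=hdfs://namenode.example.com:8020
          webhdfs_url=http://namenode.example.com:50070/webhdfs/v1

    <!-- core-site.xml: allow the hue user to impersonate end users -->
    <property>
      <name>hadoop.proxyuser.hue.hosts</name>
      <value>*</value>
    </property>
    <property>
      <name>hadoop.proxyuser.hue.groups</name>
      <value>*</value>
    </property>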

job.getCounters returns null in Yarn-based job

2014-08-22 Thread Shahab Yunus
Hello.

I am trying to access custom counters that I have created in a MapReduce
job on YARN.

After the job.waitForCompletion(true) call, I call job.getCounters(), but I
get null.
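
For context, the retrieval pattern in question is roughly the following; the
counter enum and job setup are illustrative, not the actual code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;

    public class CounterRetrieval {
      // hypothetical custom counter
      enum MyCounters { RECORDS_SKIPPED }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "counter-retrieval");
        // ... mapper/reducer/input/output configuration elided ...
        boolean ok = job.waitForCompletion(true);
        Counters counters = job.getCounters();  // this is what comes back null on the large job
        if (counters != null) {
          System.out.println("skipped = "
              + counters.findCounter(MyCounters.RECORDS_SKIPPED).getValue()
              + ", success = " + ok);
        }
      }
    }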

This only happens when I run a heavy job, meaning (a) a lot of data and (b) a
lot of reducers.

E.g., for 10 million records with 20 reducers on a 10-node cluster it works,
but with 60 million records and 70 reducers on a 10-node cluster it doesn't.

The job itself completes successfully.

I did see the following related JIRAs, but the first one is for an old,
pre-YARN version, and I think those properties are no longer valid.

The second one does not seem to provide a solution. I tried the suggested
trick on the client side, but no success there either.

MAPREDUCE-1920 https://issues.apache.org/jira/browse/MAPREDUCE-1920
MAPREDUCE-4442 https://issues.apache.org/jira/browse/MAPREDUCE-4442

Please advise: how can I retrieve my custom counters after the job's
completion? Am I missing something? Do I need to configure something on the
job history side? I do see mentions of the ATS as well, but I don't know how
applicable that is here.

Thanks a lot.

My version is: 2.3.0-cdh5.1.0

Regards,
Shahab


Hadoop 2.5.0 - HDFS browser-based file view

2014-08-22 Thread Brian C. Huffman

All,

I noticed that on Hadoop 2.5.0, when browsing the HDFS filesystem 
on port 50070, you can't view a file in the browser. Clicking a file 
gives a little popup with metadata and a download link. Can HDFS be 
configured to show plaintext file contents in the browser?
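
As a possible workaround rather than a browser setting, file contents can be
pulled over WebHDFS directly; the NameNode redirects the read to a DataNode,
so -L is needed (host and path below are placeholders):

    curl -L "http://namenode.example.com:50070/webhdfs/v1/user/brian/sample.txt?op=OPEN"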


Thanks,
Brian



Appending to HDFS file

2014-08-22 Thread rab ra
Hello,

I am currently using Hadoop 2.4.1. I am running an MR job using the Hadoop
streaming utility.

The executable needs to write a large amount of information to a file.
However, this write is not done in a single attempt; the file needs to be
appended with streams of information as they are generated.

In the code, inside a loop, I open a file in HDFS and append some
information. This is not working, and I see only the last write.

How do I accomplish an append operation in Hadoop? Can anyone share a
pointer?
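
For reference, a minimal sketch of the append pattern for a writer that uses
the HDFS Java API (the path is illustrative; note that FileSystem.create()
with default options overwrites the file, which would match the "only the
last write survives" symptom, whereas append() extends it):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsAppendSketch {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/bala/stream-output.txt");  // illustrative path

        for (int i = 0; i < 3; i++) {
          byte[] chunk = ("chunk " + i + "\n").getBytes("UTF-8");
          // create the file on the first pass, append on later passes
          try (FSDataOutputStream out = fs.exists(file) ? fs.append(file) : fs.create(file)) {
            out.write(chunk);
            out.hsync();  // flush to the DataNode pipeline before the next iteration
          }
        }
        fs.close();
      }
    }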




regards
Bala


Basic Hadoop 2.3 32-bit VM for general Hadoop Users

2014-08-22 Thread Support Team
We have released a very basic 32-bit VM (VirtualBox image) for users who want 
to get started with Hadoop without worrying about configuration and 
dependencies. 

We have used CDH 5.1 for this release, which contains Hadoop 2.3 (YARN), Pig 
0.12, Hive 0.12, and Sqoop 1.4.4, along with MySQL, at a download size of 814 MB.

We have also packaged a simple use case of Wiki page-hits analysis, which is 
explained on our blog at www.lighthadoop.com.

This is a genuine effort by our freelancing big data enthusiasts to help speed 
up adoption of Hadoop and its ecosystem, especially for students, by making it 
easy to get started with the latest Hadoop, Pig, and Hive. It reduces the time 
and effort spent installing and configuring the system while keeping the 
hardware requirements low.

The motivation behind this VM is that there are users who own 32-bit systems 
(which can address 4 GB of RAM, enough for a basic Hadoop setup) and still want 
to try the latest stable Hadoop. It enables them to work through a use case 
without needing to buy a new PC/laptop with a large amount of RAM.

Kindly send your feedback/suggestions to supp...@lighthadoop.com .

All suggestions are welcome! Suggestions help us grow and let us serve the 
open-source community better!

Thanks!
LightHadoop Team




Issues installing Cloudera Manager 5.1.1 on Amazon EC2 - Cloud Express Wizard

2014-08-22 Thread Adam Pritchard
Hi everyone,

*Problem*
I am having some trouble spinning up additional instances on Amazon using
Cloudera Express / Cloudera Manager 5.1.1.

I am able to install Cloudera manager on the Host machine through the
Cloudera installation wizard.

But I cannot spin up additional machines, due to an authorization error that
appears to be spurious. *Note: my credentials work when installing the older
Cloudera Manager 4.8.4.*


*Error occurs when*
When I try to spin up additional instances on Amazon (step 3 of the Cloud
Express Wizard), I face an authorization issue (even though my account works)
and get an error when trying to test my credentials.


*Error Message*
Guice creation errors: 1) org.jclouds.rest.RestContext cannot be used as a
key; It is not fully specified. 1 error

*Environment*
Ubuntu 12.04
Amazon m3.medium
Cloudera Express 5.1.1

*Research of the issue*
http://community.cloudera.com/t5/Cloudera-Manager-Installation/EC2-CDM-Installation-Fail/td-p/16160

*My Effort*
I installed Java 7u45 and rebooted the machine, as some forums suggest, but
that has not fixed the issue.

I also installed the older Cloudera Manager 4.8.4 and was able to get past the
authorization issue with Amazon using the Cloudera install wizard; contact with
Amazon was successful and the instances were created.


Any suggestions?


Thanks for any information you can provide,

Adam


Job keeps running in LocalJobRunner under Cloudera 5.1

2014-08-22 Thread Something Something
Need some quick help.  Our job runs fine under MapR, but when we start the
same job on Cloudera 5.1, it keeps running in Local mode.

I am sure this is some kind of configuration issue. Any quick tips?

14/08/22 12:16:58 INFO mapreduce.Job: map 0% reduce 0%
14/08/22 12:17:03 INFO mapred.LocalJobRunner: map  map
14/08/22 12:17:06 INFO mapred.LocalJobRunner: map  map
14/08/22 12:17:09 INFO mapred.LocalJobRunner: map  map
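
One common cause when a job silently falls back to LocalJobRunner is that the
client-side configuration never selects YARN, so mapreduce.framework.name stays
at its default of local. This is a general pointer, not a confirmed diagnosis
for this cluster; the client's mapred-site.xml (or the job Configuration) would
need something like:

    <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
    </property>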


Thanks.


How to serialize very large object in Hadoop Writable?

2014-08-22 Thread Yuriy
The Hadoop Writable interface relies on the public void write(DataOutput out)
method. It looks like, behind the DataOutput interface, Hadoop uses a
DataOutputStream, which uses a simple byte array under the covers.

When I try to write a lot of data in DataOutput in my reducer, I get:

Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM
limit at java.util.Arrays.copyOf(Arrays.java:3230) at
java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at
java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at
java.io.DataOutputStream.write(DataOutputStream.java:107) at
java.io.FilterOutputStream.write(FilterOutputStream.java:97)

It looks like the system is unable to allocate a contiguous array of the
requested size. Apparently, increasing the heap size available to the
reducer does not help - it is already at 84 GB (-Xmx84G).

If I cannot reduce the size of the object that I need to serialize (as the
reducer constructs this object by combining the object data), what should I
try to work around this problem?

Thanks,

Yuriy


Re: How to serialize very large object in Hadoop Writable?

2014-08-22 Thread Alexander Pivovarov
The maximum array size is Integer.MAX_VALUE, so a byte array cannot be bigger than 2 GB.
On Aug 22, 2014 1:41 PM, Yuriy yuriythe...@gmail.com wrote:

 Hadoop Writable interface relies on public void write(DataOutput out) 
 method.
 It looks like behind DataOutput interface, Hadoop uses DataOutputStream,
 which uses a simple array under the cover.

 When I try to write a lot of data in DataOutput in my reducer, I get:

 Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM
 limit at java.util.Arrays.copyOf(Arrays.java:3230) at
 java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at
 java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at
 java.io.DataOutputStream.write(DataOutputStream.java:107) at
 java.io.FilterOutputStream.write(FilterOutputStream.java:97)

 Looks like the system is unable to allocate the contiguous array of the
 requested size. Apparently, increasing the heap size available to the
 reducer does not help - it is already at 84GB (-Xmx84G)

 If I cannot reduce the size of the object that I need to serialize (as the
 reducer constructs this object by combining the object data), what should I
 try to work around this problem?

 Thanks,

 Yuriy



Re: How to serialize very large object in Hadoop Writable?

2014-08-22 Thread Yuriy
Thank you, Alexander. That, at least, explains the problem. And what should
be the workaround if the combined set of data is larger than 2 GB?


On Fri, Aug 22, 2014 at 1:50 PM, Alexander Pivovarov apivova...@gmail.com
wrote:

 Max array size is max integer. So, byte array can not be bigger than 2GB
 On Aug 22, 2014 1:41 PM, Yuriy yuriythe...@gmail.com wrote:

  Hadoop Writable interface relies on public void write(DataOutput out) 
 method.
 It looks like behind DataOutput interface, Hadoop uses DataOutputStream,
 which uses a simple array under the cover.

 When I try to write a lot of data in DataOutput in my reducer, I get:

 Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM
 limit at java.util.Arrays.copyOf(Arrays.java:3230) at
 java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at
 java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at
 java.io.DataOutputStream.write(DataOutputStream.java:107) at
 java.io.FilterOutputStream.write(FilterOutputStream.java:97)

 Looks like the system is unable to allocate the contiguous array of the
 requested size. Apparently, increasing the heap size available to the
 reducer does not help - it is already at 84GB (-Xmx84G)

 If I cannot reduce the size of the object that I need to serialize (as
 the reducer constructs this object by combining the object data), what
 should I try to work around this problem?

 Thanks,

 Yuriy




Re: How to serialize very large object in Hadoop Writable?

2014-08-22 Thread Alexander Pivovarov
Usually Hadoop MapReduce deals with row-based data
(ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>).

If you need to write a lot of data to an HDFS file, you can get an OutputStream
to the HDFS file and write the bytes directly.
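
A rough sketch of that approach inside a reducer; the output path, key/value
types, and the idea of writing each incoming piece as it arrives are
illustrative:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class LargeObjectReducer extends Reducer<Text, Text, NullWritable, NullWritable> {
      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path out = new Path("/user/yuriy/" + key.toString() + ".bin");  // illustrative path
        try (FSDataOutputStream stream = fs.create(out, true)) {
          for (Text value : values) {
            // write each piece as it arrives instead of building one giant byte[]
            stream.write(value.copyBytes());
          }
        }
      }
    }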


On Fri, Aug 22, 2014 at 3:30 PM, Yuriy yuriythe...@gmail.com wrote:

 Thank you, Alexander. That, at least, explains the problem. And what
 should be the workaround if the combined set of data is larger than 2 GB?


 On Fri, Aug 22, 2014 at 1:50 PM, Alexander Pivovarov apivova...@gmail.com
  wrote:

 Max array size is max integer. So, byte array can not be bigger than 2GB
 On Aug 22, 2014 1:41 PM, Yuriy yuriythe...@gmail.com wrote:

  Hadoop Writable interface relies on public void write(DataOutput out) 
 method.
 It looks like behind DataOutput interface, Hadoop uses DataOutputStream,
 which uses a simple array under the cover.

 When I try to write a lot of data in DataOutput in my reducer, I get:

 Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM
 limit at java.util.Arrays.copyOf(Arrays.java:3230) at
 java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113) at
 java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
 at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140) at
 java.io.DataOutputStream.write(DataOutputStream.java:107) at
 java.io.FilterOutputStream.write(FilterOutputStream.java:97)

 Looks like the system is unable to allocate the contiguous array of the
 requested size. Apparently, increasing the heap size available to the
 reducer does not help - it is already at 84GB (-Xmx84G)

 If I cannot reduce the size of the object that I need to serialize (as
 the reducer constructs this object by combining the object data), what
 should I try to work around this problem?

 Thanks,

 Yuriy





Re: job.getCounters returns null in Yarn-based job

2014-08-22 Thread Shahab Yunus
For those who are interested, this got resolved.

The issue was that I was creating more counters than the configured limit
allows. I raised the mapreduce.job.counters.max property to a larger number;
the default was 120.
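
For reference, the limit in question is the standard mapred-site.xml property;
the value below is just an example, not a recommendation:

    <property>
      <name>mapreduce.job.counters.max</name>
      <value>500</value>
    </property>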

The job finishes now and I am able to print and get counters as well.

One minor thing: the job history UI now does not show the history; it gives an
error message related to the max counter limit.

Regards,
Shahab


On Fri, Aug 22, 2014 at 7:59 AM, Shahab Yunus shahab.yu...@gmail.com
wrote:

 Hello.

 I am trying to access custom counters that I have created in a MapReduce
 job on YARN.

 After the job.waitForCompletion(true) call, I call job.getCounters(), but
 I get null.

 This only happens when I run a heavy job, meaning (a) a lot of data and
 (b) a lot of reducers.

 E.g., for 10 million records with 20 reducers on a 10-node cluster it works,
 but with 60 million records and 70 reducers on a 10-node cluster it doesn't.

 The job itself completes successfully.

 I did see the following related JIRAs, but the first one is for an old,
 pre-YARN version, and I think those properties are no longer valid.

 The second one does not seem to provide a solution. I tried the suggested
 trick on the client side, but no success there either.

 MAPREDUCE-1920 https://issues.apache.org/jira/browse/MAPREDUCE-1920
 MAPREDUCE-4442 https://issues.apache.org/jira/browse/MAPREDUCE-4442

 Please advise: how can I retrieve my custom counters after the job's
 completion? Am I missing something? Do I need to configure something on the
 job history side? I do see mentions of the ATS as well, but I don't know how
 applicable that is here.

 Thanks a lot.

 My version is: 2.3.0-cdh5.1.0

 Regards,
 Shahab



Re: hdfs dfsclient, possible to force storage datanode ?

2014-08-22 Thread norbi

I can't find any information about that in the 2.5 documentation or changelog.

On 22.08.2014 at 09:27, Tirupati Reddy wrote:

Hadoop 2.5

On Thursday, August 21, 2014, norbi no...@rocknob.de wrote:


hadoop 2.0 (cloudera cdh 4.7)

On 21.08.2014 at 16:23, Liu, Yi A wrote:

Which version are you using?

Regards,
Yi Liu


-Original Message-
From: norbi [mailto:no...@rocknob.de]
Sent: Wednesday, August 20, 2014 10:14 PM
To: user@hadoop.apache.org
Subject: hdfs dfsclient, possible to force storage datanode ?

hi list,

we have 52 DNs and several hundred clients that store and
read data from HDFS. One rack has 3 DNs and about 15 clients.

Is it possible to force (if space is available) these 15
clients to prefer the 3 DNs located in their own rack when
storing and reading data?

A racklocation.conf with org.apache.hadoop.net.NetworkTopology
is already in use, but it does not help in this case.

thanks




--
Tirru




Re: Appending to HDFS file

2014-08-22 Thread Jagat Singh
What is the value of dfs.support.append in hdfs-site.xml?

https://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
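
That is, something like the following in hdfs-site.xml; true is the documented
default on 2.x, so it is worth checking whether it has been overridden:

    <property>
      <name>dfs.support.append</name>
      <value>true</value>
    </property>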




On Sat, Aug 23, 2014 at 1:41 AM, rab ra rab...@gmail.com wrote:

 Hello,

 I am currently using Hadoop 2.4.1. I am running an MR job using the Hadoop
 streaming utility.

 The executable needs to write a large amount of information to a file.
 However, this write is not done in a single attempt; the file needs to be
 appended with streams of information as they are generated.

 In the code, inside a loop, I open a file in HDFS and append some
 information. This is not working, and I see only the last write.

 How do I accomplish an append operation in Hadoop? Can anyone share a
 pointer?




 regards
 Bala