Re: hdfs dfsclient, possible to force storage datanode ?
Hadoop 2.5

On Thursday, August 21, 2014, norbi no...@rocknob.de wrote:
hadoop 2.0 (cloudera cdh 4.7)

On 21.08.2014 at 16:23, Liu, Yi A wrote:
Which version are you using?
Regards, Yi Liu

-----Original Message-----
From: norbi [mailto:no...@rocknob.de]
Sent: Wednesday, August 20, 2014 10:14 PM
To: user@hadoop.apache.org
Subject: hdfs dfsclient, possible to force storage datanode ?

Hi list, we have 52 DNs and several hundred clients that store and read data from HDFS. One rack has 3 DNs and about 15 clients. Is it possible to force (if space is available) these 15 clients to prefer the 3 DNs located in their own rack when storing and reading data? racklocation.conf with org.apache.hadoop.net.NetworkTopology is already in use, but this does not help in this case. Thanks

-- Tirru
WebHdfs config problem
Hi all, I've installed HDP 2.1 on CentOS 6.5, but I'm having a problem with WebHDFS. When I try to use the file browser or design an Oozie workflow in Hue, I get a WebHdfs error. Attached is the error for the file browser. It appears to be some kind of permissions error, but I have HDFS security turned off, and WebHDFS is enabled. I've followed all the Hue setup instructions I can find and made sure all the properties are set correctly. Can anyone shed some light? Thanks, Charles

WebHdfsException at /filebrowser/
HTTPConnectionPool(host='localhost', port=50070): Max retries exceeded with url: /webhdfs/v1/user/admin?op=GETFILESTATUS&user.name=hue&doas=admin (Caused by class 'socket.error': [Errno 111] Connection refused)
Request Method: GET
Request URL: http://[MyIP]:8000/filebrowser/
Django Version: 1.2.3
Exception Type: WebHdfsException
Exception Value: HTTPConnectionPool(host='localhost', port=50070): Max retries exceeded with url: /webhdfs/v1/user/admin?op=GETFILESTATUS&user.name=hue&doas=admin (Caused by class 'socket.error': [Errno 111] Connection refused)
Exception Location: /usr/lib/hue/desktop/libs/hadoop/src/hadoop/fs/webhdfs.py in _stats, line 209
Python Executable: /usr/bin/python2.6
Python Version: 2.6.6
Python Path: [listing of the Hue virtualenv site-packages eggs, truncated in the original]
job.getCounters returns null in Yarn-based job
Hello. I am trying to access custom counters that I have created in a MapReduce job on YARN. After the job.waitForCompletion(true) call, I try to do job.getCounters(), but I get a null. This only happens if I run a heavy job, meaning a) a lot of data and b) a lot of reducers. E.g. for 10 million records with 20 reducers on a 10-node cluster it works, but on 60 million records with 70 reducers on a 10-node cluster it doesn't. The job itself completes successfully.

I did see the following related JIRAs. But the first one is for an old, pre-YARN version, and I think those properties are not valid anymore. The second one does not seem to provide a solution? I tried using the suggested trick on the client side but had no success there either.
MAPREDUCE-1920 https://issues.apache.org/jira/browse/MAPREDUCE-1920
MAPREDUCE-4442 https://issues.apache.org/jira/browse/MAPREDUCE-4442

Please advise: how can I retrieve my custom counters after job completion? Am I missing something? Do I need to configure some job history stuff? I do see a mention of ATS as well, but I don't know how much that is applicable here. Thanks a lot.

My version is: 2.3.0-cdh5.1.0

Regards, Shahab
Hadoop 2.5.0 - HDFS browser-based file view
All, I noticed that on Hadoop 2.5.0, when browsing the HDFS filesystem on port 50070, you can't view a file in the browser. Clicking a file gives a little popup with metadata and a download link. Can HDFS be configured to show plaintext file contents in the browser? Thanks, Brian
Appending to HDFS file
Hello, I am currently using Hadoop 2.4.1. I am running an MR job using the hadoop streaming utility. The executable needs to write a large amount of information to a file. However, this write is not done in a single attempt; the file needs to be appended with the streams of information generated. In the code, inside a loop, I open a file in HDFS and append some information. This is not working and I see only the last write. How do I accomplish an append operation in Hadoop? Can anyone share a pointer?

regards
Bala
Basic Hadoop 2.3 32-bit VM for general Hadoop Users
We have released a very basic 32-bit VM (VirtualBox image) for users who want to get started with Hadoop without worrying about configuration and dependencies. We have used CDH 5.1 for this release, which contains Hadoop 2.3 (YARN), Pig 0.12, Hive 0.12 and Sqoop 1.4.4, along with MySQL, at a download size of 814 MB. We have also packaged a simple use case of Wiki page-hits analysis, which is explained in our blog at www.lighthadoop.com.

This is a genuine effort by our freelancing big data enthusiasts to help speed up adoption of Hadoop and its ecosystem, especially for students, by letting them get started with the latest Hadoop, Pig and Hive while reducing the time and effort of installing and configuring the system and keeping the hardware requirements low. The motivation behind this VM is that there are users who own 32-bit systems (which can address 4 GB of RAM, enough for a basic Hadoop setup) and still want to try the latest stable Hadoop, enabling them to work through a use case without needing to buy a new PC/laptop with a large amount of RAM.

Kindly send your feedback/suggestions to supp...@lighthadoop.com. All suggestions welcome! Suggestions make us grow, thus serving more of the open-source community!

Thanks!
LightHadoop Team
Issues installing Cloudera Manager 5.1.1 on Amazon EC2 - Cloud Express Wizard
Hi everyone,

*Problem*
I am having some trouble spinning up additional instances on Amazon using Cloudera Express / Cloudera Manager 5.1.1. I am able to install Cloudera Manager on the host machine through the Cloudera installation wizard, but I cannot spin up additional machines due to an authorization issue which appears to be invalid. *Note: my credentials work when installing an older version, Cloudera Manager 4.8.4.*

*Error occurs when*
When I try to spin up additional instances (step 3 of the Cloud Express Wizard) on Amazon, I face an authorization issue (my account works) and I get an error when trying to test my credentials.

*Error Message*
Guice creation errors:
1) org.jclouds.rest.RestContext cannot be used as a key; It is not fully specified.
1 error

*Environment*
Ubuntu 12.04
Amazon m3.medium
Cloudera Express 5.1.1

*Research of the issue*
http://community.cloudera.com/t5/Cloudera-Manager-Installation/EC2-CDM-Installation-Fail/td-p/16160

*My Effort*
I installed Java 7u45 and rebooted the machine as some forums suggest, but that has not fixed the issue. I also installed an older version, Cloudera Manager 4.8.4, and was able to get past the authorization issue with Amazon using the Cloudera install wizard; contact with Amazon was successful and the instances were created.

Any suggestions? Thanks for any information you can provide,
Adam
Job keeps running in LocalJobRunner under Cloudera 5.1
Need some quick help. Our job runs fine under MapR, but when we start the same job on Cloudera 5.1, it keeps running in local mode. I am sure this is some kind of configuration issue. Any quick tips?

14/08/22 12:16:58 INFO mapreduce.Job: map 0% reduce 0%
14/08/22 12:17:03 INFO mapred.LocalJobRunner: map > map
14/08/22 12:17:06 INFO mapred.LocalJobRunner: map > map
14/08/22 12:17:09 INFO mapred.LocalJobRunner: map > map

Thanks.
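For illustration only (the thread itself does not confirm the cause): a frequent reason for a job dropping into LocalJobRunner is that the submitting client's configuration does not set mapreduce.framework.name to yarn, e.g. because the cluster's mapred-site.xml/yarn-site.xml are not on the client classpath. A minimal Java sketch of that check, with a placeholder ResourceManager host:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FrameworkCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // If this prints "local", the client is not picking up the cluster's
        // mapred-site.xml/yarn-site.xml, so the job runs in LocalJobRunner.
        System.out.println(conf.get("mapreduce.framework.name", "local"));

        // Illustrative fix (host name is a placeholder): force YARN submission.
        conf.set("mapreduce.framework.name", "yarn");
        conf.set("yarn.resourcemanager.address", "resourcemanager-host:8032");

        Job job = Job.getInstance(conf, "framework-check");
        // ... set jar, mapper, reducer and input/output paths as usual ...
    }
}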
How to serialize very large object in Hadoop Writable?
The Hadoop Writable interface relies on the public void write(DataOutput out) method. It looks like behind the DataOutput interface, Hadoop uses DataOutputStream, which uses a simple array under the covers. When I try to write a lot of data to the DataOutput in my reducer, I get:

Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:3230)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)

It looks like the system is unable to allocate a contiguous array of the requested size. Apparently, increasing the heap size available to the reducer does not help - it is already at 84 GB (-Xmx84G). If I cannot reduce the size of the object that I need to serialize (as the reducer constructs this object by combining the object data), what should I try to work around this problem?

Thanks, Yuriy
Re: How to serialize very large object in Hadoop Writable?
The maximum array size is the maximum integer value, so a byte array cannot be bigger than 2 GB.

On Aug 22, 2014 1:41 PM, Yuriy yuriythe...@gmail.com wrote:
The Hadoop Writable interface relies on the public void write(DataOutput out) method. It looks like behind the DataOutput interface, Hadoop uses DataOutputStream, which uses a simple array under the covers. When I try to write a lot of data to the DataOutput in my reducer, I get:

Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:3230)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)

It looks like the system is unable to allocate a contiguous array of the requested size. Apparently, increasing the heap size available to the reducer does not help - it is already at 84 GB (-Xmx84G). If I cannot reduce the size of the object that I need to serialize (as the reducer constructs this object by combining the object data), what should I try to work around this problem?

Thanks, Yuriy
Re: How to serialize very large object in Hadoop Writable?
Thank you, Alexander. That, at least, explains the problem. And what should be the workaround if the combined set of data is larger than 2 GB?

On Fri, Aug 22, 2014 at 1:50 PM, Alexander Pivovarov apivova...@gmail.com wrote:
The maximum array size is the maximum integer value, so a byte array cannot be bigger than 2 GB.

On Aug 22, 2014 1:41 PM, Yuriy yuriythe...@gmail.com wrote:
The Hadoop Writable interface relies on the public void write(DataOutput out) method. It looks like behind the DataOutput interface, Hadoop uses DataOutputStream, which uses a simple array under the covers. When I try to write a lot of data to the DataOutput in my reducer, I get:

Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:3230)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)

It looks like the system is unable to allocate a contiguous array of the requested size. Apparently, increasing the heap size available to the reducer does not help - it is already at 84 GB (-Xmx84G). If I cannot reduce the size of the object that I need to serialize (as the reducer constructs this object by combining the object data), what should I try to work around this problem?

Thanks, Yuriy
Re: How to serialize very large object in Hadoop Writable?
Usually Hadoop MapReduce deals with row-based data (ReduceContext<KEYIN,VALUEIN,KEYOUT,VALUEOUT>). If you need to write a lot of data to an HDFS file, you can get an OutputStream to the HDFS file and write the bytes directly.

On Fri, Aug 22, 2014 at 3:30 PM, Yuriy yuriythe...@gmail.com wrote:
Thank you, Alexander. That, at least, explains the problem. And what should be the workaround if the combined set of data is larger than 2 GB?

On Fri, Aug 22, 2014 at 1:50 PM, Alexander Pivovarov apivova...@gmail.com wrote:
The maximum array size is the maximum integer value, so a byte array cannot be bigger than 2 GB.

On Aug 22, 2014 1:41 PM, Yuriy yuriythe...@gmail.com wrote:
The Hadoop Writable interface relies on the public void write(DataOutput out) method. It looks like behind the DataOutput interface, Hadoop uses DataOutputStream, which uses a simple array under the covers. When I try to write a lot of data to the DataOutput in my reducer, I get:

Caused by: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:3230)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)

It looks like the system is unable to allocate a contiguous array of the requested size. Apparently, increasing the heap size available to the reducer does not help - it is already at 84 GB (-Xmx84G). If I cannot reduce the size of the object that I need to serialize (as the reducer constructs this object by combining the object data), what should I try to work around this problem?

Thanks, Yuriy
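A minimal sketch of that suggestion - streaming the combined data to an HDFS side file from the reducer instead of buffering it in one giant Writable. The output path and the per-key layout are hypothetical, not taken from the thread:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LargeObjectReducer extends Reducer<Text, Text, NullWritable, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical output location; one side file per key to avoid collisions.
        Path out = new Path("/user/yuriy/large-objects/" + key.toString());

        // Stream the combined data to HDFS instead of buffering it in a single
        // byte[] behind a Writable (which is capped at ~2 GB).
        try (FSDataOutputStream os = fs.create(out, true)) {
            for (Text value : values) {
                os.write(value.copyBytes());   // write each chunk as it arrives
                os.write('\n');
            }
        }
        // Nothing is emitted through the normal MapReduce output path here.
    }
}

Side files written from tasks need some care: a failed or speculative task attempt can leave duplicates behind, so paths that include the task attempt ID are usually safer.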
Re: job.getCounters returns null in Yarn-based job
For those who are interested, this got resolved. The issue was that I was creating more counters than what was configured in the settings. I upped the mapreduce.job.counters.max property to a larger number; the default was 120. The job finishes now and I am able to print and get counters as well. One minor issue: the job history UI now does not show the history, giving an error message about the max counter limit.

Regards, Shahab

On Fri, Aug 22, 2014 at 7:59 AM, Shahab Yunus shahab.yu...@gmail.com wrote:
Hello. I am trying to access custom counters that I have created in a MapReduce job on YARN. After the job.waitForCompletion(true) call, I try to do job.getCounters(), but I get a null. This only happens if I run a heavy job, meaning a) a lot of data and b) a lot of reducers. E.g. for 10 million records with 20 reducers on a 10-node cluster it works, but on 60 million records with 70 reducers on a 10-node cluster it doesn't. The job itself completes successfully.

I did see the following related JIRAs. But the first one is for an old, pre-YARN version, and I think those properties are not valid anymore. The second one does not seem to provide a solution? I tried using the suggested trick on the client side but had no success there either.
MAPREDUCE-1920 https://issues.apache.org/jira/browse/MAPREDUCE-1920
MAPREDUCE-4442 https://issues.apache.org/jira/browse/MAPREDUCE-4442

Please advise: how can I retrieve my custom counters after job completion? Am I missing something? Do I need to configure some job history stuff? I do see a mention of ATS as well, but I don't know how much that is applicable here. Thanks a lot.

My version is: 2.3.0-cdh5.1.0

Regards, Shahab
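For reference, a minimal sketch of the fix described above - raising the limit on the job Configuration before submission. The value 500 and the counter group/name are arbitrary examples; depending on the setup, the same property may also need to be raised in the cluster/history-server configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;

public class CounterLimitExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Raise the per-job counter limit (default 120) before creating the Job.
        conf.setInt("mapreduce.job.counters.max", 500);

        Job job = Job.getInstance(conf, "counter-limit-example");
        // ... configure mapper, reducer, input/output as usual ...

        if (job.waitForCompletion(true)) {
            Counters counters = job.getCounters();
            // Hypothetical custom counter group/name for illustration.
            long n = counters.findCounter("MyCounters", "RECORDS_SEEN").getValue();
            System.out.println("RECORDS_SEEN = " + n);
        }
    }
}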
Re: hdfs dfsclient, possible to force storage datanode ?
I can't find any information about that in the 2.5 documentation or changelog?

On 22.08.2014 at 09:27, Tirupati Reddy wrote:
Hadoop 2.5

On Thursday, August 21, 2014, norbi no...@rocknob.de wrote:
hadoop 2.0 (cloudera cdh 4.7)

On 21.08.2014 at 16:23, Liu, Yi A wrote:
Which version are you using?
Regards, Yi Liu

-----Original Message-----
From: norbi [mailto:no...@rocknob.de]
Sent: Wednesday, August 20, 2014 10:14 PM
To: user@hadoop.apache.org
Subject: hdfs dfsclient, possible to force storage datanode ?

Hi list, we have 52 DNs and several hundred clients that store and read data from HDFS. One rack has 3 DNs and about 15 clients. Is it possible to force (if space is available) these 15 clients to prefer the 3 DNs located in their own rack when storing and reading data? racklocation.conf with org.apache.hadoop.net.NetworkTopology is already in use, but this does not help in this case. Thanks

-- Tirru
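Not an answer from the thread, but possibly relevant: newer 2.x client releases expose a "favored nodes" hint on DistributedFileSystem.create(), which asks the NameNode to place replicas on the given DataNodes when space allows. A rough sketch, with placeholder host names, ports and path:

import java.net.InetSocketAddress;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream;
import org.apache.hadoop.util.Progressable;

public class FavoredNodesWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        if (!(fs instanceof DistributedFileSystem)) {
            throw new IllegalStateException("Not an HDFS filesystem");
        }
        DistributedFileSystem dfs = (DistributedFileSystem) fs;

        // Placeholder addresses for the three DataNodes in the clients' rack.
        InetSocketAddress[] favored = new InetSocketAddress[] {
            new InetSocketAddress("dn-rack1-01", 50010),
            new InetSocketAddress("dn-rack1-02", 50010),
            new InetSocketAddress("dn-rack1-03", 50010)
        };

        Path file = new Path("/data/rack-local/example.dat");
        try (HdfsDataOutputStream out = dfs.create(
                file,
                FsPermission.getFileDefault(),
                true,                                       // overwrite
                conf.getInt("io.file.buffer.size", 4096),
                (short) 3,                                  // replication
                dfs.getDefaultBlockSize(file),
                (Progressable) null,
                favored)) {
            out.writeBytes("example payload\n");
        }
    }
}

The hint is best effort: the NameNode can still pick other nodes if the favored ones are full or unavailable, and reads will naturally prefer local replicas once the blocks are on the rack-local DataNodes.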
Re: Appending to HDFS file
What is the value of dfs.support.append in hdfs-site.xml?
https://hadoop.apache.org/docs/r2.3.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml

On Sat, Aug 23, 2014 at 1:41 AM, rab ra rab...@gmail.com wrote:
Hello, I am currently using Hadoop 2.4.1. I am running an MR job using the hadoop streaming utility. The executable needs to write a large amount of information to a file. However, this write is not done in a single attempt; the file needs to be appended with the streams of information generated. In the code, inside a loop, I open a file in HDFS and append some information. This is not working and I see only the last write. How do I accomplish an append operation in Hadoop? Can anyone share a pointer?

regards
Bala
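For reference, a minimal sketch of appending via the Java FileSystem API, assuming append support is enabled on the cluster; the path and loop are placeholders, not taken from the thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/bala/output/log.txt");   // placeholder path

        // Create the file on the first write, append on subsequent writes.
        for (int i = 0; i < 3; i++) {
            try (FSDataOutputStream out =
                     fs.exists(file) ? fs.append(file) : fs.create(file)) {
                out.writeBytes("chunk " + i + "\n");
            }
        }
    }
}

Note that calling create() on each pass overwrites the file, which would produce exactly the "only the last write survives" behaviour described above; the exists()/append() pattern avoids that.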