Re: Update UGI with new tokens during the lifespan of a yarn application

2024-06-11 Thread Wei-Chiu Chuang
That sounds like what Spark did.
Take a look at this doc
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/security/README.md
The Spark AM has a Kerberos keytab and it periodically acquires a new
delegation token (the old one is ignored) to make sure it always has a
valid DT. Finally, distribute the DT to all executors.

On Tue, Jun 11, 2024 at 4:34 AM Ankur Khanna
 wrote:

> Hi experts,
>
>
>
> I have a use-case with an external session token that is short lived and
> does not renew(ie, unlike a hadoop delegation token, the expiry time is not
> updated for this token). For a long running application (longer than the
> lifespan of the external token), I want to update the UGI/Credential object
> of each and every worker container with a new token.
>
> If I understand correctly, all delegation tokens are shared at the launch
> of a container.
>
> Is there any way to update the credential object after the launch of the
> container and during the lifespan of the application?
>
>
> Best,
>
> Ankur Khanna
>
>
>
>
>


Re: Namenode Connection Refused

2023-10-24 Thread Wei-Chiu Chuang
If it's an HA cluster, is it possible the client doesn't have the proper HA
configuration so it doesn't know what host name to connect to?

Otherwise, the usual suspect is the firewall configuration between the
client and the NameNode.

On Tue, Oct 24, 2023 at 9:05 AM Harry Jamison
 wrote:

> I feel like I am doing something really dumb here, but my namenode is
> having a connection refused on port 8020.
>
> There is nothing in the logs that seems to indicate an error as far as I
> can tell
>
> ps aux shows the namenode is running
>
> root   13169   10196  9 21:18 pts/100:00:02
> /usr/lib/jvm/java-11-openjdk-amd64//bin/java -Dproc_namenode
> -Djava.net.preferIPv4Stack=true -Dhdfs.audit.logger=INFO,NullAppender
> -Dhadoop.security.logger=INFO,RFAS
> -Dyarn.log.dir=/hadoop/hadoop/hadoop-3.3.6/logs -Dyarn.log.file=hadoop.log
> -Dyarn.home.dir=/hadoop/hadoop/hadoop-3.3.6 -Dyarn.root.logger=INFO,console
> -Djava.library.path=/hadoop/hadoop/hadoop-3.3.6/lib/native
> -Dhadoop.log.dir=/hadoop/hadoop/hadoop-3.3.6/logs
> -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=/hadoop/hadoop/hadoop-3.3.6
> -Dhadoop.id.str=root -Dhadoop.root.logger=TRACE,console
> -Dhadoop.policy.file=hadoop-policy.xml
> org.apache.hadoop.hdfs.server.namenode.NameNode
>
> Netstat shows that this port is not open but others are
> root@vmnode1:/hadoop/hadoop/hadoop# netstat -tulapn|grep 802
> tcp0  0 192.168.1.159:8023  0.0.0.0:*
>  LISTEN  16347/java
> tcp0  0 192.168.1.159:8022  0.0.0.0:*
>  LISTEN  16347/java
> tcp0  0 192.168.1.159:8022  192.168.1.159:56830
>  ESTABLISHED 16347/java
> tcp0  0 192.168.1.159:56830 192.168.1.159:8022
> ESTABLISHED 13889/java
> tcp0  0 192.168.1.159:8022  192.168.1.104:58264
>  ESTABLISHED 16347/java
>
>
> From the namenode logs I see that it has 8020 as the expected port
> [2023-10-23 21:18:21,739] INFO fs.defaultFS is hdfs://vmnode1:8020/
> (org.apache.hadoop.hdfs.server.namenode.NameNodeUtils)
> [2023-10-23 21:18:21,739] INFO Clients should use vmnode1:8020 to access
> this namenode/service. (org.apache.hadoop.hdfs.server.namenode.NameNode)
>
> My datanodes seem to be connecting, because I see that information about 0
> invalid blocks in the logs
> [2023-10-24 09:03:21,255] INFO BLOCK* registerDatanode: from
> DatanodeRegistration(192.168.1.159:9866,
> datanodeUuid=fbefce35-15f7-43df-a666-ecc90f4bef0f, infoPort=9864,
> infoSecurePort=0, ipcPort=9867,
> storageInfo=lv=-57;cid=CID-0b66d2f6-6c6a-4f3f-bdb1-b1ab0c947d00;nsid=2036303633;c=1697774550786)
> storage fbefce35-15f7-43df-a666-ecc90f4bef0f
> (org.apache.hadoop.hdfs.StateChange)
> [2023-10-24 09:03:21,255] INFO Removing a node: /default-rack/
> 192.168.1.159:9866 (org.apache.hadoop.net.NetworkTopology)
> [2023-10-24 09:03:21,255] INFO Adding a new node: /default-rack/
> 192.168.1.159:9866 (org.apache.hadoop.net.NetworkTopology)
> [2023-10-24 09:03:21,281] INFO BLOCK* processReport 0x746ca82e1993dcbb
> with lease ID 0xa39c5071fd7ca21f: Processing first storage report for
> DS-ab8f27ed-6129-492c-9b8a-3800c46703fb from datanode DatanodeRegistration(
> 192.168.1.159:9866, datanodeUuid=fbefce35-15f7-43df-a666-ecc90f4bef0f,
> infoPort=9864, infoSecurePort=0, ipcPort=9867,
> storageInfo=lv=-57;cid=CID-0b66d2f6-6c6a-4f3f-bdb1-b1ab0c947d00;nsid=2036303633;c=1697774550786)
> (BlockStateChange)
> [2023-10-24 09:03:21,281] INFO BLOCK* processReport 0x746ca82e1993dcbb
> with lease ID 0xa39c5071fd7ca21f: from storage
> DS-ab8f27ed-6129-492c-9b8a-3800c46703fb node DatanodeRegistration(
> 192.168.1.159:9866, datanodeUuid=fbefce35-15f7-43df-a666-ecc90f4bef0f,
> infoPort=9864, infoSecurePort=0, ipcPort=9867,
> storageInfo=lv=-57;cid=CID-0b66d2f6-6c6a-4f3f-bdb1-b1ab0c947d00;nsid=2036303633;c=1697774550786),
> blocks: 0, hasStaleStorage: false, processing time: 0 msecs,
> invalidatedBlocks: 0 (BlockStateChange)
>
>
> Is there anything else that I should look at?
> I am not sure how to debug why it is not starting up on this port
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: user-h...@hadoop.apache.org
>
>


Fwd: Join us at the Storage User Group Meetup!

2023-10-17 Thread Wei-Chiu Chuang
-- Forwarded message -
From: Wei-Chiu Chuang 
Date: Mon, Oct 16, 2023 at 11:28 AM
Subject: Join us at the Storage User Group Meetup!
To: Hdfs-dev 


Hi

Please join us at the Storage Meetup at Cloudera's office next Wednesday.
https://www.meetup.com/futureofdata-sanfrancisco/events/295917033/

We have HDFS developers from Uber join us to talk about optimizing HDFS for
high density disks, and developers from Cloudera to talk about Apache Ozone
and Apache Iceberg.

I am being told this is an in-person event but it will be live streamed
too. Please sign up to get more details about this event.

Thanks,
Wei-Chiu


Re: Compare hadoop and ytsaurus

2023-09-28 Thread Wei-Chiu Chuang
Hey Kirill,

Thanks for sharing! I wasn't aware of this project. According to the blog
post
https://medium.com/yandex/ytsaurus-exabyte-scale-storage-and-processing-system-is-now-open-source-42e7f5fa5fc6
It was released in public earlier this year by Yandex.

It was inspired by Google's MapReduce, so it has the same root as Hadoop
but I don't think they use the same code. Looks like a very mature project
with more than 60 thousand commits in the repo.

Maybe I'll put it this way, an entire Hadoop ecosystem in a parallel
universe. (Hats off to YTsaurus developers). It's got its own scheduler
similar to YARN, dynamic table support like HBase, query engine similar to
Hive, consensus protocol similar to Raft (we have Apache Zookeeper and
Ratis)


On Thu, Sep 28, 2023 at 1:46 AM Kirill  wrote:

> Hi everyone!
>
> Have you seen this platform https://ytsaurus.tech/platform-overview ?
> What do you think? Has somebody tried it?
> Is it based on Hadoop source code? It is claimed that there is also a
> MapReduce in it.
> Is it possible to run Hadoop programs and Hive queries on ytsaurus?
>
>
>
> Regards,
> Kirill
>


Re: Deploy multi-node Hadoop with Docker

2023-09-22 Thread Wei-Chiu Chuang
The Hadoop's docker image is not for production use. That's why

But we should update that if people are thinking to use it for production.
Not familiar with docker compose but contributions welcomed:
https://github.com/apache/hadoop/blob/docker-hadoop-3/docker-compose.yaml

On Fri, Sep 22, 2023 at 5:44 AM Nikos Spanos 
wrote:

> Hi,
>
>
>
> I am creating a multi-node Hadoop cluster for a personal project, and I
> would like to use the official docker image (apache/hadoop
> ).
>
>
>
> However, looking at the official docker image documentation and the
> docker-compose file I have seen the following environment variable:
>
>
>
> environment:
>
>   ENSURE_NAMENODE_DIR: "/tmp/hadoop-root/dfs/name"
>
>
>
> I would like to know if it is safe to create the namenode directory in the
> /tmp folder since this kind of folder is neither secure nor data
> persistent. Thus, I would like to understand which path is the best
> practice for this. Moreover, which are other environment variables I could
> use of.
>
>
>
> Thanks a lot, in advance.
>
>
>
> Kind regards,
>
>
>
> *Nikos Spanos*
>
>
>
> M.Sc Business Analytics & Big Data| Athens University of Economics &
> Business
>
> Phone Number: +306982310494
>
> Linkedin profile  
>
>
>


[ANNOUNCE] Apache Hadoop 3.3.6 release

2023-06-26 Thread Wei-Chiu Chuang
On behalf of the Apache Hadoop Project Management Committee, I am pleased
to announce the release of Apache Hadoop 3.3.6.

It contains 117 bug fixes, improvements and enhancements since 3.3.5. Users
of Apache Hadoop 3.3.5 and earlier should upgrade to this release.

https://hadoop.apache.org/release/3.3.6.html
Feature highlights:

SBOM artifacts

Starting from this release, Hadoop publishes Software Bill of Materials
(SBOM) using
CycloneDX Maven plugin. For more information about SBOM, please go to
[SBOM](https://cwiki.apache.org/confluence/display/COMDEV/SBOM).

HDFS RBF: RDBMS based token storage support

HDFS Router-Router Based Federation now supports storing delegation tokens
on MySQL,
[HADOOP-18535](https://issues.apache.org/jira/browse/HADOOP-18535)
which improves token operation through over the original Zookeeper-based
implementation.


New File System APIs

[HADOOP-18671](https://issues.apache.org/jira/browse/HADOOP-18671) moved a
number of
HDFS-specific APIs to Hadoop Common to make it possible for certain
applications that
depend on HDFS semantics to run on other Hadoop compatible file systems.

In particular, recoverLease() and isFileClosed() are exposed through
LeaseRecoverable
interface. While setSafeMode() is exposed through SafeMode interface.

Many thanks to everyone who helped in this release by supplying patches,
reviewing them, helping get this release building and testing and
reviewing the final artifacts.

Weichiu


Re: Monitoring HDFS filesystem changes

2023-02-15 Thread Wei-Chiu Chuang
Use the inotify api

https://dev-listener.medium.com/watch-for-changes-in-hdfs-800c6fb5481f
https://github.com/onefoursix/hdfs-inotify-example/blob/master/src/main/java/com/onefoursix/HdfsINotifyExample.java


On Wed, Feb 15, 2023 at 1:12 AM  wrote:

> Hello,
> is there an efficient way to monitoring the HDFS Filesystem for
> owner-right changes?
> For instance, let's say the /a/b/c/d HDFS Directory's owner is called
> user1.
> However, overnight, the owner changed for some unknown reason.
> How can I monitor the /a/b/c/d directory and determine what caused the
> owner to change?
> Many thanks.
> Best regards,
> Philippe
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: user-h...@hadoop.apache.org
>
>


Re: CVE-2022-42889

2022-10-27 Thread Wei-Chiu Chuang
   1. HADOOP-18497 


On Thu, Oct 27, 2022 at 4:45 AM Deepti Sharma S
 wrote:

> Hello Team,
>
>
>
> As we have received the vulnerability “CVE-2022-42889”. We are using
> Apache Hadoop common 3pp version 3.3.3 which has transitive dependency of
> Common text.
>
>
>
> Do you have any plans to fix this vulnerability in Hadoop next version and
> when is the plan?
>
>
>
>
>
> Regards,
>
> Deepti Sharma
> * PMP® & ITIL*
>
>
>


Re: Performance with large no of files

2022-10-10 Thread Wei-Chiu Chuang
Do you have security enabled?

We did some preliminary benchmarks around webhdfs (i really want to revisit
it again) and with security enabled, a lot of overhead is between client
and KDC (SPENGO). Try run webhdfs using delegation tokens should help
remove that bottleneck.

On Sat, Oct 8, 2022 at 8:26 PM Abhishek  wrote:

> Hi,
> We want to backup large no of hadoop small files (~1mn) with webhdfs API
> We are getting a performance bottleneck here and it's taking days to back
> it up.
> Anyone know any solution where performance could be improved using any xml
> settings?
> This would really help us.
> v 3.1.1
>
> Appreciate your help !!
>
> --
>
>
>
>
>
>
>
>
>
>
>
>
>
> ~
> *Abhishek...*
>


Re: Hdfs namenode consume much memory that are not expected

2022-08-01 Thread Wei-Chiu Chuang
Hi
Not familiar with pmap, but
G1GC is not recommended for such a big heap.

To troubleshoot further, I usually run jmp -histo to get a list of top
objects that use the most memory heap.

On Mon, Aug 1, 2022 at 3:08 AM Micro dong  wrote:

>
>  we deploy hdfs in our company.  we meet a unormal situation.
> we set heap memory to 280G, but we actually consume 450G.
>
> *we seen this by pmap.*
>
>
>
>
>
>
>
>
>
>
>
> *   2ae91c00 rw-p   00:00 0 294174720 293705824
> 293705824  293705824 2937058240  0 0198c000 rw-p 
>  00:00 0 173517492 173513320 173513320  173513320 1735133200
>0 [heap]2b2fa800 rw-p   00:00 0  11026824
>  11007512  11007512   11007512  110075120  0 2b2f6200 rw-p
>   00:00 0   1146880   1014280   10142801014280
> 10142800  0 2ae9072b rwxp   00:00 0
> 71808 69280 69280  69280 692800  0 2ae917d88000
> rw-p   00:00 0 25696 24700 24700  24700
> 247000  0 2b325732f000 rw-p   00:00 0 16384
> 16328 16328  16328 163280  0 2ae91aeb9000 rw-p
>   00:00 0  9988  8972  8972   8972
>  89720  0 2b3249063000 rw-p   00:00 0  9216
>  8204  8204   8204  82040  0 2ae905332000 r-xp
>   08:02   8391747 13168  6808  2825   6804
> 00  0 libjvm.so*
>
>
> * additional information:*
> *hadoop version: 2.7.2*
> * -Xms280G *
> *-Xmx280G *
> *-XX:MaxDirectMemorySize=10G *
> *-XX:MetaspaceSize=128M*
> *-XX:+UseG1GC*
>
> any idea will be appreciated
>


Re: Questions regarding setting AWS application load balancer for YARN RM

2022-06-15 Thread Wei-Chiu Chuang
Not familiar with AWS but this warning can be work arounded by upping the
IPC default length limit:
See if you can update core-site.xml, change the
property ipc.maximum.response.length which has the default of 128MB
(=128*1024*1024) to something bigger, such as 256MB.

On Thu, Jun 16, 2022 at 10:05 AM Leon Xu  wrote:

> Hi Hadoop/Yarn Users,
>
> I am trying to set up AWS ALB(application load balancer) for YARN resource
> managers. I am wondering if anyone has experience on that?
> I am able to connect my yarn client to the YARN RM instance directly. But
> after I set up the ALB and try to connect through the ALB, I am getting
> this error:
>
> *java.io.IOException: Failed on local exception:
> org.apache.hadoop.ipc.RpcException: RPC response exceeds maximum data
> length;*
>
> Wondering if anyone has ideas
>
> Thanks
> Leon
>
>
>


Re: Quick check on Log4j/Reload4j plan

2022-03-04 Thread Wei-Chiu Chuang
That would be great! Would you like to start another thread to kick off the
2.10.x release plan?

On Thu, Mar 3, 2022 at 9:39 PM Masatake Iwasaki 
wrote:

> Hi Wei-Chiu Chuang,
>
> > I think a bigger question is whether or not we have someone who would
> like to volunteer to be a release manager for the 2.10.2 release.
> > The last 2.x release was over a year ago.
>
> I can take a RM role if there are needs.
>
> Thanks,
> Masatake Iwasaki
>
> On 2022/03/02 5:54, Wei-Chiu Chuang wrote:
> >
> >
> > On Wed, Mar 2, 2022 at 2:43 AM Brent  brentwritesc...@gmail.com>> wrote:
> >
> > Hey all,
> >
> > I've been trying to go through Jira issues and mailing list archives
> to understand ongoing plans for Log4j 1.x upgrades.  I know technically
> Hadoop is not listed as vulnerable, but some more cautious organizations
> are looking to upgrade anyway.
> >
> > It seems like 3.4.x and beyond releases are talking about moving to
> Log4j2 or Logback (per https://issues.apache.org/jira/browse/HADOOP-12956
> <https://issues.apache.org/jira/browse/HADOOP-12956> and
> https://issues.apache.org/jira/browse/HADOOP-16206 <
> https://issues.apache.org/jira/browse/HADOOP-16206>).
> >
> > It seems like 3.2.x and 3.3.x are talking about moving to
> Reload4j (per https://issues.apache.org/jira/browse/HADOOP-18088 <
> https://issues.apache.org/jira/browse/HADOOP-18088> and
> https://github.com/apache/hadoop/pull/3906 <
> https://github.com/apache/hadoop/pull/3906>).
> >
> > Two questions:
> > - Does that sound accurate?
> >
> > That sounds about right.
> >
> > - Are there any plans to patch Reload4j back into 2.x releases as
> well?
> >
> >
> > I think a bigger question is whether or not we have someone who would
> like to volunteer to be a release manager for the 2.10.2 release.
> > The last 2.x release was over a year ago.
> >
> >
> > Thank you for your time and help and all your hard work on this
> project!
> >
> > ~Brent
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: user-h...@hadoop.apache.org
>
>


Re: [ANNOUNCE] Apache Hadoop 3.3.2 release

2022-03-03 Thread Wei-Chiu Chuang
Thanks a lot for the tremendous work!

On Fri, Mar 4, 2022 at 9:30 AM Chao Sun  wrote:

> Hi All,
>
> It gives me great pleasure to announce that the Apache Hadoop community has
> voted to release Apache Hadoop 3.3.2.
>
> This is the second stable release of Apache Hadoop 3.3 line. It contains
> 284 bug fixes, improvements and enhancements since 3.3.1.
>
> Users are encouraged to read the overview of major changes [1] since 3.3.1.
> For details of 284 bug fixes, improvements, and other enhancements since
> the previous 3.3.1 release, please check release notes [2] and changelog
> [3].
>
> [1]: https://hadoop.apache.org/docs/r3.3.2/index.html
> [2]:
>
> http://hadoop.apache.org/docs/r3.3.2/hadoop-project-dist/hadoop-common/release/3.3.2/RELEASENOTES.3.3.2.html
> [3]:
>
> http://hadoop.apache.org/docs/r3.3.2/hadoop-project-dist/hadoop-common/release/3.3.2/CHANGELOG.3.3.2.html
>
> Many thanks to everyone who contributed to the release, and everyone in the
> Apache Hadoop community! This release is a direct result of your great
> contributions.
>
> Many thanks to everyone who helped in this release process!
>
> Many thanks to Viraj Jasani, Michael Stack, Masatake Iwasaki, Xiaoqiao He,
> Mukund Madhav Thakur, Wei-Chiu Chuang, Steve Loughran, Akira Ajisaka and
> other folks who helped for this release process.
>
> Best Regards,
> Chao
>


Re: Quick check on Log4j/Reload4j plan

2022-03-01 Thread Wei-Chiu Chuang
On Wed, Mar 2, 2022 at 2:43 AM Brent  wrote:

> Hey all,
>
> I've been trying to go through Jira issues and mailing list archives to
> understand ongoing plans for Log4j 1.x upgrades.  I know technically Hadoop
> is not listed as vulnerable, but some more cautious organizations are
> looking to upgrade anyway.
>
> It seems like 3.4.x and beyond releases are talking about moving to Log4j2
> or Logback (per https://issues.apache.org/jira/browse/HADOOP-12956 and
> https://issues.apache.org/jira/browse/HADOOP-16206).
>
> It seems like 3.2.x and 3.3.x are talking about moving to Reload4j (per
> https://issues.apache.org/jira/browse/HADOOP-18088 and
> https://github.com/apache/hadoop/pull/3906).
>
> Two questions:
> - Does that sound accurate?
>
That sounds about right.

> - Are there any plans to patch Reload4j back into 2.x releases as well?
>

I think a bigger question is whether or not we have someone who would like
to volunteer to be a release manager for the 2.10.2 release.
The last 2.x release was over a year ago.

>
> Thank you for your time and help and all your hard work on this project!
>
> ~Brent
>


Re: Next Mandarin Hadoop Online Meetup Jan 6th.

2022-01-09 Thread Wei-Chiu Chuang
Hello

Thanks for joining this event.

The presentation slides (in English) is available at
https://drive.google.com/file/d/1PiZYhzxANqtoyO_nSLt_-v7aP3j17Sbg/view

The recording (in Mandarin) is available at
https://cloudera.zoom.us/rec/share/JaNm70lZQGCZdlFzh9ZbsfrR7MJ7Nazb2g6NCtYPqsRLWtyEhLfgwXOppzMR3csp.HqRJNGXUGSaPu1qw
Access Passcode: 4g1ZF&%f


On Mon, Jan 3, 2022 at 5:39 PM Wei-Chiu Chuang  wrote:

> Hello community,
>
> This week we'll going to have Tao Li (tomscut) speaking about  the
> experience of operating HDFS at BIGO. See you on Thursday!
>
> 题目:《HDFS在BIGO的实践》
> 概要:HDFS作为大数据底层存储服务,在BIGO的发展中起到了非常重要的作用。随着业务的发展和数据的爆发式增长,HDFS单个集群的瓶颈愈发凸显,我们借助Router,将多个HDFS集群整合成一个Namespace,增强集群的扩展能力;改造Router,使其支持Alluxio和自定义策略,并开启HDFS
> EC,实现热、温、冷数据分层存储。同时,通过对HDFS集群慢节点和慢盘的处理,提升了HDFS的读写性能。本次分享主要讲述BIGO对Router的实践经验,以及对慢节点和慢盘的处理经验。
> 关键词:Router,Slow Node,Slow Disk
> 演讲者:Tao Li (Apache id: tomscut)
>
> Date/Time: Jan 6 2PM Beijing Time.
>
> Zoom link: https://cloudera.zoom.us/j/97264903288
>
> One tap mobile
>
> +16465588656,,880548968# US (New York)
>
> +17207072699,,880548968# US
>
> Download Center <https://cloudera.zoom.us/j/880548968>
>
> Dial by your location
>
> +1 646 558 8656 US (New York)
>
> +1 720 707 2699 US
>
> 877 853 5257 US Toll-free
>
> 888 475 4499 US Toll-free
>
> Meeting ID: 972 6490 3288
> Find your local number: https://zoom.us/u/acaGRDfMVl
>


Re: Next Mandarin Hadoop Online Meetup Jan 6th.

2022-01-05 Thread Wei-Chiu Chuang
Just a gentle reminder this is happening now.

On Mon, Jan 3, 2022 at 5:39 PM Wei-Chiu Chuang  wrote:

> Hello community,
>
> This week we'll going to have Tao Li (tomscut) speaking about  the
> experience of operating HDFS at BIGO. See you on Thursday!
>
> 题目:《HDFS在BIGO的实践》
> 概要:HDFS作为大数据底层存储服务,在BIGO的发展中起到了非常重要的作用。随着业务的发展和数据的爆发式增长,HDFS单个集群的瓶颈愈发凸显,我们借助Router,将多个HDFS集群整合成一个Namespace,增强集群的扩展能力;改造Router,使其支持Alluxio和自定义策略,并开启HDFS
> EC,实现热、温、冷数据分层存储。同时,通过对HDFS集群慢节点和慢盘的处理,提升了HDFS的读写性能。本次分享主要讲述BIGO对Router的实践经验,以及对慢节点和慢盘的处理经验。
> 关键词:Router,Slow Node,Slow Disk
> 演讲者:Tao Li (Apache id: tomscut)
>
> Date/Time: Jan 6 2PM Beijing Time.
>
> Zoom link: https://cloudera.zoom.us/j/97264903288
>
> One tap mobile
>
> +16465588656,,880548968# US (New York)
>
> +17207072699,,880548968# US
>
> Download Center <https://cloudera.zoom.us/j/880548968>
>
> Dial by your location
>
> +1 646 558 8656 US (New York)
>
> +1 720 707 2699 US
>
> 877 853 5257 US Toll-free
>
> 888 475 4499 US Toll-free
>
> Meeting ID: 972 6490 3288
> Find your local number: https://zoom.us/u/acaGRDfMVl
>


Next Mandarin Hadoop Online Meetup Jan 6th.

2022-01-03 Thread Wei-Chiu Chuang
Hello community,

This week we'll going to have Tao Li (tomscut) speaking about  the
experience of operating HDFS at BIGO. See you on Thursday!

题目:《HDFS在BIGO的实践》
概要:HDFS作为大数据底层存储服务,在BIGO的发展中起到了非常重要的作用。随着业务的发展和数据的爆发式增长,HDFS单个集群的瓶颈愈发凸显,我们借助Router,将多个HDFS集群整合成一个Namespace,增强集群的扩展能力;改造Router,使其支持Alluxio和自定义策略,并开启HDFS
EC,实现热、温、冷数据分层存储。同时,通过对HDFS集群慢节点和慢盘的处理,提升了HDFS的读写性能。本次分享主要讲述BIGO对Router的实践经验,以及对慢节点和慢盘的处理经验。
关键词:Router,Slow Node,Slow Disk
演讲者:Tao Li (Apache id: tomscut)

Date/Time: Jan 6 2PM Beijing Time.

Zoom link: https://cloudera.zoom.us/j/97264903288

One tap mobile

+16465588656,,880548968# US (New York)

+17207072699,,880548968# US

Download Center 

Dial by your location

+1 646 558 8656 US (New York)

+1 720 707 2699 US

877 853 5257 US Toll-free

888 475 4499 US Toll-free

Meeting ID: 972 6490 3288
Find your local number: https://zoom.us/u/acaGRDfMVl


Apache Hadoop and CVE-2021-44228 Log4JShell vulnerability

2021-12-19 Thread Wei-Chiu Chuang
Hi,
Given the widespread attention to the recent log4j vulnerability
(CVE-2021-44228), I'd like to share an update from the Hadoop developer
community regarding the incident.

As you probably know, Apache Hadoop depends on the log4j library to keep
log files. The highlighted vulnerability CVE-2021-44228 affects log4j2
2.0-beta9 through 2.15.0. Hadoop has been using log4j 1.2.x in the last 10
years and therefore no release is affected by it.

That said, another CVE CVE-2021-4104 states the JMSAppender in the 1.2.x
log4j, which is used by Apache Hadoop, is vulnerable to the same attack.
Fortunately, it is not configured by default and Hadoop does not enable it
by default.

For more information and mitigation, please check out Hadoop's CVE list
page.
https://hadoop.apache.org/cve_list.html

Wei-Chiu


Re: Any comment on the log4j issue?

2021-12-17 Thread Wei-Chiu Chuang
I filed a jira HADOOP-18050
 and posted a PR to
document our stance on the log4jshell vulnerability. Please review.

On Fri, Dec 17, 2021 at 5:59 PM Brahma Reddy Battula 
wrote:

>
>
> CVE-2021-44228 states that, it will affect the Apache Log4j2 2.0-beta9
> through 2.12.1 and 2.13.0 through 2.15.0 JNDI features used in
> configuration, log messages, and parameters do not protect against attacker
> controlled LDAP and other JNDI related endpoints *And hadoop uses the
> log4j1 (1.2.17) so it will not impact.*
>
>
>
> Please go through the following link for affected apache projects.
>
> https://blogs.apache.org/security/entry/cve-2021-44228
>
> On Thu, Dec 16, 2021 at 4:25 PM Rupert Mazzucco 
> wrote:
>
>> The hadoop.apache.org page is curiously silent about this, and there is
>> no CVE. Isn't this library used in Hadoop? Pretty sure I saw
>> log4j.properties somewhere. Can anybody shed some light on the
>> vulnerability of a Hadoop installation? Can it be exploited via RPC? The
>> HDFS or YARN web interface? The command line?
>>
>> Thanks
>> Rupert
>>
>>
>
> --
>
>
>
> --Brahma Reddy Battula
>


Three Hadoop talks in this year's ApacheCon Asia

2021-07-12 Thread Wei-Chiu Chuang
For your information,

While drafting the upcoming quarterly report, I found there are three talks
that are directly related to Hadoop in this year's ApacheCon Asia.
https://apachecon.com/acasia2021/tracks/bigdata.html

Bigtop 3.0: Rerising community driven Hadoop distribution

Technical tips for secure Apache Hadoop cluster

Data Lake accelerator on Hadoop-COS in Tencent Cloud


There are other relevant talks in the Big Data track, including Ozone,
Impala, Parquet and so on. I am sure you'll find the talks useful.

The ApacheCon will take place online virtually between Aug 6-8.

Thanks,
Wei-Chiu


[ANNOUNCE] Apache Hadoop 3.3.1 release

2021-06-15 Thread Wei-Chiu Chuang
Hi All,

It gives me great pleasure to announce that the Apache Hadoop community has
voted to release Apache Hadoop 3.3.1.

This is the first stable release of Apache Hadoop 3.3.x line. It contains
697 bug fixes, improvements and enhancements since 3.3.0.

Users are encouraged to read the overview of major changes
<https://hadoop.apache.org/docs/r3.3.1/index.html> since 3.3.0. For details
of 697 bug fixes, improvements, and other enhancements since the previous
3.3.0 release, please check release notes
<http://hadoop.apache.org/docs/r3.3.1/hadoop-project-dist/hadoop-common/release/3.3.1/RELEASENOTES.3.3.1.html>
 and changelog
<http://hadoop.apache.org/docs/r3.3.1/hadoop-project-dist/hadoop-common/release/3.3.1/CHANGES.3.3.1.html>
detail
the changes since 3.3.0.

Many thanks to everyone who contributed to the release, and everyone in the
Apache Hadoop community! This release is a direct result of your great
contributions.

Many thanks to everyone who helped in this release process!

Many thanks to Sean Busbey, Chao Sun, Steve Loughran, Masatake Iwasaki,
Michael Stack, Viraj Jasani, Eric Payne, Ayush Saxena, Vinayakumar B,
Takanobu Asanuma, Xiaoqiao He and other folks who continued helps for this
release process.

Best Regards,
Wei-Chiu Chuang


Re: PySpark Write File Container exited with a non-zero exit code 143

2021-05-19 Thread Wei-Chiu Chuang
Have you checked the executor log?
In most cases the executor fails like that because of insufficient memory.
You should be able to see more details looking at the executor log.

On Thu, May 20, 2021 at 3:28 AM Clay McDonald <
stuart.mcdon...@bateswhite.com> wrote:

> Hello all,
>
>
>
> I’m hoping someone can give me some direction for troubleshooting this
> issue, I’m trying to write from Spark on an HortonWorks(Cloudera) HDP
> cluster. I ssh directly to the first datanode and run PySpark with the
> following command; however, it is always failing no matter what size I set
> memory in Yarn Containers and Yarn Queues. Any suggestions?
>
>
>
>
>
>
>
> pyspark --conf queue=default --conf executory-memory=24G
>
>
>
> --
>
>
>
> HDFS_RAW="/HDFS/Data/Test/Original/MyData_data/"
>
> #HDFS_OUT="/ HDFS/Data/Test/Processed/Convert_parquet/Output"
>
> HDFS_OUT="/tmp"
>
> ENCODING="utf-16"
>
>
>
> fileList1=[
>
> 'Test _2003.txt'
>
> ]
>
> from  pyspark.sql.functions import regexp_replace,col
>
> for f in fileList1:
>
> fname=f
>
> fname_noext=fname.split('.')[0]
>
> df =
> spark.read.option("delimiter","|").option("encoding",ENCODING).option("multiLine",True).option('wholeFile',"true").csv('{}/{}'.format(HDFS_RAW,fname),
> header=True)
>
> lastcol=df.columns[-1]
>
> print('showing {}'.format(fname))
>
> if ('\r' in lastcol):
>
> lastcol=lastcol.replace('\r','')
>
> df=df.withColumn(lastcol,
> regexp_replace(col("{}\r".format(lastcol)), "[\r]",
> "")).drop('{}\r'.format(lastcol))
>
>
> df.write.format('parquet').mode('overwrite').save("{}/{}".format(HDFS_OUT,fname_noext))
>
>
>
>
>
>
>
> Caused by: org.apache.spark.SparkException: Job aborted due to stage
> failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task
> 0.3 in stage 1.0 (TID 4, DataNode01.mydomain.com, executor 5):
> ExecutorLostFailure (executor 5 exited caused by one of the running tasks)
> Reason: Container marked as failed:
> container_e331_1621375512548_0021_01_06 on host:
> DataNode01.mydomain.com. Exit status: 143. Diagnostics: [2021-05-19
> 18:09:06.392]Container killed on request. Exit code is 143
> [2021-05-19 18:09:06.413]Container exited with a non-zero exit code 143.
> [2021-05-19 18:09:06.414]Killed by external signal
>
>
>
>
>
> THANKS! CLAY
>
>
>


Re: HDFS upgrade skip versions?

2020-12-15 Thread Wei-Chiu Chuang
Probably one of the protobuf incompatibility. Unfortunately we don't have
an open source tool to detect protobuf incompat.

A few related issues:

   1. HDFS-15700 
   2.
  1. HDFS-14726 
 2.
1. HDFS-13371 
 1. HDFS-15660 
  2. I know folks upgraded from 2.7 to 2.10 (LinkedIn?), and 2.8 to
  2.10 (Verizon Media).
  3. Searched JIRA I don't see a known bug between 2.6 and 2.10 in the
  DN heartbeat protobuf.


On Mon, Dec 14, 2020 at 10:57 PM Chad William Seys
 wrote:

> Hi all,
>Is it required or highly recommended that one not skip between HDFS
> (hadoop) versions?
>I tried skipping from 2.6 to 2.10 and it didn't work so well. :/
>Actually, I tested this with a tiny cluster and it worked, but on the
> production cluster the datanodes did not report blocks to the namenodes.
>   (Did report storage and connectivity otherwise.)
>
> Chad.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: user-h...@hadoop.apache.org
>
>


Re: [E] Re: Increased DN heap usage during Hadoop 3 upgrade

2020-10-06 Thread Wei-Chiu Chuang
Sorry for not being specific.
I was referring to HDFS-8791
<https://issues.apache.org/jira/browse/HDFS-8791> (block ID-based DN
storage layout can be very slow for datanode on ext4) where it is in 2.8
and above.

As I understand it, the increased heap usage only occurs during upgrade. No
issue afterwards.

My experience was based on CDH5 to CDH6 upgrade (Hadoop 2.6 -> Hadoop 3.0)
and HDP2 to HDP3 (Hadoop 2.7 -> Hadoop 3.1) upgrade. It is nearly
impossible to tell which commit increases heap usage worse during upgrade.



On Tue, Oct 6, 2020 at 3:01 PM Kihwal Lee  wrote:

> Which layout change are you referring to? The only layout change I know of
> was done in 2.7, IIRC. We backported that to 2.6 and did not see any
> adverse effects at that time.
>
> Is datanode using more heap all the time? Or is it running into trouble
> when generating full block reports?
>
> Kihwal
>
> On Mon, Oct 5, 2020 at 1:40 PM Wei-Chiu Chuang
>  wrote:
>
>> We experienced this issue on CDH6 and HDP3, so roughly Hadoop 3.0.x and
>> 3.1.x.
>> Hermanth experienced the same issue on Hadoop 3.1.1 as well (HDFS-15569
>> <
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__issues.apache.org_jira_browse_HDFS-2D15569&d=DwIBaQ&c=sWW_bEwW_mLyN3Kx2v57Q8e-CRbmiT9yOhqES_g_wVY&r=b6gUZYewojO-9YMJdyeI_g&m=itpohwgKPN5qoauYyyMxhGSnasaP3LLbbMVezETEenA&s=kgWYVv2utuAyPWBhv0KVH8ZZGJqQBMvUM7dZ8J0jaa8&e=
>> >)
>>
>> On Mon, Oct 5, 2020 at 11:03 AM Igor Dvorzhak  wrote:
>>
>> > What Hadoop 3 version do you use?
>> >
>> > On Mon, Oct 5, 2020 at 10:03 AM Wei-Chiu Chuang 
>> > wrote:
>> >
>> >> I have anecdotally learned of multiple data points where during the
>> >> upgrading from Hadoop 2 to Hadoop 3, DN heap usage increases to the
>> point
>> >> where it goes OOM.
>> >>
>> >> Don't have much logs for this issue, but I suspect it's caused by the
>> >> layout change added in Hadoop 2.8.0.
>> >>
>> >> Does anyone else observe the same issue and how do you mitigate this?
>> For
>> >> now we suggested increasing DN heap size prior to upgrade as part of
>> >> pre-upgrade checklist.
>> >>
>> >> Thanks,
>> >> Wei-Chiu
>> >>
>> >
>>
>


Re: Increased DN heap usage during Hadoop 3 upgrade

2020-10-05 Thread Wei-Chiu Chuang
We experienced this issue on CDH6 and HDP3, so roughly Hadoop 3.0.x and
3.1.x.
Hermanth experienced the same issue on Hadoop 3.1.1 as well (HDFS-15569
<https://issues.apache.org/jira/browse/HDFS-15569>)

On Mon, Oct 5, 2020 at 11:03 AM Igor Dvorzhak  wrote:

> What Hadoop 3 version do you use?
>
> On Mon, Oct 5, 2020 at 10:03 AM Wei-Chiu Chuang 
> wrote:
>
>> I have anecdotally learned of multiple data points where during the
>> upgrading from Hadoop 2 to Hadoop 3, DN heap usage increases to the point
>> where it goes OOM.
>>
>> Don't have much logs for this issue, but I suspect it's caused by the
>> layout change added in Hadoop 2.8.0.
>>
>> Does anyone else observe the same issue and how do you mitigate this? For
>> now we suggested increasing DN heap size prior to upgrade as part of
>> pre-upgrade checklist.
>>
>> Thanks,
>> Wei-Chiu
>>
>


Increased DN heap usage during Hadoop 3 upgrade

2020-10-05 Thread Wei-Chiu Chuang
I have anecdotally learned of multiple data points where during the
upgrading from Hadoop 2 to Hadoop 3, DN heap usage increases to the point
where it goes OOM.

Don't have much logs for this issue, but I suspect it's caused by the
layout change added in Hadoop 2.8.0.

Does anyone else observe the same issue and how do you mitigate this? For
now we suggested increasing DN heap size prior to upgrade as part of
pre-upgrade checklist.

Thanks,
Wei-Chiu


[ANNOUNCEMENT] Apache Hadoop 2.9.x release line end of life

2020-09-07 Thread Wei-Chiu Chuang
The Apache Hadoop community has voted to end the release line of 2.9.x.
(Vote thread: https://s.apache.org/ApacheHadoop2.9EOLVote)

The first 2.9 release, 2.9.0 was released on 12/17/2017
The last 2.9 release, 2.9.2, was released on 1/19/2018

Existing 2.9.x users are encouraged to upgrade to newer release lines:
2.10.0 / 3.1.4 / 3.2.1 / 3.3.0.

Please check out our Release EOL wiki for details:
https://cwiki.apache.org/confluence/display/HADOOP/EOL+(End-of-life)+Release+Branches

Best Regards,
Wei-Chiu Chuang (On Behalf of the Apache Hadoop PMC)


Hadoop & ApacheCon

2020-09-01 Thread Wei-Chiu Chuang
Hello,

This year's ApacheCon will take place online between 9/29 and 10/1. There
are lots of sessions made by our fellow Hadoop developers:

https://apachecon.com/acah2020/tracks/bigdata-1.html
https://apachecon.com/acah2020/tracks/bigdata-2.html

In case you didn't realize, the registration is free, so be sure to check
them out!

Some of the talks that are closely related to Hadoop:

Apache Hadoop YARN: Past, Now and Future
Szilard Nemeth, Sunil Govindan

Hadoop Storage Reloaded: the 5 lessons Ozone learned from HDFS
Márton Elek

GDPR’s Right to be Forgotten in Apache Hadoop Ozone
Dinesh Chitlangia

Global File System View Across all Hadoop Compatible File Systems with the
LightWeight Client Side Mount Points.
Uma Maheswara Rao Gangumalla

Apache Hadoop YARN fs2cs: Converting Fair Scheduler to Capacity Scheduler
Peter Bacsko

HDFS Migration from 2.7 to 3.3 and enabling Router Based Federation (RBF)
in production
Akira Ajisaka

Stepping towards Bigdata on ARM
Vinayakumar B, Liu Sheng

I am sure I missed out others since I only looked at the Big Data tracks.
Feel free to add more if you want to promote your talk :)

Cheers
Weichiu


Re: [DISCUSS] fate of branch-2.9

2020-08-26 Thread Wei-Chiu Chuang
Bump up this thread after 6 months.

Is anyone still interested in the 2.9 release line? Or are we good to start
the EOL process? The 2.9.2 was released in Nov 2018.

I'd really like to see the community to converge to fewer release lines and
make more frequent releases in each line.

Thanks,
Weichiu


On Fri, Mar 6, 2020 at 5:47 PM Wei-Chiu Chuang  wrote:

> I think that's a great suggestion.
> Currently, we make 1 minor release per year, and within each minor release
> we bring up 1 thousand to 2 thousand commits in it compared with the
> previous one.
> I can totally understand it is a big bite for users to swallow. Having a
> more frequent release cycle, plus LTS and non-LTS releases should help with
> this. (Of course we will need to make the release preparation much easier,
> which is currently a pain)
>
> I am happy to discuss the release model further in the dev ML. LTS v.s.
> non-LTS is one suggestion.
>
> Another similar issue: In the past Hadoop strived to
> maintain compatibility. However, this is no longer sustainable as more CVEs
> coming from our dependencies: netty, jetty, jackson ... etc.
> In many cases, updating the dependencies brings breaking changes. More
> recently, especially in Hadoop 3.x, I started to make the effort to update
> dependencies much more frequently. How do users feel about this change?
>
> On Thu, Mar 5, 2020 at 7:58 AM Igor Dvorzhak 
> wrote:
>
>> Maybe Hadoop will benefit from adopting a similar release and support
>> strategy as Java? I.e. designate some releases as LTS and support them for
>> 2 (?) years (it seems that 2.7.x branch was de-facto LTS), other non-LTS
>> releases will be supported for 6 months (or until next release). This
>> should allow to reduce maintenance cost of non-LTS release and provide
>> conservative users desired stability by allowing them to wait for new LTS
>> release and upgrading to it.
>>
>> On Thu, Mar 5, 2020 at 1:26 AM Rupert Mazzucco 
>> wrote:
>>
>>> After recently jumping from 2.7.7 to 2.10 without issue myself, I vote
>>> for keeping only the 2.10 line.
>>> It would seem all other 2.x branches can upgrade to a 2.10.x easily if
>>> they feel like upgrading at all,
>>> unlike a jump to 3.x, which may require more planning.
>>>
>>> I also vote for having only one main 3.x branch. Why are there 3.1.x and
>>> 3.2.x seemingly competing,
>>> and now 3.3.x? For a community that does not have the resources to
>>> manage multiple release lines,
>>> you guys sure like to multiply release lines a lot.
>>>
>>> Cheers
>>> Rupert
>>>
>>> Am Mi., 4. März 2020 um 19:40 Uhr schrieb Wei-Chiu Chuang
>>> :
>>>
>>>> Forwarding the discussion thread from the dev mailing lists to the user
>>>> mailing lists.
>>>>
>>>> I'd like to get an idea of how many users are still on Hadoop 2.9.
>>>> Please share your thoughts.
>>>>
>>>> On Mon, Mar 2, 2020 at 6:30 PM Sree Vaddi
>>>>  wrote:
>>>>
>>>>> +1
>>>>>
>>>>> Sent from Yahoo Mail on Android
>>>>>
>>>>>   On Mon, Mar 2, 2020 at 5:12 PM, Wei-Chiu Chuang
>>>>> wrote:   Hi,
>>>>>
>>>>> Following the discussion to end branch-2.8, I want to start a
>>>>> discussion
>>>>> around what's next with branch-2.9. I am hesitant to use the word "end
>>>>> of
>>>>> life" but consider these facts:
>>>>>
>>>>> * 2.9.0 was released Dec 17, 2017.
>>>>> * 2.9.2, the last 2.9.x release, went out Nov 19 2018, which is more
>>>>> than
>>>>> 15 months ago.
>>>>> * no one seems to be interested in being the release manager for 2.9.3.
>>>>> * Most if not all of the active Hadoop contributors are using Hadoop
>>>>> 2.10
>>>>> or Hadoop 3.x.
>>>>> * We as a community do not have the cycle to manage multiple release
>>>>> line,
>>>>> especially since Hadoop 3.3.0 is coming out soon.
>>>>>
>>>>> It is perhaps the time to gradually reduce our footprint in Hadoop
>>>>> 2.x, and
>>>>> encourage people to upgrade to Hadoop 3.x
>>>>>
>>>>> Thoughts?
>>>>>
>>>>>


Re: dfs.namenode.replication.min can set by client while reading/writing hdfs files

2020-07-17 Thread Wei-Chiu Chuang
It’s a system wide setting.

Yes it is configurable. No it is general a bad idea to change it for
anything other than 1.  Hadoop has not been properly tested with this value
set to 2 or above.

We really should update the description of this config and say “no, you
really don’t want to change it”

Upendra Yadav 於 2020年7月17日 週五,上午1:04寫道:

> Hi,
>
> Can I set dfs.namenode.replication.min on client side?
> or
> It can only be set on namenode?
>
> Is there any docs available that separates configs that we can use in:
> 1. namenode
> 2. datanode
> 3. hdfs client (for read and write hadoop files)
>
>


Re: Hadoop monitoring using Prometheus

2020-06-02 Thread Wei-Chiu Chuang
Check out HADOOP-16398 
It's a new feature in Hadoop 3.3.0

Akira might be able to help.

On Tue, Jun 2, 2020 at 5:56 PM ravi kanth  wrote:

> Hi Everyone,
>
> We have a production-ready cluster with 35 nodes that we are currently
> using. We are currently using System metrics using Prometheus + Grafana to
> visualize server. However, we are more interested in visualizing the Hadoop
> & Yarn service level metrics.
>
> I came across hadoop JMX port which exposes all the needed metrics from
> the service. However, I remained unsuccessful in tagging these metrics to
> prometheus jmx agent.
>
> Is there anyone who successfully got the JMX monitoring of Hadoop
> components work with Prometheus? Any help is greatly appreciated.
>
> Currently, we have scripts to parse the meaningful values out of JMX end
> point of the namenode & datanodes.
>
> Thanks In advance,
>
> Ravi
>
>


Re: How to identify active namenode?

2020-05-11 Thread Wei-Chiu Chuang
You can also check the namenode status through the namenode web UI/JMX.

On Sat, May 2, 2020 at 1:27 AM Ayush Saxena  wrote:

> Hi,
> Can you check :
>
> https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/HDFSHighAvailabilityWithQJM.html#Administrative_commands
>
> You can use [-getServiceState ]
>
> -Ayush
>
>
> On Sat, 2 May 2020 at 13:39, Debraj Manna 
> wrote:
>
>> I am using HDFS 2.6.0. I checked
>> https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#dfsadmin
>>  but
>> did not see any option for getting active namenode.
>>
>> I am looking for a command-line approach.
>>
>> On Sat, May 2, 2020 at 12:43 PM Ayush Saxena  wrote:
>>
>>> Hi Debraj,
>>> There is a command in haadmin -getAllServiceState, You can use that.
>>> Can read this for details :
>>>
>>> https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HDFSCommands.html#haadmin
>>> In the namenode UI also you can see the state of the namenode.
>>>
>>> -Ayush
>>>
>>>
>>> On Sat, 2 May 2020 at 12:32, Debraj Manna 
>>> wrote:
>>>
 Hi

 Can someone let me know how can I identify which is an active namenode?

 Thanks,




Re: Hadoop Storage community call

2020-04-01 Thread Wei-Chiu Chuang
Reminder -- this call is happening in about 2 hours.

Stay safe!
Weichiu

On Tue, Mar 24, 2020 at 5:44 PM Wei-Chiu Chuang  wrote:

> Hi!
>
> For the bi-weekly Hadoop Storage community next week, we'll do something
> different this time:
>
> Gabor Bota is going to talk about "404 Not Found" -- New and old issues
> with S3A; Current state and what's next.
>
> Stay tuned. Looking forward to seeing more cloud connector topics in the
> future.
>
> April 1st (Wednesday) US Pacific 10am, GMT 5pm, India: 10:30pm
>
>
> Please join via Zoom:
> https://cloudera.zoom.us/j/880548968
>
> Past meeting minutes
>
> https://docs.google.com/document/d/1jXM5Ujvf-zhcyw_5kiQVx6g-HeKe-YGnFS_1-qFXomI/edit
>
>


Fwd: Hadoop Storage community call

2020-03-24 Thread Wei-Chiu Chuang
Forwarding this to the user mailing list since the topic may be more
interesting for users.

-- Forwarded message -
From: Wei-Chiu Chuang 
Date: Tue, Mar 24, 2020 at 5:44 PM
Subject: Hadoop Storage community call
To: Hadoop Common , Hdfs-dev <
hdfs-...@hadoop.apache.org>
Cc: Gabor Bota 


Hi!

For the bi-weekly Hadoop Storage community next week, we'll do something
different this time:

Gabor Bota is going to talk about "404 Not Found" -- New and old issues
with S3A; Current state and what's next.

Stay tuned. Looking forward to seeing more cloud connector topics in the
future.

April 1st (Wednesday) US Pacific 10am, GMT 5pm, India: 10:30pm


Please join via Zoom:
https://cloudera.zoom.us/j/880548968

Past meeting minutes
https://docs.google.com/document/d/1jXM5Ujvf-zhcyw_5kiQVx6g-HeKe-YGnFS_1-qFXomI/edit


[ANNOUNCEMENT] Apache Hadoop 2.8.x release line end of life

2020-03-09 Thread Wei-Chiu Chuang
The Apache Hadoop community has voted to end the release line of 2.8.x.
(Vote thread: https://s.apache.org/ApacheHadoop2.8EOLVote)

The first 2.8 release, 2.8.0 was released on 03/27/2017
The last 2.8 release, 2.8.5, was released on 09/15/2018

Existing 2.8.x users are encouraged to upgrade to newer release lines:
2.10.0 / 3.1.3 / 3.2.1.

Please check out our Release EOL wiki for details:
https://cwiki.apache.org/confluence/display/HADOOP/EOL+(End-of-life)+Release+Branches

Best Regards,
Wei-Chiu Chuang (On Behalf of the Apache Hadoop PMC)


Re: [DISCUSS] fate of branch-2.9

2020-03-06 Thread Wei-Chiu Chuang
I think that's a great suggestion.
Currently, we make 1 minor release per year, and within each minor release
we bring up 1 thousand to 2 thousand commits in it compared with the
previous one.
I can totally understand it is a big bite for users to swallow. Having a
more frequent release cycle, plus LTS and non-LTS releases should help with
this. (Of course we will need to make the release preparation much easier,
which is currently a pain)

I am happy to discuss the release model further in the dev ML. LTS v.s.
non-LTS is one suggestion.

Another similar issue: In the past Hadoop strived to
maintain compatibility. However, this is no longer sustainable as more CVEs
coming from our dependencies: netty, jetty, jackson ... etc.
In many cases, updating the dependencies brings breaking changes. More
recently, especially in Hadoop 3.x, I started to make the effort to update
dependencies much more frequently. How do users feel about this change?

On Thu, Mar 5, 2020 at 7:58 AM Igor Dvorzhak  wrote:

> Maybe Hadoop will benefit from adopting a similar release and support
> strategy as Java? I.e. designate some releases as LTS and support them for
> 2 (?) years (it seems that 2.7.x branch was de-facto LTS), other non-LTS
> releases will be supported for 6 months (or until next release). This
> should allow to reduce maintenance cost of non-LTS release and provide
> conservative users desired stability by allowing them to wait for new LTS
> release and upgrading to it.
>
> On Thu, Mar 5, 2020 at 1:26 AM Rupert Mazzucco 
> wrote:
>
>> After recently jumping from 2.7.7 to 2.10 without issue myself, I vote
>> for keeping only the 2.10 line.
>> It would seem all other 2.x branches can upgrade to a 2.10.x easily if
>> they feel like upgrading at all,
>> unlike a jump to 3.x, which may require more planning.
>>
>> I also vote for having only one main 3.x branch. Why are there 3.1.x and
>> 3.2.x seemingly competing,
>> and now 3.3.x? For a community that does not have the resources to manage
>> multiple release lines,
>> you guys sure like to multiply release lines a lot.
>>
>> Cheers
>> Rupert
>>
>> Am Mi., 4. März 2020 um 19:40 Uhr schrieb Wei-Chiu Chuang
>> :
>>
>>> Forwarding the discussion thread from the dev mailing lists to the user
>>> mailing lists.
>>>
>>> I'd like to get an idea of how many users are still on Hadoop 2.9.
>>> Please share your thoughts.
>>>
>>> On Mon, Mar 2, 2020 at 6:30 PM Sree Vaddi
>>>  wrote:
>>>
>>>> +1
>>>>
>>>> Sent from Yahoo Mail on Android
>>>>
>>>>   On Mon, Mar 2, 2020 at 5:12 PM, Wei-Chiu Chuang
>>>> wrote:   Hi,
>>>>
>>>> Following the discussion to end branch-2.8, I want to start a discussion
>>>> around what's next with branch-2.9. I am hesitant to use the word "end
>>>> of
>>>> life" but consider these facts:
>>>>
>>>> * 2.9.0 was released Dec 17, 2017.
>>>> * 2.9.2, the last 2.9.x release, went out Nov 19 2018, which is more
>>>> than
>>>> 15 months ago.
>>>> * no one seems to be interested in being the release manager for 2.9.3.
>>>> * Most if not all of the active Hadoop contributors are using Hadoop
>>>> 2.10
>>>> or Hadoop 3.x.
>>>> * We as a community do not have the cycle to manage multiple release
>>>> line,
>>>> especially since Hadoop 3.3.0 is coming out soon.
>>>>
>>>> It is perhaps the time to gradually reduce our footprint in Hadoop 2.x,
>>>> and
>>>> encourage people to upgrade to Hadoop 3.x
>>>>
>>>> Thoughts?
>>>>
>>>>


Re: [DISCUSS] fate of branch-2.9

2020-03-06 Thread Wei-Chiu Chuang
Thanks for sharing your upgrade experience! That's great to hear.

I can't speak for others but I try to keep 3.1 as good as possible. My
employer has much more interest in 3.x release lines.
Also, given that we typically make one release each year and that a large
installation typically runs on a certain release line for 2-3 years or
longer, I think it is reasonable to see up to 3 3.x release lines.

On Thu, Mar 5, 2020 at 1:26 AM Rupert Mazzucco 
wrote:

> After recently jumping from 2.7.7 to 2.10 without issue myself, I vote for
> keeping only the 2.10 line.
> It would seem all other 2.x branches can upgrade to a 2.10.x easily if
> they feel like upgrading at all,
> unlike a jump to 3.x, which may require more planning.
>
> I also vote for having only one main 3.x branch. Why are there 3.1.x and
> 3.2.x seemingly competing,
> and now 3.3.x? For a community that does not have the resources to manage
> multiple release lines,
> you guys sure like to multiply release lines a lot.
>
> Cheers
> Rupert
>
> Am Mi., 4. März 2020 um 19:40 Uhr schrieb Wei-Chiu Chuang
> :
>
>> Forwarding the discussion thread from the dev mailing lists to the user
>> mailing lists.
>>
>> I'd like to get an idea of how many users are still on Hadoop 2.9.
>> Please share your thoughts.
>>
>> On Mon, Mar 2, 2020 at 6:30 PM Sree Vaddi 
>> wrote:
>>
>>> +1
>>>
>>> Sent from Yahoo Mail on Android
>>>
>>>   On Mon, Mar 2, 2020 at 5:12 PM, Wei-Chiu Chuang
>>> wrote:   Hi,
>>>
>>> Following the discussion to end branch-2.8, I want to start a discussion
>>> around what's next with branch-2.9. I am hesitant to use the word "end of
>>> life" but consider these facts:
>>>
>>> * 2.9.0 was released Dec 17, 2017.
>>> * 2.9.2, the last 2.9.x release, went out Nov 19 2018, which is more than
>>> 15 months ago.
>>> * no one seems to be interested in being the release manager for 2.9.3.
>>> * Most if not all of the active Hadoop contributors are using Hadoop 2.10
>>> or Hadoop 3.x.
>>> * We as a community do not have the cycle to manage multiple release
>>> line,
>>> especially since Hadoop 3.3.0 is coming out soon.
>>>
>>> It is perhaps the time to gradually reduce our footprint in Hadoop 2.x,
>>> and
>>> encourage people to upgrade to Hadoop 3.x
>>>
>>> Thoughts?
>>>
>>>


Re: [DISCUSS] fate of branch-2.9

2020-03-04 Thread Wei-Chiu Chuang
Forwarding the discussion thread from the dev mailing lists to the user
mailing lists.

I'd like to get an idea of how many users are still on Hadoop 2.9.
Please share your thoughts.

On Mon, Mar 2, 2020 at 6:30 PM Sree Vaddi 
wrote:

> +1
>
> Sent from Yahoo Mail on Android
>
>   On Mon, Mar 2, 2020 at 5:12 PM, Wei-Chiu Chuang
> wrote:   Hi,
>
> Following the discussion to end branch-2.8, I want to start a discussion
> around what's next with branch-2.9. I am hesitant to use the word "end of
> life" but consider these facts:
>
> * 2.9.0 was released Dec 17, 2017.
> * 2.9.2, the last 2.9.x release, went out Nov 19 2018, which is more than
> 15 months ago.
> * no one seems to be interested in being the release manager for 2.9.3.
> * Most if not all of the active Hadoop contributors are using Hadoop 2.10
> or Hadoop 3.x.
> * We as a community do not have the cycle to manage multiple release line,
> especially since Hadoop 3.3.0 is coming out soon.
>
> It is perhaps the time to gradually reduce our footprint in Hadoop 2.x, and
> encourage people to upgrade to Hadoop 3.x
>
> Thoughts?
>
>


[通知] 建立 user-zh 郵件列表

2020-02-28 Thread Wei-Chiu Chuang
大家好
Apache Hadoop欢迎来自世界各地的参与者。我很高兴代表Hadoop PMC宣布我们创建了user...@hadoop.apache.org
邮件列表。

这个邮件列表的目的是给中文(简体/正体)用户询问关于Apache Hadoop的问题。欢迎善于使用中文的参与者利用此列表发问。

过去数年间我们发现在中国的本地用户社区蓬勃成长。包括去年第一次在北京由Apache Hadoop
PMC举行的Hadoop社区聚会。我们希望借由创建一个中文友善的邮件列表能使Apache
Hadoop成为一个更加多元的社区。我们也希望借此能使中文用户以Apache Way交流,并使它成为中文世界用户社区与世界用户社区之间的桥梁。

请注意虽然user-zh邮件列表是为中文用户讨论而设立,开发的讨论包括设计及代码修改仍应以英文在 *-dev@, JIRAs and
GitHub上进行。

这个邮件列表目前已经可以使用了,我们的网站也将再更新后加入此邮件列表。任何人都可借由发信至
user-zh-subscr...@hadoop.apache.org 订阅此列表。非订阅者的信件经审核后可发出。

- 庄伟赳(Apache Hadoop PMC代表)


大家好

Apache Hadoop歡迎來自世界各地的參與者。我很高興代表Hadoop PMC宣布我們建立了user...@hadoop.apache.org
郵件列表。

這個郵件列表的目的是給中文(簡體/正體)使用者詢問關於Apache Hadoop的問題。歡迎善於使用中文的參與者利用此列表發問。

過去數年間我們發現在中國的本地使用者社群蓬勃成長。包括去年第一次在北京由Apache Hadoop
PMC舉行的Hadoop社群聚會。我們希望藉由建立一個中文友善的郵件列表能使Apache
Hadoop成為一個更加多元的社群。我們也希望藉此能使中文使用者以Apache Way交流,並使它成為中文世界使用者社群與世界使用者社群之間的橋樑。


請注意雖然user-zh郵件列表是為中文使用者討論而設立,開發的討論包括設計及程式碼修改仍應以英文在 *-dev@, JIRAs and
GitHub上進行。

這個郵件列表目前已經可以使用了,我們的網站也將再更新後加入此郵件列表。任何人都可藉由發信至
user-zh-subscr...@hadoop.apache.org 訂閱此列表。非訂閱者的信件經審核後可發出。

- 莊偉赳(Apache Hadoop PMC代表)

On Fri, Feb 28, 2020 at 9:30 AM Wei-Chiu Chuang  wrote:

> Hi!
>
> Apache Hadoop welcomes contributors from around the world. On behalf of
> the Hadoop PMC, I am pleased to announce the creation of a new mailing list
> user...@hadoop.apache.org.
>
> The intent of this mailing list is to act as a place for users to ask
> questions about Apache Hadoop in Chinese (Traditional/Simplified).
> Individuals who feel more comfortable communicating in Chinese should feel
> welcome to ask questions in Chinese on this list.
>
> Over the past few years we have observed a healthy growing local user
> local community in China. Evidence include the first ever Hadoop
> Community Meetup
> <https://blogs.apache.org/hadoop/entry/hadoop-community-meetup-beijing-aug> in
> Beijing hosted by the Apache Hadoop PMC last year. We hope that by creating
> a Mandarin Chinese friendly mailing list will make the Apache Hadoop
> project a more diverse community. We also hope that by doing so, will make
> the Chinese users to operate in the Apache Way, and to serve as a bridge
> between the local user community with the global community.
>
> Please note that while the user-zh mailing list is set up for user
> discussions in Mandarin, development discussions such as design and code
> changes should still go to *-dev@, JIRAs and GitHub in English as is.
>
> The mailing list is live as of now and the website
> <https://hadoop.apache.org/mailing_lists.html> will be updated shortly to
> include the mailing list. Any one can subscribe to this list by sending an
> email to user-zh-subscr...@hadoop.apache.org. Non-subscribers may also
> post messages after the moderators' approvals.
>
> - Wei-Chiu Chuang (on behalf of the Apache Hadoop PMC)
>


[ANNOUNCE] Creation of user-zh mailing list

2020-02-28 Thread Wei-Chiu Chuang
Hi!

Apache Hadoop welcomes contributors from around the world. On behalf of the
Hadoop PMC, I am pleased to announce the creation of a new mailing list
user...@hadoop.apache.org.

The intent of this mailing list is to act as a place for users to ask
questions about Apache Hadoop in Chinese (Traditional/Simplified).
Individuals who feel more comfortable communicating in Chinese should feel
welcome to ask questions in Chinese on this list.

Over the past few years we have observed a healthy growing local user local
community in China. Evidence include the first ever Hadoop Community Meetup
<https://blogs.apache.org/hadoop/entry/hadoop-community-meetup-beijing-aug> in
Beijing hosted by the Apache Hadoop PMC last year. We hope that by creating
a Mandarin Chinese friendly mailing list will make the Apache Hadoop
project a more diverse community. We also hope that by doing so, will make
the Chinese users to operate in the Apache Way, and to serve as a bridge
between the local user community with the global community.

Please note that while the user-zh mailing list is set up for user
discussions in Mandarin, development discussions such as design and code
changes should still go to *-dev@, JIRAs and GitHub in English as is.

The mailing list is live as of now and the website
<https://hadoop.apache.org/mailing_lists.html> will be updated shortly to
include the mailing list. Any one can subscribe to this list by sending an
email to user-zh-subscr...@hadoop.apache.org. Non-subscribers may also post
messages after the moderators' approvals.

- Wei-Chiu Chuang (on behalf of the Apache Hadoop PMC)


Re: How do I validate Data Encryption on Block data transfer?

2020-02-05 Thread Wei-Chiu Chuang
I don't know the answer to the question off the top of my head. Tracking
the source code, it looks like the data transfer encryption does not really
depend on Kerberos.

That said,
(1) the Hadoop data transfer encryption relies on the data encryption key
distributed by the NameNode. If a client can't validate authenticity of the
NameNode, it may not make much sense to encrypt.
(2) (With my Cloudera's hat on) If you use CDH, CM warns you that
encryption is not effective if kerberos is not on, and which means this
configuration is unsupported by Cloudera.

You can use packet sniffer to validate the encryption.

On Tue, Feb 4, 2020 at 3:44 PM Daniel Howard  wrote:

> Hello,
>
> My scenario is running Hadoop in an environment without multiple users in
> a secure datacenter. Nevertheless, we prefer to have encrypted data
> transfers for activity between nodes. We have determined that we do not
> need to set up Kerberos, so I am working through getting encryption going
> on block data transfer and web services.
>
> I appear to have DFS encryption enabled thanks to the following settings
> in *hdfs-site.xml*:
> 
>   
> dfs.encrypt.data.transfer
> true
>   
>   
> dfs.block.access.token.enable
> true
>   
>
> Indeed, I was getting handshake errors on the datanodes with
> dfs.encrypt.data.transfer enabled until I also set
> dfs.block.access.token.enable.
>
> Filesystem operations work great now, but I still see plenty of this:
>
> 2020-02-04 15:25:59,492 INFO sasl.SaslDataTransferClient: SASL encryption
> trust check: localHostTrusted = false, remoteHostTrusted = false
> 2020-02-04 15:25:59,862 INFO sasl.SaslDataTransferClient: SASL encryption
> trust check: localHostTrusted = false, remoteHostTrusted = false
> 2020-02-04 15:26:00,054 INFO sasl.SaslDataTransferClient: SASL encryption
> trust check: localHostTrusted = false, remoteHostTrusted = false
>
> I reckon that SASL is a Kerberos feature that I shouldn't ever expect to
> see reported as true. Does that sound right?
>
> Is there a way to verify that DFS is encrypting data between nodes? (I
> could get a sniffer out...)
>
> Thanks,
> -danny
>
> --
> http://dannyman.toldme.com
>


Re: Understanding the relationship between block size and RPC / IPC length?

2019-11-08 Thread Wei-Chiu Chuang
There are more details in this jira:
https://issues.apache.org/jira/browse/HADOOP-16452

Denser DataNodes are common. It is not uncommon to find a DataNode with > 7
> million blocks these days.
> With such a high number of blocks, the block report message can exceed the
> 64mb limit (defined by ipc.maximum.data.length). The block reports are
> rejected, causing missing blocks in HDFS. We had to double this
> configuration value in order to work around the issue.


On Fri, Nov 8, 2019 at 1:48 AM Carey, Paul 
wrote:

> Hi
>
>
>
> The NameNode logs in my HDFS instance recently started logging warnings of
> the form `Requested data length 145530837 is longer than maximum configured
> RPC length 144217728`.
>
>
>
> This ultimately manifested itself as the NameNode declaring thousands of
> blocks to be missing and 19 files to be corrupt.
>
>
>
> The situation was resolved by updating `ipc.maximum.data.length` to a
> value greater than the requested data length listed above. This is not a
> satisfying resolution though. I'd like to understand how this issue
> occurred.
>
>
>
> I've run `hdfs fsck -files -blocks -locations` and the largest block is of
> length `1342177728`.
>
>
>
> - Is there some overhead for RPC calls? Could a block of length
> `1342177728` be resulting in the original warning log at the top of this
> post?
>
> - My understanding is that the only way a client writing to HDFS can
> specify a block size is via either `-Ddfs.blocksize` or setting the
> corresponding property on the `Configuration` object when initialising the
> HDFS connection. Is this correct, or are there any other routes to creating
> excessively large blocks?
>
> - Other than overly large blocks, are there any other issues that could
> trigger the warning above?
>
>
>
> Many thanks
>
>
>
> Paul
>
> This email and any attachments are confidential and may also be
> privileged. If you are not the intended recipient, please delete all copies
> and notify the sender immediately. You may wish to refer to the
> incorporation details of Standard Chartered PLC, Standard Chartered Bank
> and their subsidiaries at https://www.sc.com/en/our-locations. Please
> refer to https://www.sc.com/en/privacy-policy/ for Standard Chartered
> Bank’s Privacy Policy.
>


Fwd: This week's Hadoop storage community online sync

2019-10-28 Thread Wei-Chiu Chuang
Normally I don't spam the online sync announcements in the Hadoop user
mailing alias. But this week's topic is more useful to Hadoop
users/administrators.
If you can't join this Tuesday evening's meetup at Yahoo, Yiqun graciously
agreed to give the same talk over the wire. See you there!

Best,
Weichiu

-- Forwarded message -
From: Wei-Chiu Chuang 
Date: Mon, Oct 28, 2019 at 7:41 PM
Subject: This week's Hadoop storage community online sync
To: Hdfs-dev , Hadoop Common <
common-...@hadoop.apache.org>


Hello, I am super stoked to have Yiqun Lin with us this Wednesday morning
Oct 30 US Pacific 10am/CET (Budapest) 6pm/ IST (Banglore) 10:30pm/ CST
(Beijing) Oct 31 1am / JST (Tokyo) 2am to talk about “HDFS Cluster
Optimization in eBay” — Yiqun happens to be in the bay area this week and
this is the same talk that he is going to present Tuesday night at Yahoo
this week.

HDFS Cluster Optimization in eBay



Yiqun Lin, Hadoop Team, eBay + Apache Hadoop Committer / PMC member
> On eBay, we have many large HDFS clusters with thousands of nodes. We face
> many stability/data availability problems in our cluster. Today we want to
> share some optimizations we did in the system layer or HDFS level to
> improve our clusters. Besides, that makes our cluster more stable than
> before.


Past meeting notes and zoom link:
https://docs.google.com/document/d/1jXM5Ujvf-zhcyw_5kiQVx6g-HeKe-YGnFS_1-qFXomI/edit?usp=sharing

Best,
Weichiu


Hadoop meetup at Yahoo this Tuesday evening

2019-10-28 Thread Wei-Chiu Chuang
Hi,
I don't think this meetup information is shared in the user mailing list,
so here it is:
https://www.meetup.com/hadoop/events/265963792


Join us at Yahoo’s HQ for awesome presentations (Uber, eBay, Cloudera,
Yahoo/Verizon Media), conversations, & networking! Pizza & refreshments
will be served!

[Location & Parking]

Yahoo Campus, 701 1st Ave, Sunnyvale (Building C, Classroom 4)

Please park in the garage attached to Building C, on the 3rd floor.

[Agenda]

5 - 5:45
Pizza, cookies, refreshments, & networking

5:45 - 6
Welcome & Intros

6 - 6:45
Raising the performance bar for stream processing with Apache Storm 2.0
Roshan Naik, Lead - Real-time Compute Platform, Uber

The effort to rearchitect Storm's core engine was born from the observation
that there exists a significant gap between hardware capabilities and the
performance of the best streaming engines. In this talk, we’ll take a look
at the performance and architecture of Storm's new engine which features a
leaner threading model, a lock free messaging subsystem and a new
ultra-lightweight Back Pressure model.

6:45 - 7:15
Quick Intro to Maha: Open source framework for rapid reporting API
development; with out of the box support for high cardinality dimension
lookups with Druid
Pranav Bhole, Sr Software Engineer, Verizon Media

7:15 - 7:45
HDFS Cluster Optimization in eBay
Yiqun Lin, Hadoop Team, eBay + Apache Hadoop Committer / PMC member

On eBay, we have many large HDFS clusters with thousands of nodes. We face
many stability/data availability problems in our cluster. Today we want to
share some optimizations we did in the system layer or HDFS level to
improve our clusters. Besides, that makes our cluster more stable than
before.

7:45 - 8:15
Ozone - Object Storage for Big Data
Arpit Agarwal, Senior Engineering Manager - Storage Team, Cloudera

Ozone is an Object Store for big data that is designed to keep the best
parts of HDFS while scaling to billions of files. Ozone is designed to
support the Hadoop ecosystem with applications like MapReduce, Hive, Spark,
and Impala working out of the box. This talk gives an overview of the Ozone
architecture and describes how we approached solving some of the scale
limitations of HDFS. We will also look at the current state and future
roadmap.

8:15 - 8:35
Storm 2.0 - Features and Performance Enhancements
Kishor Patil, Principal Software Engineer, Verizon Media + Apache Storm PMC


Re: Hadoop and OpenSSL 1.1.1

2019-10-09 Thread Wei-Chiu Chuang
Filed HADOOP-16647 <https://issues.apache.org/jira/browse/HADOOP-16647> I
am not planning to work on this any time soon so if any one is interested
feel free to pick it up/supply additional information.

On Wed, Oct 9, 2019 at 9:19 AM Wei-Chiu Chuang  wrote:

> Ok I stand corrected.
>
> That was for OpenSSL 1.1.0 and it might not even work for 1.1.1. The
> OpenSSL release version doesn't imply backward compatibility.
> Would you please try it out and let me know if HADOOP-14597 works or not?
> If not we need to file a jira to track this because OpenSSL 1.1.0 has
> already EOL'd and we need to look into support 1.1.1 (another 4 years until
> EOL)
> https://www.openssl.org/policies/releasestrat.html
>
> On Wed, Oct 9, 2019 at 8:55 AM Wei-Chiu Chuang 
> wrote:
>
>> See https://issues.apache.org/jira/browse/HADOOP-14597
>> OpenSSL 1.1.0 is supported with Hadoop 3.
>> We should backport this in Hadoop 2.
>>
>> I don't recall if we ever documented openssl version supported. Would be
>> nice to add that too.
>>
>> On Wed, Oct 9, 2019 at 12:09 AM Gonzalo Gomez 
>> wrote:
>>
>>> Hi, any comment regarding Hadoop and OpenSSL 1.1.1?
>>>
>>> On Fri, Oct 4, 2019 at 10:50 AM Gonzalo Gomez 
>>> wrote:
>>>
>>>> Hi all,
>>>>
>>>>
>>>> As OpenSSL 1.0.2 EOL is getting closer (end of this year [1]) I tried
>>>> to run Hadoop with OpenSSL 1.1.1d, but running the checknative command
>>>> gives false for the openssl library.
>>>>
>>>>
>>>> openssl: false EVP_CIPHER_CTX_cleanup
>>>>
>>>>
>>>> If I downgrade OpenSSL to 1.0.2s, it shows true and the path to
>>>> libcrypto.so library on my system. I searched for the
>>>> EVP_CIPHER_CTX_cleanup function and according to [2] it was removed on
>>>> OpenSSL 1.1.0. Can you tell me if Hadoop supports OpenSSL 1.1.1 or if it is
>>>> planned to be supported at any time in the future?
>>>>
>>>>
>>>> Regards,
>>>>
>>>> Gonzalo
>>>>
>>>>
>>>> [1] https://www.openssl.org/policies/releasestrat.html
>>>>
>>>> [2] https://stackoverflow.com/a/39762336
>>>>
>>>> --
>>>> Run your favorite apps in the cloud with Bitnami
>>>> Confidential - All Rights Reserved.
>>>> BitRock © 2019
>>>>
>>>
>>>
>>> --
>>> Run your favorite apps in the cloud with Bitnami
>>> Confidential - All Rights Reserved.
>>> BitRock © 2019
>>>
>>


Re: Hadoop and OpenSSL 1.1.1

2019-10-09 Thread Wei-Chiu Chuang
Ok I stand corrected.

That was for OpenSSL 1.1.0 and it might not even work for 1.1.1. The
OpenSSL release version doesn't imply backward compatibility.
Would you please try it out and let me know if HADOOP-14597 works or not?
If not we need to file a jira to track this because OpenSSL 1.1.0 has
already EOL'd and we need to look into support 1.1.1 (another 4 years until
EOL)
https://www.openssl.org/policies/releasestrat.html

On Wed, Oct 9, 2019 at 8:55 AM Wei-Chiu Chuang  wrote:

> See https://issues.apache.org/jira/browse/HADOOP-14597
> OpenSSL 1.1.0 is supported with Hadoop 3.
> We should backport this in Hadoop 2.
>
> I don't recall if we ever documented openssl version supported. Would be
> nice to add that too.
>
> On Wed, Oct 9, 2019 at 12:09 AM Gonzalo Gomez  wrote:
>
>> Hi, any comment regarding Hadoop and OpenSSL 1.1.1?
>>
>> On Fri, Oct 4, 2019 at 10:50 AM Gonzalo Gomez 
>> wrote:
>>
>>> Hi all,
>>>
>>>
>>> As OpenSSL 1.0.2 EOL is getting closer (end of this year [1]) I tried to
>>> run Hadoop with OpenSSL 1.1.1d, but running the checknative command gives
>>> false for the openssl library.
>>>
>>>
>>> openssl: false EVP_CIPHER_CTX_cleanup
>>>
>>>
>>> If I downgrade OpenSSL to 1.0.2s, it shows true and the path to
>>> libcrypto.so library on my system. I searched for the
>>> EVP_CIPHER_CTX_cleanup function and according to [2] it was removed on
>>> OpenSSL 1.1.0. Can you tell me if Hadoop supports OpenSSL 1.1.1 or if it is
>>> planned to be supported at any time in the future?
>>>
>>>
>>> Regards,
>>>
>>> Gonzalo
>>>
>>>
>>> [1] https://www.openssl.org/policies/releasestrat.html
>>>
>>> [2] https://stackoverflow.com/a/39762336
>>>
>>> --
>>> Run your favorite apps in the cloud with Bitnami
>>> Confidential - All Rights Reserved.
>>> BitRock © 2019
>>>
>>
>>
>> --
>> Run your favorite apps in the cloud with Bitnami
>> Confidential - All Rights Reserved.
>> BitRock © 2019
>>
>


Re: Hadoop and OpenSSL 1.1.1

2019-10-09 Thread Wei-Chiu Chuang
See https://issues.apache.org/jira/browse/HADOOP-14597
OpenSSL 1.1.0 is supported with Hadoop 3.
We should backport this in Hadoop 2.

I don't recall if we ever documented openssl version supported. Would be
nice to add that too.

On Wed, Oct 9, 2019 at 12:09 AM Gonzalo Gomez  wrote:

> Hi, any comment regarding Hadoop and OpenSSL 1.1.1?
>
> On Fri, Oct 4, 2019 at 10:50 AM Gonzalo Gomez  wrote:
>
>> Hi all,
>>
>>
>> As OpenSSL 1.0.2 EOL is getting closer (end of this year [1]) I tried to
>> run Hadoop with OpenSSL 1.1.1d, but running the checknative command gives
>> false for the openssl library.
>>
>>
>> openssl: false EVP_CIPHER_CTX_cleanup
>>
>>
>> If I downgrade OpenSSL to 1.0.2s, it shows true and the path to
>> libcrypto.so library on my system. I searched for the
>> EVP_CIPHER_CTX_cleanup function and according to [2] it was removed on
>> OpenSSL 1.1.0. Can you tell me if Hadoop supports OpenSSL 1.1.1 or if it is
>> planned to be supported at any time in the future?
>>
>>
>> Regards,
>>
>> Gonzalo
>>
>>
>> [1] https://www.openssl.org/policies/releasestrat.html
>>
>> [2] https://stackoverflow.com/a/39762336
>>
>> --
>> Run your favorite apps in the cloud with Bitnami
>> Confidential - All Rights Reserved.
>> BitRock © 2019
>>
>
>
> --
> Run your favorite apps in the cloud with Bitnami
> Confidential - All Rights Reserved.
> BitRock © 2019
>


Re: Is shortcircuit-read (SCR) really fast?

2019-09-04 Thread Wei-Chiu Chuang
Hi Daegyu,
let's move this discussion to the user group, so that any one else can
comment on this. I obviously don't have the best answers to the questions.
But these are great questions.

Re: benchmarks for SCR:
I believe yes. In fact, I found a benchmark running Accumulo and HBase on
HDFS
http://accumulosummit.com/2015/program/talks/hdfs-short-circuit-local-read-performance-benchmarking-with-apache-accumulo-and-apache-hbase/
However, HDFS SCR is a very old feature, and since this isn't new, there is
less interest in performing the same benchmarks. So I don't expect to see
new benchmarks.

As far as I know I've not received reports regarding SCR performance
regression for users running HBase or Impala (these two applications
typically have SCR enabled).

A colleague of mine, Nicolae (CC'ed here) is also doing a similar benchmark
with NVMe SSD. I believe Nicolae will be interested in how you pushed HDSF
to the hardware limit. IIRC the theoretical limit of DataNode to client is
about 500MB/s per client.

On Fri, Aug 30, 2019 at 11:49 PM Daegyu Han  wrote:

> Sorry for late reply.
>
> I used my own benchmark using HDFS api.
>
> The cluster environment I used is as follows:
>
> Hadoop 2.9.2
>
> Samsung nvme ssd
>
> Configured hdfs block size to 1GB.
>
> I used only one datanode to prevent remote read.
>
>
> First, I uploaded 1Gb file to hdfs.
>
> Then, I ran my benchmark code.
>
> I added some log to my hdfs code so I can see each method runtime.
>
>
> Anyway, Has there been any performance evaluation by companies using HDFS
> on SCR and legacy read?
>
>
> As far as I know, legacy read goes through datanode so it induce many
> sendfile system call and tcp socket opening overhead.
>
>
> Intuitively, I think SCR that client directly read file should be faster
> than legacy read.
>
> However, the first step which is requesting file is synchronous and can be
> overhead when using fast nvme ssd.
>
>
> What do you think?
>
>
> Thank you
>
>
>
> 2019년 8월 30일 (금) 22:27, Wei-Chiu Chuang 님이 작성:
>
>> Interesting benchmark. Thank you, Daegyu.
>> Can you try a larger file too? Like 128mb or 1gb? HDFS is not optimized
>> for smaller files.
>>
>> What did you use for benchmark?
>>
>> Daegyu Han 於 2019年8月29日 週四,下午11:40寫道:
>>
>>> Hi all,
>>>
>>> Is ShortCircuit read faster than legacy read which goes through data
>>> nodes?
>>>
>>> I have evaluated SCR and legacy local read on both HDD and NVMe SSD.
>>> However, I have not seen any results that SCR is faster than  legacy.
>>>
>>> Rather, SCR was slower than legacy when using NVMe SSDs because of the
>>> operation that initially to get the file descriptors.
>>>
>>> When I configured SCR, getBlockReader() elapsed time is slower than
>>> legacy local read.
>>>
>>> When I used NVMe SSD,
>>> I also found that DFSInputStream: dataIn.read() time is really similar
>>> to hardware limit.
>>> (8MB/0.00289sec) = 2800MB/s
>>>
>>> I checked the logs that the execution time measured by the application
>>> took 5ms to process 8mb.
>>>
>>> There is a 3 ms runtime difference between blockReader.doRead () in
>>> DFSInputStream.java and dataIn.read () in BlockReader.java.
>>> Where is this 3ms difference from?
>>>
>>> Thank you
>>> ᐧ
>>>
>>


Re: Is shortcircuit-read (SCR) really fast?

2019-08-30 Thread Wei-Chiu Chuang
Interesting benchmark. Thank you, Daegyu.
Can you try a larger file too? Like 128mb or 1gb? HDFS is not optimized for
smaller files.

What did you use for benchmark?

Daegyu Han 於 2019年8月29日 週四,下午11:40寫道:

> Hi all,
>
> Is ShortCircuit read faster than legacy read which goes through data nodes?
>
> I have evaluated SCR and legacy local read on both HDD and NVMe SSD.
> However, I have not seen any results that SCR is faster than  legacy.
>
> Rather, SCR was slower than legacy when using NVMe SSDs because of the
> operation that initially to get the file descriptors.
>
> When I configured SCR, getBlockReader() elapsed time is slower than legacy
> local read.
>
> When I used NVMe SSD,
> I also found that DFSInputStream: dataIn.read() time is really similar to
> hardware limit.
> (8MB/0.00289sec) = 2800MB/s
>
> I checked the logs that the execution time measured by the application
> took 5ms to process 8mb.
>
> There is a 3 ms runtime difference between blockReader.doRead () in
> DFSInputStream.java and dataIn.read () in BlockReader.java.
> Where is this 3ms difference from?
>
> Thank you
> ᐧ
>


Re: Hadoop storage community online sync

2019-08-21 Thread Wei-Chiu Chuang
We had a great turnout today, thanks to Konstantin for leading the
discussion of the NameNode Fine-Grained Locking proposal.

There were at least 16 participants joined the call.

Today's summary can be found here:
https://docs.google.com/document/d/1jXM5Ujvf-zhcyw_5kiQVx6g-HeKe-YGnFS_1-qFXomI/edit#

8/19/2019

We are moving the sync to 10AM US PDT!

NameNode Fine-Grained Locking via InMemory Namespace Partitioning

Attendee:

Konstantin, Chen, Weichiu, Xiaoyu, Anu, Matt, pljeliazkov, Chao Sun, Clay,
Bharat Viswanadham, Matt, Craig Condit, Matthew Sharp, skumpf, Artem
Ervits, Mohammad J Khan, Nanda, Alex Moundalexis.

Konstantin lead the discussion of HDFS-14703
<https://issues.apache.org/jira/browse/HDFS-14703>.

There are three important parts:

(1) Partition namespace into multiple GSet, different part of namespace can
be processed in parallel.

(2) INode Key

(3) Latch lock

How to support snapshot —> should be able to get partitioned similarly.

Balance partition strategies: several possible ways. Dynamic partition
strategy, Static partitioning strategy —> no need a higher level navigation
lock.

Dynamic strategy: starting with 1, and grow.

And: why does the design doc use static partitioning? determining the size
of partitions is hard. what about starting with 1024 partitions.

Hotspot problem

A related task, HDFS-14617
<https://issues.apache.org/jira/browse/HDFS-14617> (Improve fsimage load
time by writing sub-sections to the fsimage index) writes multiple inode
sections and inode directory sections, and load sections in parallel. It
sounds like we can combine it with the fine-grained locking and partition
inode/inode directory sections by the namespace partitions.

Anu: snapshot complicates design. Renames. Copy on write?

Anu: suggest to implement this feature without snapshot support to simplify
design and implementation.

Konstantin: will develop in a feature branch. Feel free to pick up jiras or
share thoughts.

FoldedTreeSet implemented in HDFS-9260
<https://issues.apache.org/jira/browse/HDFS-9260> is relevant. Need to fix
or revert before developing the namespace partitioning feature.

On Mon, Aug 19, 2019 at 2:55 PM Wei-Chiu Chuang 
wrote:

> For this week,
> We will have Konstantin and the LinkedIn folks to discuss a recent project
> that's been baking for quite a while. This is an exciting project as it has
> the potential to improve NameNode's throughput by 40%.
>
> HDFS-14703 <https://issues.apache.org/jira/browse/HDFS-14703> NameNode
> Fine-Grained Locking
>
> Access instruction, and the past sync notes are available here:
> https://docs.google.com/document/d/1jXM5Ujvf-zhcyw_5kiQVx6g-HeKe-YGnFS_1-qFXomI/edit?usp=sharing
>
> Reminder: We have Bi-weekly Hadoop storage online sync every other
> Wednesday.
> If there are no objections, I'd like to move the time to 10AM US pacific
> time (GMT-8)
>


Re: [DISCUSS] move storage community online sync schedule

2019-08-20 Thread Wei-Chiu Chuang
I don't see an objection so let's move to 10AM US Pacific Daylight Saving
Time (UTC-7) starting this Wednesday.
I'll update the relevant information.

Further reminder, Konstantin will talk about NameNode Fine-Grained Locking.

Please let me know if you have any feedbacks and ideas to share. Is the
discuss too developer-centric? Would you like to hear more about other
stuff? Should we talk about release plans?

Thanks all!

On Mon, Aug 19, 2019 at 8:45 AM Siddharth Wagle 
wrote:

> Correction: 10 am Pacific.
>
> On Mon, Aug 19, 2019, 8:44 AM Siddharth Wagle  wrote:
>
> > +1 for 10 pm.
> >
> > BR,
> > - Sid
> >
> > On Mon, Aug 19, 2019, 8:36 AM Wei-Chiu Chuang
> 
> > wrote:
> >
> >> I received some feedbacks that the bi-weekly storage online sync that
> >> happens Wednesday 9AM US pacific time (GMT-8) is too early for west
> coast
> >> folks, and the fact is that the majority of Hadoop developers are in the
> >> US.
> >>
> >> Would it make sense to move it to a later time to allow more US west
> coast
> >> particpants? Say 10AM US pacific time?
> >>
> >> Thanks,
> >> Weichiu
> >>
> >
>


Re: Hadoop Community Sync Up Schedule

2019-08-20 Thread Wei-Chiu Chuang
+1

On Mon, Aug 19, 2019 at 8:32 PM Wangda Tan  wrote:

> Hi folks,
>
> We have run community sync up for 1.5 months. I spoke to folks offline and
> got some feedback. Here's a summary of what I've observed from sync ups and
> talked to organizers.
>
> Following sync ups have very good participants (sometimes 10+ folks
> joined):
> - YARN/MR monthly sync up in APAC (Mandarin)
> - HDFS monthly sync up in APAC (Mandarin).
> - Submarine weekly sync up in APAC (Mandarin).
>
> Following sync up have OK-ish participants: (3-5 folks joined).
> - Storage monthly sync up in APAC (English)
> - Storage bi-weekly sync up in US (English)
> - YARN bi-weekly sync up in US (English).
>
> Following sync ups don't have good participants: (Skipped a couple of
> times).
> - YARN monthly sync up in APAC (English).
> - Submarine bi-weekly sync up in US (English).
>
> *So I'd like to propose the following changes and fixes of the schedule: *
> 1) Cancel the YARN/MR monthly sync up in APAC (English). Folks from APAC
> who speak English can choose to join the US session.
> 2) Cancel the Submarine bi-weekly sync up in US (English). Now Submarine
> developers and users are fast-growing in Mandarin-speaking areas. We can
> resume the sync if we do see demands from English-speaking areas.
> 3) Update the US sync up time from 9AM to 10AM PDT. 9AM is too early for
> most of the west-cost folks.
>
> *Following are fixes for the schedule:  *
> 1) In the proposal, repeats are not properly. (I used bi-weekly instead of
> 2nd/4th week as repeat frequency). I'd like to fix the frequency on Thu and
> it will take effect starting next week.
>
> Overall, thanks for everybody who participated in the sync ups. I do see
> community contributions grow in the last one month!
>
> Any thoughts about the proposal?
>
> Thanks,
> Wangda
>
>
>
>
> On Thu, Jul 25, 2019 at 11:53 AM 俊平堵  wrote:
>
> > Hi Folks,
> >
> >  Kindly remind that we have YARN+MR APAC sync today, and you are
> > welcome to join:
> >
> >
> > Time and Date:07/25 1:00 pm (CST Time)
> >
> > Zoom link:Zoom | https://cloudera.zoom.us/j/880548968
> >
> > Summary:
> >
> >
> https://docs.google.com/document/d/1GY55sXrekVd-aDyRY7uzaX0hMDPyh3T-AL1kUY2TI5M
> >
> >
> > Thanks,
> >
> >
> > Junping
> >
> >
> >
> > Wangda Tan  于2019年6月28日周五 上午2:57写道:
> >
> > > Hi folks,
> > >
> > > Here's the Hadoop Community Sync Up proposal/schedule:
> > >
> >
> https://docs.google.com/document/d/1GfNpYKhNUERAEH7m3yx6OfleoF3MqoQk3nJ7xqHD9nY/edit#heading=h.xh4zfwj8ppmn
> > >
> > > And here's calendar file:
> > >
> > >
> > >
> >
> https://calendar.google.com/calendar/ical/hadoop.community.sync.up%40gmail.com/public/basic.ics
> > >
> > > We gave it a try this week for YARN+MR and Submarine sync, feedbacks
> from
> > > participants seems pretty good, lots of new information shared during
> > sync
> > > up, and companies are using/developing Hadoop can better know each
> other.
> > >
> > > Next week there're 4 community sync-ups (Two Submarine for different
> > > timezones, one YARN+MR, one storage), please join to whichever you're
> > > interested:
> > >
> > > [image: image.png]
> > >
> > > Zoom info and notes can be found in the Google calendar invitation.
> > >
> > > Thanks,
> > > Wangda
> > >
> >
>


Re: Hadoop storage community online sync

2019-08-20 Thread Wei-Chiu Chuang
Great question!
Currently Pacific Daylight Saving Time is UTC-7, and Pacific Standard Time,
UTC-8 doesn't start until November 3rd.
I am being too US-centric, but if the purpose is to invite more people,
where many of them are US west coast based, we should do this following the
US pacific time zone (probably more specifically, California)

So GMT-7 it is.

On Mon, Aug 19, 2019 at 11:16 PM Akira Ajisaka  wrote:

> Thank you for the information.
>
> Now US pacific time is GMT-7, isn't it?
>
> -Akira
>
> On Tue, Aug 20, 2019 at 6:56 AM Wei-Chiu Chuang
>  wrote:
> >
> > For this week,
> > We will have Konstantin and the LinkedIn folks to discuss a recent
> project that's been baking for quite a while. This is an exciting project
> as it has the potential to improve NameNode's throughput by 40%.
> >
> > HDFS-14703 NameNode Fine-Grained Locking
> >
> > Access instruction, and the past sync notes are available here:
> https://docs.google.com/document/d/1jXM5Ujvf-zhcyw_5kiQVx6g-HeKe-YGnFS_1-qFXomI/edit?usp=sharing
> >
> > Reminder: We have Bi-weekly Hadoop storage online sync every other
> Wednesday.
> > If there are no objections, I'd like to move the time to 10AM US pacific
> time (GMT-8)
>
> -
> To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
>
>


Hadoop storage community online sync

2019-08-19 Thread Wei-Chiu Chuang
For this week,
We will have Konstantin and the LinkedIn folks to discuss a recent project
that's been baking for quite a while. This is an exciting project as it has
the potential to improve NameNode's throughput by 40%.

HDFS-14703  NameNode
Fine-Grained Locking

Access instruction, and the past sync notes are available here:
https://docs.google.com/document/d/1jXM5Ujvf-zhcyw_5kiQVx6g-HeKe-YGnFS_1-qFXomI/edit?usp=sharing

Reminder: We have Bi-weekly Hadoop storage online sync every other
Wednesday.
If there are no objections, I'd like to move the time to 10AM US pacific
time (GMT-8)


[DISCUSS] move storage community online sync schedule

2019-08-19 Thread Wei-Chiu Chuang
I received some feedbacks that the bi-weekly storage online sync that
happens Wednesday 9AM US pacific time (GMT-8) is too early for west coast
folks, and the fact is that the majority of Hadoop developers are in the US.

Would it make sense to move it to a later time to allow more US west coast
particpants? Say 10AM US pacific time?

Thanks,
Weichiu


Re: What do you think about HDFS using GFS2 (shared disk file system) or GPFS (parallel filesystem) rather than local file system?

2019-08-17 Thread Wei-Chiu Chuang
Not familiar with GPFS, but looking at IBM's website, GPFS has a client
that emulates Hadoop RPC
https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1adv_Overview.htm

So you can just use GPFS like HDFS. It may be the quickest way to approach
this use case and is supported.
Not sure about the performance though.

Looking at Cloudera's user doc
https://www.cloudera.com/documentation/other/reference-architecture/PDF/cloudera_ref_arch_stg_dev_accept_criteria.pdf

*High-throughput Storage Area Network (SAN) and other shared storage
solutions can present remote block devices to virtual machines in a
flexible and performant manner that is often indistinguishable from a local
disk. An Apache Hadoop workload provides a uniquely challenging IO profile
to these storage solutions, and this can have a negative impact on the
utility and stability of the Cloudera Enterprise cluster, and to other work
that is utilizing the same storage backend.*

*Warning: Running CDH on storage platforms other than direct-attached
physical disks can provide suboptimal performance. Cloudera Enterprise and
the majority of the Hadoop platform are optimized to provide high
performance by distributing work across a cluster that can utilize data
locality and fast local I/O.*

On Sat, Aug 17, 2019 at 2:12 AM Daegyu Han  wrote:

> Hi all,
>
> As far as I know, HDFS is designed to target local file systems like ext4
> or xfs.
>
> Is it a bad approach to use SAN technology as storage for HDFS?
>
> Thank you,
> Daegyu
> ᐧ
>


Re: Hadoop HDFS Fault Injection

2019-08-14 Thread Wei-Chiu Chuang
Aleksander,
Yes I am aware of that doc but I've never seen any one maintaining that
piece of code in the last 4 years. And I don't think any one had ever used
that.

On Wed, Aug 14, 2019 at 5:12 AM Aleksander Buła <
ab370...@students.mimuw.edu.pl> wrote:

> Hi,
>
> I would like to ask whether the Fault Injection Framework (
> https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-hdfs/FaultInjectFramework.html)
> is still supported in the Hadoop HDFS?
>
> I think that this documentation was created around *v0.23* and was not
> updated since then. Additionally, I have done some repository digging and
> found out that the ant targets, mentioned in the documentation, were
> deleted in 2012. Right now none of the files in the repository defines
> these targets but there the projects contains multiple *.aj *files -
> therefore I assume they somehow can be used.
>
> Does anyone here know how to compile and run fault injection tests in a
> newer version of Hadoop (exactly* v2.6.0*)?  It would mean a world to me.
>
> Best regards
> Alex
>


Please file a JIRA for your PR

2019-08-08 Thread Wei-Chiu Chuang
The Hadoop community welcome your patch contribution, and like increasingly
patches are submitted via GitHub Pull Requests.

That is great, as it reduces the friction to review code & commit code.

However, please make sure to file a jira for your PR, as described in the How
to Contribute
 wiki.
The fact is, if your PR isn't associated with a JIRA ID, and add the JIRA
ID in the title of your PR, your PR is not likely going to be noticed by
committers. Most Hadoop committers use Apache JIRA to track issues, and
folks usually find it easier to exchange in-depth technical discussion over
JIRAs than PR.

Thank you and happy patching!
Wei-Chiu


Hadoop Storage Online Sync Notes 8/5/2019

2019-08-07 Thread Wei-Chiu Chuang
Very happy to have CR from Uber leading today's discussion. Here's today's
sync meeting notes.
https://docs.google.com/document/d/1jXM5Ujvf-zhcyw_5kiQVx6g-HeKe-YGnFS_1-qFXomI/edit
8/5/2019

CR Hota (Uber) gave an update on Router Based Federation

Attendee: Cloudera (Weichiu, Adam), Uber (CR Hota) and Target (Craig)

Rename: There is a change in Hive that upon exception, do a copy instead.

How/where can the community help:it already support all NN APIs, running in
production, mostly now is the efficiency improvement.

How to migrate from non-federation to RBF —> easy. still use hfs:// scheme.

Will have to update metadata (HMS)

How to migrate from ViewFS based federation to RBF —> ViewFS use view’s://
so it’ll be harder to migrate.

View FS based is limited to 4 namespaces. There is no such limit in RBF.
Uber is already at 5 namespaces.

Cluster utilization

rebalancer. Not a priority at Uber because of UDestinty.

Router HA

supported. All routers’ state is synchronized. (Uber: has 10 routers in one
cluster)

Latency

compare to single Namenode which is bottlenecked in 1 NN lock.

Read-only name node help solve this problem too.

Presto is more latency sensitive. So Uber made a change to support
“read-only router”

In general, very negligible latency. If there is, just add more routers.

Uber doesn’t want to manage 4-5 thousand clusters. They want to manage some
set of 1000 thousand clusters in the future.

Isolation

There is a current problem. Very important for production deployment. See
HDFS-14090: fairness in router.


Let me know your feedback. Is this the right topic you are looking for? Do
you want to present other topics? Development discussion, demos, best
practices are welcomed.

Best,
Weichiu


Topics for Hadoop storage online sync

2019-08-05 Thread Wei-Chiu Chuang
Hello!

For this week's community online sync (English, Wednesday 9am US Pacific
Time), we will have CR Hota from Uber to talk about the latest update in
Router Based Federation.

He will touch upon the following topics:
1. Security (Development and zookeeper scale testing learnings)
2. Isolation for multiple clusters
3. Routers for Observer namenodes (Our internal design). Open source
implementation is yet to be done.
3. DNS Support

In case you missed the past community online sync, here's the information
to access (Zoom) and meeting notes:
https://docs.google.com/document/d/1jXM5Ujvf-zhcyw_5kiQVx6g-HeKe-YGnFS_1-qFXomI/edit

I am also looking for topics and maybe demos for the upcoming Mandarin
community sync call this week and in the future. So definitely reach out to
me so we can announce it in advance.

Thanks all!
Weichiu


Re: [DISCUSS] EOL 2.8 or another 2.8.x release?

2019-07-25 Thread Wei-Chiu Chuang
My bad -- Didn't realize I was looking at the old Hadoop page.
Here's the correct list of releases. https://hadoop.apache.org/releases.html

On Thu, Jul 25, 2019 at 12:49 AM 张铎(Duo Zhang) 
wrote:

> IIRC we have a 2.8.5 release?
>
> On the download page:
>
> 2.8.5 2018 Sep 15
>
> Wei-Chiu Chuang  于2019年7月25日周四 上午9:39写道:
>
> > The last 2.8 release (2.8.4) was made in the last May, more than a year
> > ago. https://hadoop.apache.org/old/releases.html
> >
> > How do folks feel about the fate of branch-2.8? During the last community
> > meetup in June, it sounds like most users are still on 2.8 or even 2.7,
> so
> > I don't think we want to abandon 2.8 just yet.
> >
> > I would personally want to urge folks to move up to 3.x, so I can stop
> > cherrypicking stuff all the way down into 2.8. But it's not up to me
> along
> > to decide :)
> >
> > How do people feel about having another 2.8 release or two? I am not
> saying
> > I want to drive it, but I want to raise the awareness that folks are
> still
> > on 2.8 and there's not been an update for over a year.
> >
> > Thoughts?
> >
>


[DISCUSS] EOL 2.8 or another 2.8.x release?

2019-07-24 Thread Wei-Chiu Chuang
The last 2.8 release (2.8.4) was made in the last May, more than a year
ago. https://hadoop.apache.org/old/releases.html

How do folks feel about the fate of branch-2.8? During the last community
meetup in June, it sounds like most users are still on 2.8 or even 2.7, so
I don't think we want to abandon 2.8 just yet.

I would personally want to urge folks to move up to 3.x, so I can stop
cherrypicking stuff all the way down into 2.8. But it's not up to me along
to decide :)

How do people feel about having another 2.8 release or two? I am not saying
I want to drive it, but I want to raise the awareness that folks are still
on 2.8 and there's not been an update for over a year.

Thoughts?


Re: Namenode crashes in 2.7.2

2019-07-11 Thread Wei-Chiu Chuang
Hi Kumar,
It seems like the fix for this bug addresses the root cause of the problem,
but doesn't seem to help when the NameNode already suffers this problem.
I would suggest you to download Hadoop 2.7.2, add a try catch block to
catch/swallow the NPE exception. Rebuild it and see if the NameNode can
start and check point properly & restart again.

On Thu, Jul 11, 2019 at 10:44 PM kumar r  wrote:

> Hi,
>
> In Hadoop-2.7.2, i am getting same error reported in here
> https://issues.apache.org/jira/browse/HDFS-12985
>
> Is there patch available for hadoop-2.7.2 version? How can i restart
> namenode without null pointer exception?
>
> Thanks,
> Kumar
>


Re: Is hadoop maintained?

2019-07-07 Thread Wei-Chiu Chuang
Yuri,
FreeBSD is not currently a supported operating system for Hadoop, and as
far as I know it receives pretty limited attention in the community.

Last time I checked, Hadoop source code does not compile on FreeBSD (Hadoop
2.x) out of box and FreeBSD's port has some source code change in order to
pass compilation.

Based on the error message, it looks to me the error is within
dev-support/bin/dist-layout-stitching

function findfileindir()
{
  declare file="$1"
  declare dir="${2:-./share}"
  declare count

  count=$(find "${dir}" -iname "${file}" | wc -l)

  #shellcheck disable=SC2086
  echo ${count}
}


if [[ -f "${src}" ]]; then
  srcname=${src##*/}
  if [[ "${srcname}" != *.jar ||
$(findfileindir "${srcname}") -eq "0" ]]; then
destdir=$(dirname "${dest}")
mkdir -p "${destdir}"
cp -p "${src}" "${dest}"
  fi

and this script is invoked by Maven hadoop-dist/pom.xml


  org.codehaus.mojo
  exec-maven-plugin
  

  dist
  prepare-package
  
exec
  
  
${shell-executable}
${project.build.directory}
false

  
${basedir}/../dev-support/bin/dist-layout-stitching
  ${project.version}
  ${project.build.directory}

  



I hope this'll get you somewhere.

On Fri, Jul 5, 2019 at 7:33 AM Yuri  wrote:

> I created this bug report, in an attempt to fix the FreeBSD port:
> https://issues.apache.org/jira/browse/HADOOP-16388 but there was no
> answer.
>
>
> Does anybody know if Hadoop is a maintained project, and if yes, how to
> get a hold of somebody who can help with this bug?
>
>
> Thank you,
>
> Yuri
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: user-h...@hadoop.apache.org
>
>


Re: NVMe Over fabric performance on HDFS

2019-06-25 Thread Wei-Chiu Chuang
There are a few Intel folks contributor NVMe related features in HDFS. They
are probably the best source for this questions.

Without having access to the NVMe hardware, it is hard to tell. I learned
GCE offers Intel Optane DC Persistent Memory attached instances. That can
be used for tests if any one is interested.

I personally have not received reports regarding unexpected performance
issue with NVMe with HDFS. A lot of test tuning could result in better
performance. File size can have a great impact in a TestDFSIO, for example.
You should also make sure you saturate the local NVMe rather than network
bandwidth. Try set replication factor=1? With the default replication
factor you pretty much saturate network rather than storage, I guess.

The Intel folks elected to implement DCPMM as a HDFS cache rather than a
storage. There's probably some consideration behind that.

On Tue, Jun 25, 2019 at 10:29 AM Daegyu Han  wrote:

> Hi Anu,
>
> Each datanode has own Samsung NVMe SSD which is on storage node.
> In other words, just separate compute node and storage (nvme ssd).
>
> I know that the maximum bandwidth of my Samsung NVMe SSD is about 3GB / s.
>
> Experimental results of TestDFSIO and HDFS_API show that the
> performance of local NVMe SSD is up to 2GB / s, while NVMeOF SSD has
> 500 ~ 800MB / s performance.
> Even IPoIB using InfiniBand has a bandwidth of 1GB / s.
>
> In research papers evaluating NVMeOF through FIO or KV Store
> applications, the performance of NVMeOF is similar to that of local
> SSD.
> They said also, in order to improve NVMeOF performance as much as
> local level, it is required to perform parallel IO.
> Why does not the performance of NVMeOF IO bandwidth in HDFS be as good as
> local?
>
> Regards,
> Daegyu
>
> 2019년 6월 26일 (수) 오전 12:04, Anu Engineer 님이 작성:
> >
> > Is your NVMe shared and all datanodes sending I/O to the same set of
> disks ? Is it possible for you to see the I/O queue length of the NVMe
> Devices?
> > I would suggest that you try to find out what is causing the perf issue,
> and once we know in ball park where the issue is -- that is, is it disks or
> HDFS, it might be possible to see what we can do.
> >
> >
> >
> > Thanks
> > Anu
> >
> >
> > On Tue, Jun 25, 2019 at 7:20 AM Daegyu Han  wrote:
> >>
> >> Hi all,
> >>
> >> I am using storage disaggregation by mounting nvme ssds on the storage
> node.
> >>
> >> When we connect the compute node and the storage node with nvme over
> >> fabric (nvmeof) and test it, performance is much lower than that of
> >> local storage (DAS).
> >>
> >> In general, we know that applications need to increase io parallelism
> >> and io size to improve the performance of nvmeof.
> >>
> >> How can I change the settings of hdfs specifically to improve the io
> >> performance of NVMeOF in HDFS?
> >>
> >> Best regards,
> >> Daegyu
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
> >> For additional commands, e-mail: user-h...@hadoop.apache.org
> >>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: user-h...@hadoop.apache.org
>
>


Re: Python Hadoop Example

2019-06-16 Thread Wei-Chiu Chuang
Thanks Artem,
Looks interesting. I honestly didn't know what Hadoop Streaming API is used
for.
Here are more references:
https://hadoop.apache.org/docs/r3.2.0/hadoop-streaming/HadoopStreaming.html

I think it brings to another question: how do we treat Python as a first
class citizen. Especially for data science use cases, Python is *the*
language.
For example, we have Java and C and (in Hadoop 3.2) C++ client for HDFS.
But Hadoop does not ship a Python client.
I see a number of Python libraries that support webhdfs. It's not clear to
me how well they perform, and if they support more advanced features like
encryption/Kerberos.

NFS gateway is a possibility. Fuse-dfs is another option. But we know they
don't work at scale, and the community seems to lost the steam to improve
NFS/fuse-dfs.

Thoughts?

On Sun, Jun 16, 2019 at 6:52 AM Artem Ervits  wrote:

>
> https://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
>
> On Sun, Jun 16, 2019, 9:18 AM Mike IT Expert 
> wrote:
>
>> Please let me know where I can find a good/simple example of mapreduce
>> Python code running on Hadoop. Like tutorial or sth.
>>
>> Thank you
>>
>>
>>


Re: HDFS Scalability Limit?

2019-06-15 Thread Wei-Chiu Chuang
Thank you, Kihwal for the insightful comments!

As I understand it, Yahoo's ops team has a good control of application
behavior. I tend to be conservative in terms of number of files&blocks and
heap size. We don't have such luxury, and our customers have a wide
spectrum of workloads and features (e.g., snapshots, data at-rest
encryption, Impala).

Yes -- decomm/recomm is a pain, and I am working with my colleague, @Stephen
O'Donnell  , to address this problem. Have you
tried maintenance mode? It's in Hadoop 2.9 but a number of decomm/recomm
needs are alleviated by maintenance mode.

I know Twitter is a big user of maintenance mode, and I'm wondering if
Twitter folks can offer some experience with it at large scale. CDH
supports maintenance mode, but our users don't seem to be quite familiar
with it. Are there issues that were dealt with, but not reported in the
JIRA? Basically, I'd like to know the operational complexity of this
feature at large scale.

On Thu, Jun 13, 2019 at 4:00 PM Kihwal Lee 
wrote:

> Hi Wei-Chiu,
>
> We have experience with 5,000 - 6,000 node clusters.  Although it ran/runs
> fine, any heavy hitter activities such as decommissioning needed to be
> carefully planned.   In terms of files and blocks, we have multiple
> clusters running stable with over 500M files and blocks.  Some at over 800M
> with the max heap at 256GB. It can probably go higher, but we haven't done
> performance testing & optimizations beyond 256GB yet.  All our clusters are
> un-federated. Funny how the feature was developed in Yahoo! and ended up
> not being used here. :)  We have a cluster with about 180PB of provisioned
> space. Many clusters are using over 100PB in their steady state.  We don't
> run datanodes too dense, so can't tell what the per-datanode limit is.
>
> Thanks and 73
> Kihwal
>
> On Thu, Jun 13, 2019 at 1:57 PM Wei-Chiu Chuang 
> wrote:
>
>> Hi community,
>>
>> I am currently drafting a HDFS scalability guideline doc, and I'd like to
>> understand any data points regarding HDFS scalability limit. I'd like to
>> share it publicly eventually.
>>
>> As an example, through my workplace, and through community chatters, I am
>> aware that HDFS is capable of operating at the following scale:
>>
>> Number of DataNodes:
>> Unfederated: I can reasonably believe a single HDFS NameNode can manage up
>> to 4000 DataNodes. Is there any one who would like to share an even larger
>> cluster?
>>
>> Federated: I am aware of one federated HDFS cluster composed of 20,000
>> DataNodes. JD.com
>> <
>> https://conferences.oreilly.com/strata/strata-eu-2018/public/schedule/detail/64692
>> >
>> has a 15K DN cluster and 210PB total capacity. I suspect it's a federated
>> HDFS cluster.
>>
>> Number of blocks & files:
>> 500 million files&blocks seems to be the upper limit at this point. At
>> this
>> scale NameNode consumes around 200GB heap, and my experience told me any
>> number beyond 200GB is unstable. But at some point I recalled some one
>> mentioned a 400GB NN heap.
>>
>> Amount of Data:
>> I am aware a few clusters more than 100PB in size (federated or not) --
>> Uber, Yahoo Japan, JD.com.
>>
>> Number of Volumes in a DataNode:
>> DataNodes with 24 volumes is known to work reasonably well. If DataNode is
>> used for archival use cases, a DN can have up to 48 volumes. This is
>> certainly hardware dependent, but if I know where the current limit is, I
>> can start optimizing the software.
>>
>> Total disk space:
>> CDH
>> <
>> https://www.cloudera.com/documentation/enterprise/release-notes/topics/hardware_requirements_guide.html#concept_fzz_dq4_gbb
>> >
>> recommends no more than 100TB per DataNode. Are there successful
>> deployments that install more than this number? Of course, you can easily
>> exceed this number if it is used purely for data archival.
>>
>>
>> What are other scalability limits that people are interested?
>>
>> Best,
>> Wei-Chiu
>>
>


HDFS Scalability Limit?

2019-06-13 Thread Wei-Chiu Chuang
Hi community,

I am currently drafting a HDFS scalability guideline doc, and I'd like to
understand any data points regarding HDFS scalability limit. I'd like to
share it publicly eventually.

As an example, through my workplace, and through community chatters, I am
aware that HDFS is capable of operating at the following scale:

Number of DataNodes:
Unfederated: I can reasonably believe a single HDFS NameNode can manage up
to 4000 DataNodes. Is there any one who would like to share an even larger
cluster?

Federated: I am aware of one federated HDFS cluster composed of 20,000
DataNodes. JD.com

has a 15K DN cluster and 210PB total capacity. I suspect it's a federated
HDFS cluster.

Number of blocks & files:
500 million files&blocks seems to be the upper limit at this point. At this
scale NameNode consumes around 200GB heap, and my experience told me any
number beyond 200GB is unstable. But at some point I recalled some one
mentioned a 400GB NN heap.

Amount of Data:
I am aware a few clusters more than 100PB in size (federated or not) --
Uber, Yahoo Japan, JD.com.

Number of Volumes in a DataNode:
DataNodes with 24 volumes is known to work reasonably well. If DataNode is
used for archival use cases, a DN can have up to 48 volumes. This is
certainly hardware dependent, but if I know where the current limit is, I
can start optimizing the software.

Total disk space:
CDH

recommends no more than 100TB per DataNode. Are there successful
deployments that install more than this number? Of course, you can easily
exceed this number if it is used purely for data archival.


What are other scalability limits that people are interested?

Best,
Wei-Chiu


Re: [DISCUSS] HDFS roadmap/wish list

2019-06-13 Thread Wei-Chiu Chuang
Thank you. I really appreciate your feedback as I don't always know the
detailed use case for a feature. (For me, it's mostly "hey, this thing is
broken, fix it")

What are the rest of the community thinks? This is a great opportunity to
share what you think.

My answers inline:

On Wed, Jun 12, 2019 at 1:12 AM Julien Laurenceau <
julien.laurenc...@pepitedata.com> wrote:

> Hi,
>
> I am not absolutely sure it is not already in a roadmap or supported, but
> I would appreciate those two features :
>
> - First feature : I would also like to be able to use a dedicated
> directory in HDFS as a /tmp directory leveraging RAMFS for high performing
> checkpoint of Spark Jobs without using Alluxio or Ignite.
>
My current issue is that the RAMFS is only useful with replication factor
> x1 (in order to avoid network).
> My default replication factor is x3, but I would need a way to set
> replication factor x1 on a specific directory (/tmp) for all new writes
> coming to this directory.
> Currently if I use "hdfs setrep 1 /tmp" it only works for blocks already
> written.
> For example, this could be done by specifying the replication factor at
> the storage policy level.
> In my view this would dramatically improve the interest of the
> Lazy-persist storage policy.
>

I am told LAZY_PERSIST is never considered a completed feature, and two
Hadoop distros, CDH and HDP don't support it.

But now that I understand the use case, it looks useful now.

> > From the Doc > Note 1: The Lazy_Persist policy is useful only for single
> replica blocks. For blocks with more than one replicas, all the replicas
> will be written to DISK since writing only one of the replicas to RAM_DISK
> does not improve the overall performance.
> In the current state of HDFS configuration, I only see the following hack
> (not tested) to implement such a solution : Configure HDFS replication x1
> as default configuration and use Erasure Coding RS(6,3) for the main
> storage by attaching an ec storage policy on all directories except /tmp.
>
> hdfs ec -setPolicy -path  [-policy ]
>
>
>
> - Second feature: a bandwidth throttling dedicated to the re-replication
> in case of a failed datanode.
> Something similar to the option dedicated to the balancing algorithm
> dfs.datanode.balance.bandwidthPerSecbut only for re-replication.
>
I am pretty sure I've got people asking about this before a few times.

>
> Thanks and regards
> JL
>
> Le lun. 10 juin 2019 à 19:08, Wei-Chiu Chuang 
> a écrit :
>
>> Hi!
>>
>> I am soliciting feedbacks for HDFS roadmap items and wish list in the
>> future Hadoop releases. A community meetup
>> <https://www.meetup.com/Hadoop-Contributors/events/262055924/?rv=ea1_v2&_xtd=gatlbWFpbF9jbGlja9oAJGJiNTE1ODdkLTY0MDAtNDFiZS1iOTU5LTM5ZWYyMDU1N2Q4Nw>
>> is happening soon, and perhaps we can use this thread to converge on things
>> we should talk about there.
>>
>> I am aware of several major features that merged into trunk, such as RBF,
>> Consistent Standby Serving Reads, as well as some recent features that
>> merged into 3.2.0 release (storage policy satisfier).
>>
>> What else should we be doing? I have a laundry list of supportability
>> improvement projects, mostly about improving performance or making
>> performance diagnostics easier. I can share the list if folks are
>> interested.
>>
>> Are there things we should do to make developer's life easier or things
>> that would be nice to have for downstream applications? I know Sahil Takiar
>> made a series of improvements in HDFS for Impala recently, and those
>> improvements are applicable to other downstreamers such as HBase. Or would
>> it help if we provide more Hadoop API examples?
>>
>


Re: [DISCUSS] HDFS roadmap/wish list

2019-06-11 Thread Wei-Chiu Chuang
Jeff,

Would Hadoop encryption zone/Transparent Data Encryption (TDE)
<https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html>
address this use case? Files within encryption zone are encrypted
transparently. Data is encrypted on DataNodes and are decrypted at client
side. Or would Data Transfer Encryption
<https://hadoop.apache.org/docs/r2.6.0/hadoop-project-dist/hadoop-common/SecureMode.html#Data_Encryption_on_Block_data_transfer.>
work for you? These are pretty mature these days so probably worth trying.

Definitely let me know if the encryption system I mentioned above doesn't
work for you. I know there are assumptions behind the design and doesn't
work for all use cases (it doesn't support per column encryption key in
HBase)

On Mon, Jun 10, 2019 at 8:07 PM Jeff Hubbs  wrote:

> Hi, Wei-Chiu -
>
> I don't know if this is something already in the pipeline for 3.x, but I'd
> like to see a mechanism in HDFS that encrypts blocks pre-storage such that
> I'd only have to manage keys in one place (NameManager?). If that
> capability existed, then I could move blocks around an unsafe network
> and/or not have to worry about my worker nodes having volume-level or
> whole-disk-level encryption. Even if I have Hadoop traffic only crossing a
> LAN that's captive to the cluster, I might still have to worry about worker
> nodes being stolen outright or having the drive(s) taken out of them.
>
> - Jeff
>
> On 6/10/19 8:40 PM, Wei-Chiu Chuang wrote:
>
>
> Thank you Sudeep for the feedback,
>
> To be more specific, what sort of examples are??you looking for???
>
> On another note, I had written some docs of extended length about Hadoop
> code base and internal designs. I should probably make those public to
> share the knowledge (or fix my grammar errors, for that matter)
>
> On Mon, Jun 10, 2019 at 12:11 PM Sudeep Singh Thakur <
> sudeepthaku...@gmail.com> wrote:
>
>> Hi ,
>>
>> Examples are most helpful for developer. Please add examples as much as
>> we can.
>>
>> Thanks??
>> Sudeep Thakur
>>
>> On Mon, Jun 10, 2019, 10:38 PM Wei-Chiu Chuang
>>   wrote:
>>
>>> Hi!
>>>
>>> I am soliciting feedbacks for HDFS roadmap items and wish list in the
>>> future Hadoop releases. A community meetup
>>> <https://www.meetup.com/Hadoop-Contributors/events/262055924/?rv=ea1_v2&_xtd=gatlbWFpbF9jbGlja9oAJGJiNTE1ODdkLTY0MDAtNDFiZS1iOTU5LTM5ZWYyMDU1N2Q4Nw>
>>> is happening soon, and perhaps we can use this thread to converge on things
>>> we should talk about there.
>>>
>>> I am aware of several major features that merged into trunk, such as
>>> RBF, Consistent Standby Serving Reads, as well as some recent features that
>>> merged into 3.2.0 release (storage policy satisfier).
>>>
>>> What else should we be doing? I have a laundry list of supportability
>>> improvement projects, mostly about improving performance or making
>>> performance diagnostics easier. I can share the list if folks are
>>> interested.
>>>
>>> Are there things we should do to make developer's life easier or things
>>> that would be nice to have for downstream applications? I know??Sahil
>>> Takiar made a series of improvements in HDFS for Impala recently, and those
>>> improvements are applicable to other downstreamers such as HBase. Or would
>>> it help if we provide more Hadoop API examples???
>>>
>>
>


Re: [DISCUSS] HDFS roadmap/wish list

2019-06-10 Thread Wei-Chiu Chuang
Thank you Sudeep for the feedback,

To be more specific, what sort of examples are you looking for?

On another note, I had written some docs of extended length about Hadoop
code base and internal designs. I should probably make those public to
share the knowledge (or fix my grammar errors, for that matter)

On Mon, Jun 10, 2019 at 12:11 PM Sudeep Singh Thakur <
sudeepthaku...@gmail.com> wrote:

> Hi ,
>
> Examples are most helpful for developer. Please add examples as much as we
> can.
>
> Thanks
> Sudeep Thakur
>
> On Mon, Jun 10, 2019, 10:38 PM Wei-Chiu Chuang
>  wrote:
>
>> Hi!
>>
>> I am soliciting feedbacks for HDFS roadmap items and wish list in the
>> future Hadoop releases. A community meetup
>> <https://www.meetup.com/Hadoop-Contributors/events/262055924/?rv=ea1_v2&_xtd=gatlbWFpbF9jbGlja9oAJGJiNTE1ODdkLTY0MDAtNDFiZS1iOTU5LTM5ZWYyMDU1N2Q4Nw>
>> is happening soon, and perhaps we can use this thread to converge on things
>> we should talk about there.
>>
>> I am aware of several major features that merged into trunk, such as RBF,
>> Consistent Standby Serving Reads, as well as some recent features that
>> merged into 3.2.0 release (storage policy satisfier).
>>
>> What else should we be doing? I have a laundry list of supportability
>> improvement projects, mostly about improving performance or making
>> performance diagnostics easier. I can share the list if folks are
>> interested.
>>
>> Are there things we should do to make developer's life easier or things
>> that would be nice to have for downstream applications? I know Sahil Takiar
>> made a series of improvements in HDFS for Impala recently, and those
>> improvements are applicable to other downstreamers such as HBase. Or would
>> it help if we provide more Hadoop API examples?
>>
>


[DISCUSS] HDFS roadmap/wish list

2019-06-10 Thread Wei-Chiu Chuang
Hi!

I am soliciting feedbacks for HDFS roadmap items and wish list in the
future Hadoop releases. A community meetup

is happening soon, and perhaps we can use this thread to converge on things
we should talk about there.

I am aware of several major features that merged into trunk, such as RBF,
Consistent Standby Serving Reads, as well as some recent features that
merged into 3.2.0 release (storage policy satisfier).

What else should we be doing? I have a laundry list of supportability
improvement projects, mostly about improving performance or making
performance diagnostics easier. I can share the list if folks are
interested.

Are there things we should do to make developer's life easier or things
that would be nice to have for downstream applications? I know Sahil Takiar
made a series of improvements in HDFS for Impala recently, and those
improvements are applicable to other downstreamers such as HBase. Or would
it help if we provide more Hadoop API examples?


Re: Webhdfs and S3

2019-05-22 Thread Wei-Chiu Chuang
You can start 2 httpfs servers (or even more), and let one set fs.defaultFS
to s3a://, and the other set to hdfs.
Will that work for you? Or is this not what you need?

On Wed, May 22, 2019 at 3:40 PM Joseph Henry  wrote:

> I thought about that, but we need to be able to access storage in native
> hdfs as well as S3 in the same cluster. If we change fs.defaultFS then I
> would not be able to access the HDFS storage.
>
>
>
> *From:* Wei-Chiu Chuang 
> *Sent:* Wednesday, May 22, 2019 9:36 AM
> *To:* Joseph Henry 
> *Cc:* user@hadoop.apache.org
> *Subject:* Re: Webhdfs and S3
>
>
>
> *EXTERNAL*
>
> I've never tried, but it seems possible to start a Httpfs server with
> fs.defaultFS = s3a://your-bucket
>
> Httpfs server speaks WebHDFS protocol so your webhdfs client can use
> webhdfs. And then for each webhdfs request, httpfs server translates that
> into the corresponding FileSystem API call. If the fs.defaultFS is the
> s3a:// URI, it may be able to talk to s3.
>
>
>
> On Wed, May 22, 2019 at 3:29 PM Joseph Henry  wrote:
>
> Hey,
>
>
>
> I am not sure if this is the correct mailing list for this question, but I
> will start here.
>
>
>
> Our client application needs to support accessing S3 buckets from hdfs. We
> can do this with the Java API using the s3a:// scheme, but also need a way
> to access the same files in S3 via the HDFS REST API.
>
>
>
> Is there a way to access the data stored in S3 via WEBHDFS?
>
>
>
> Thanks,
>
> Joseph Henry.
>
>
>
>


Re: Webhdfs and S3

2019-05-22 Thread Wei-Chiu Chuang
I've never tried, but it seems possible to start a Httpfs server with
fs.defaultFS = s3a://your-bucket
Httpfs server speaks WebHDFS protocol so your webhdfs client can use
webhdfs. And then for each webhdfs request, httpfs server translates that
into the corresponding FileSystem API call. If the fs.defaultFS is the
s3a:// URI, it may be able to talk to s3.

On Wed, May 22, 2019 at 3:29 PM Joseph Henry  wrote:

> Hey,
>
>
>
> I am not sure if this is the correct mailing list for this question, but I
> will start here.
>
>
>
> Our client application needs to support accessing S3 buckets from hdfs. We
> can do this with the Java API using the s3a:// scheme, but also need a way
> to access the same files in S3 via the HDFS REST API.
>
>
>
> Is there a way to access the data stored in S3 via WEBHDFS?
>
>
>
> Thanks,
>
> Joseph Henry.
>
>
>


Re: Right to be forgotten and HDFS

2019-04-15 Thread Wei-Chiu Chuang
Wow, Chao, didn't realize you guys are making Hudi into Apache :)
HDFS is generally not a good fit for this use case. I've seen people using
Kudu for GDPR compliance.

On Mon, Apr 15, 2019 at 11:11 AM Chao Sun  wrote:

> Checkout Hudi (https://github.com/apache/incubator-hudi) which adds
> upsert functionality on top of columnar data such as Parquet.
>
> Chao
>
> On Mon, Apr 15, 2019 at 10:49 AM Vinod Kumar Vavilapalli <
> vino...@apache.org> wrote:
>
>> If one uses HDFS as raw file storage where a single file intermingles
>> data from all users, it's not easy to achieve what you are trying to do.
>>
>> Instead, using systems (e.g. HBase, Hive) that support updates and
>> deletes to individual records is the only way to go.
>>
>> +Vinod
>>
>> On Apr 15, 2019, at 1:32 AM, Ivan Panico  wrote:
>>
>> Hi,
>>
>> Recent GDPR introduced a new right for people : the right to be
>> forgotten. This right means that if an organization is asked by a customer
>> to delete all his data, the organization have to comply most of the time
>> (there are conditions which can suspend this right but that's besides my
>> point).
>>
>> Now HDFS being WORM (Write Once Read Multpliple Times), I guess you see
>> where I'm going. What would be the best way to implement this line deletion
>> feature (supposing that when a customer asks for a delete of all his data,
>> the organization would have to delete some lines in some HDFS files).
>>
>> Right now I'm going for the following :
>>
>>- Create a key-value base (user, [files])
>>- On file writing, feed this base with the users and file location
>>(by appending or updating a key).
>>- When the deletion is requested by the user "john", look in that
>>base and rewrite all the files of the "john" key (read the file in 
>> memmory,
>>suppress the lines of "john", rewrite the files)
>>
>>
>> Would this be the most hadoop way to do that ?
>> I discarded some cryptoshredding like solution because the HDFS data has
>> to be readable by some mutliple proprietary softwares and by users at some
>> point and I'm not sur how to incorporate a decyphering step for all those
>> uses cases.
>> Also, I came up with this table solution because a violent grep for some
>> key on the whole HDFS tree seemed unlikely to scale but maybe I'm mistaken ?
>>
>> Thanks for your help,
>> Best regards
>>
>>
>>


Re: Files vs blocks

2019-01-29 Thread Wei-Chiu Chuang
I don't feel this is strictly a small file issue (since I am not seeing the
average file size)
But it looks like your directory/file ratio is way too low. I've seen that
when Hive creates too many partitions. That can render Hive queries
inefficient.

On Tue, Jan 29, 2019 at 2:09 PM Sudhir Babu Pothineni 
wrote:

>
> One of Hadoop cluster I am working
>
> 85,985,789 files and directories, 58,399,919 blocks = 144,385,717 total
> file system objects
>
> Heap memory used 132.0 GB of 256 GB Heap Memory.
>
> I feel it’s odd the ratio of files vs blocks way higher showing more of
> small files problem,
>
> But the cluster working fine. Am I worrying unnecessarily? we are using
> Hadoop 2.6.0
>
> Thanks
> Sudhir
> -
> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: user-h...@hadoop.apache.org
>
>


Re:

2018-12-20 Thread Wei-Chiu Chuang
+Hdfs-dev 
Hi Shuubham,

Just like to clarify a bit. What's the purpose of this work? Is this for
the general block placement policy in HDFS, or the
balancer/mover/diskbalancer, or decommissioning/recommissioning? Block
placement is determined by NameNode. Do you intend to shorten the time to
decide where a block is placed? Do you want to reduce the time such that
re-replication takes less time?

I'm asking this because I don't think there's ever a placementmonitor or a
blockmonitor class.

On Wed, Dec 19, 2018 at 10:36 PM Shuubham Ojha 
wrote:

> Hello All,
>
>I am Shuubham Ojha a graduate researcher with the
> University Of Melbourne. We have developed a block placement strategy which
> optimises delay associated with reconstruction. As a result of this
> optimisation problem, we get a placement matrix for blocks which tells us
> which block has to be placed at which node. We have been able to implement
> this strategy in Hadoop 2 by tweaking the file *placementmonitor.java*
> and *blockmover.java* where *placementmonitor.java* monitors the
> placement process and calls *blockmover.java* when the placement is not
> according the strategy. However, I can't find any such file analogous to
> *placementmonitor.java* in Hadoop 3 although I think that the closest
> file which performs this function is *balancer.java* located in
> hadoop-hdfs-project. Can anybody please provide me more information on this
> front?
>
>
> Warm Regards,
>
> Shuubham Ojha
>
> University Of Melbourne,
>
> Victoria, Australia- 3010
>


Re: Question about KMS

2018-12-10 Thread Wei-Chiu Chuang
Hi Xiaodong,

Generally speaking, admin operations are not in DistributedFileSystem class.
Some of the admin APIs may be found in HdfsAdmin (erasure coding, storage
policy APIs)

In this case, KeyProvider#createKey() does exactly what you want.

On Mon, Dec 10, 2018 at 6:58 PM  wrote:

> hello, everyone:
>
> Why there is no Java API for command  "hadoop key create" in
> DistributedFileSystem?
>
>
>
>
>
> 胡晓东 huxiaodong
>
>
> 网管及服务系统部 Network Management & Service System Dept
>
>
>
>
>
> MP: 17351011636
> E: hu.xiaod...@zte.com.cn
>
> -
> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: user-h...@hadoop.apache.org


Re: HDFS DirectoryScanner is bothering me

2018-12-04 Thread Wei-Chiu Chuang
Do you have a heapdump? Without a heapdump it's not easy to definitely
point to DirectoryScanner for GC issues.

That said, I did notice DirectoryScanner holding global lock for quite a
few seconds periodically, but that's unrelated. to GC.

On Thu, Nov 29, 2018 at 12:56 AM Yen-Onn Hiu  wrote:

> hi all,
>
> I am on hadoop hdfs version of Hadoop 2.6.0-cdh5.8.0. I discovered that
> the DirectoryScanner is keep causing java GC and slow down the hadoop
> nodes. Digging into the log file I discovered this:
>
> 2018-11-29 13:34:37,995 INFO
> org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool
> BP-1850109981-192.168.1.1-1413178082983 Total blocks: 3896197, missing
> metadata files:214, missing block files:214, missing blocks in memory:103,
> mismatched blocks:0
>
> Reading from internet posting, there are postings saying this is from
> DirectoryScanner which will be executed in every 6 hours. This directory
> scanning caused GC hiccup in all nodes and caused performance issues on the
> cluster.
>
> Question: when I am doing the hdfs dfsadmin -report. It does not say that
> I have any corrupted files. Also, I did the hdfs fsck / onto the directory
> and it does not yield any problems. How can I know what is the missing
> block files, missing blocks in memory and missing metadata files?
>
>
> Thanks!
>
> --
> Hiu Yen Onn
>
>
>


Re: spark structured streaming jobs working in HDP2.6 fail in HDP3.0

2018-08-30 Thread Wei-Chiu Chuang
Hi Lian, I don't know much about Spark structured streaming, but judging
from the stacktrace, you're application was trying to access
HftpFileSystem, which is removed in Apache Hadoop 3. Most likely it is
removed in HDP3.0 too (Hortonworks folks can confirm)
This is documented in CDH6.0 release note:
https://www.cloudera.com/documentation/enterprise/6/release-notes/topics/rg_cdh_600_incompatible_changes.html#hadoop_600_ic

Please use webhdfs or httpfs instead.

On Thu, Aug 30, 2018 at 9:36 AM Lian Jiang  wrote:

> I am using HDP3.0 which uses HADOOP3.1.0 and Spark 2.3.1. My spark
> streaming jobs running fine in HDP2.6.4 (HADOOP2.7.3, spark 2.2.0) fails in
> HDP3:
>
> java.lang.IllegalAccessError: class
> org.apache.hadoop.hdfs.web.HftpFileSystem cannot access its superinterface
> org.apache.hadoop.hdfs.web.TokenAspect$TokenManagementDelegator
>
> at java.lang.ClassLoader.defineClass1(Native Method)
>
> at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
>
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
>
> at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
>
> at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
>
> at java.security.AccessController.doPrivileged(Native Method)
>
> at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>
> at java.lang.Class.forName0(Native Method)
>
> at java.lang.Class.forName(Class.java:348)
>
> at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:370)
>
> at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
>
> at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
>
> at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:3268)
>
> at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3313)
>
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3352)
>
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
>
> at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3403)
>
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3371)
>
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:477)
>
> at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
>
> at
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.initialize(LineRecordReader.java:85)
>
> at
> org.apache.spark.sql.execution.datasources.HadoopFileLinesReader.(HadoopFileLinesReader.scala:46)
>
> at
> org.apache.spark.sql.execution.datasources.json.TextInputJsonDataSource$.readFile(JsonDataSource.scala:125)
>
> at
> org.apache.spark.sql.execution.datasources.json.JsonFileFormat$$anonfun$buildReader$2.apply(JsonFileFormat.scala:132)
>
> at
> org.apache.spark.sql.execution.datasources.json.JsonFileFormat$$anonfun$buildReader$2.apply(JsonFileFormat.scala:130)
>
> at
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:148)
>
> at
> org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:132)
>
> at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org
> $apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:128)
>
> at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:182)
>
> at
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
>
> at
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
> Source)
>
> at
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>
> at
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
>
> at
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:216)
>
> at
> org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:108)
>
> at
> org.apache.spark.sql.execution.SortExec$$anonfun$1.apply(SortExec.scala:101)
>
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>
> at
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:830)
>
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>
> at org.apache.spark.scheduler.Task.run(Task.scala:109)
>
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>
> at
> java.util.concurrent.Thread

Re: HDFS User impersonation on encrypted zone | Ranger KMS

2018-08-02 Thread Wei-Chiu Chuang
Hi, this is a supported use case.
Please make sure you configure the KMS proxy user correctly as well (it is
separately from HDFS proxy user settings)
https://hadoop.apache.org/docs/current/hadoop-kms/index.html#KMS_Proxyuser_Configuration

On Thu, Aug 2, 2018 at 12:30 PM Ashish Tadose 
wrote:

> Hi,
>
> Does HDFS user impersonation work on HDFS encrypted zone backed by ranger
> KMS?
>
> Our Hadoop environment configured with Kerberos and also supports creating
> an encrypted zone in HDFS by ranger KMS.
>
> Specific application id has HDFS user impersonation access to impersonate
> users of a certain group which works flawlessly on normal HDFS folders,
> however same not working on encrypted zones.
>
> PFB - Masked log extract
>
> WARN kms.LoadBalancingKMSClientProvider: KMS provider at [/kms/v1/]
> threw an IOException!! java.io.IOException:
> org.apache.hadoop.security.authentication.client.AuthenticationException:
> Authentication failed, URL:
> /kms/v1/keyversion/%400/_eek?eek_op=decrypt&doAs=&
> user.name=, status: 403, message: Forbidden
> at
> org.apache.hadoop.crypto.key.kms.KMSClientProvider.createConnection(KMSClientProvider.java:551)
> at
> org.apache.hadoop.crypto.key.kms.KMSClientProvider.decryptEncryptedKey(KMSClientProvider.java:831)
> at
> org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider$5.call(LoadBalancingKMSClientProvider.java:207)
> at
> org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider$5.call(LoadBalancingKMSClientProvider.java:203)
> at
> org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider.doOp(LoadBalancingKMSClientProvider.java:95)
> at
> org.apache.hadoop.crypto.key.kms.LoadBalancingKMSClientProvider.decryptEncryptedKey(LoadBalancingKMSClientProvider.java:203)
> at
> org.apache.hadoop.crypto.key.KeyProviderCryptoExtension.decryptEncryptedKey(KeyProviderCryptoExtension.java:388)
> at
> org.apache.hadoop.hdfs.DFSClient.decryptEncryptedDataEncryptionKey(DFSClient.java:1393)
> at
> org.apache.hadoop.hdfs.DFSClient.createWrappedInputStream(DFSClient.java:1463)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:333)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:327)
> at
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at
> org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:340)
> at com.wandisco.fs.client.ReplicatedFC.open(ReplicatedFC.java:752)
> at com.wandisco.fs.client.ReplicatedFC.xlateAndOpen(ReplicatedFC.java:377)
> at com.wandisco.fs.client.FusionHdfs.open(FusionHdfs.java:452)
> at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:786)
> at .EncryptFsTest.readFile(EncryptFsTest.java:118)
> at .EncryptFsTest$1.run(EncryptFsTest.java:71)
> at .kerberos.EncryptFsTest$1.run(EncryptFsTest.java:69)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
>
> Thanks in advance.
>
> Regards,
> Ashish
>
> --
> A very happy Hadoop contributor
>


Re: Hadoop impersonation not handling permissions

2018-07-30 Thread Wei-Chiu Chuang
Pretty sure this is the expected behavior.
>From the stacktrace, you're impersonation is configured correctly (i.e. it
successfully perform operation on behalf of user b) the problem is your
file doesn't allow b to access it.

On Mon, Jul 30, 2018 at 1:25 PM Harinder Singh <
harindersinghbedi...@gmail.com> wrote:

> Hi I am using hadoop proxy user/impersonation to access a directory on
> which the superuser has access, but it's giving me permission errors when
> the proxy user tries to access it:
>
> Say user "a" is a superuser and "b" is trying to access a directory on
> behalf of it. But "b" does not have permission on the directory, user "a"
> does have permissions. So shouldn't "b" be able to access that directory as
> well? Below is the exception I am getting:
>
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker:invoke 11: Exception <-
> abc-cdh-n1/192.168.*.*:8020: getListing
> {org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException):
> Permission denied: user=b, access=READ_EXECUTE,
> inode="/foo/one":hdfs:supergroup:drwx--
> at
> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkFsPermission(DefaultAuthorizationProvider.java:279)
> at
> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.check(DefaultAuthorizationProvider.java:260)
> at
> org.apache.hadoop.hdfs.server.namenode.DefaultAuthorizationProvider.checkPermission(DefaultAuthorizationProvider.java:168)
> at
> org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.checkPermission(FSPermissionChecker.java:152)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:3530)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPermission(FSDirectory.java:3513)
> at
> org.apache.hadoop.hdfs.server.namenode.FSDirectory.checkPathAccess(FSDirectory.java:3484)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkPathAccess(FSNamesystem.java:6624)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListingInt(FSNamesystem.java:5135)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getListing(FSNamesystem.java:5096)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getListing(NameNodeRpcServer.java:888)
> at
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getListing(AuthorizationProviderProxyClientProtocol.java:336)
> at
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getListing(ClientNamenodeProtocolServerSideTranslatorPB.java:630)
> at
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2217)
> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2213)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:415)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2211)
>
>
> My superuser is hdfs and I am using
> UserGroupInformation.loginUserFromKeytabAndReturnUGI(user, keyTabPath) with
> the hdfs principal in place of user and I don't have ACL's enabled. I have
> added the proxy user's settings as well. * for hdfs.
>
> So can someone guide me what am I missing here?
>
> --
> A very happy Hadoop contributor
>


Re: Security problem extra

2018-06-27 Thread Wei-Chiu Chuang
Hi Zongtian,
This is definitely not a JDK issue. This is the wire-protocol compatibility
between client and server (DataNode).

bq. what the client mean, it mean the application running on hdfs, how does
it have a encryption?
I'm not quite sure what you asked. HDFS supports at-rest encryption, data
transfer encryption, RPC encryption and SSL encryption.

I'd recommend you to make sure your Hadoop client version is the same as
the server version. The log message suggests the DataNode is on Hadoop
2.7.0+ version.

On Wed, Jun 27, 2018 at 2:24 AM ZongtianHou  wrote:

> Does anyone have some clue about it? I have updated the jdk, and still
> cannot solve the problem. Thx advance for any info!!
>
> On 27 Jun 2018, at 12:23 AM, ZongtianHou  wrote:
>
> This is the log info: org.apache.hadoop.hdfs.server.datanode.DataNode:
> Failed to read expected encryption handshake from client at /
> 127.0.0.1:53611. Perhaps the client is running an  older version of
> Hadoop which does not support encryption
>
> I have two more questions here.
> 1 what the client mean, it mean the application running on hdfs, how does
> it have a encryption?
> 2 I have turn off the encryption about data transfer, rpc protection, http
> protection by setting properties of  hadoop.rpc.protection, 
> dfs.encrypt.data.transfer
> and dfs.http.policy as false, why there is still encryption?
>
> Any clue will be appreciated.
>
>
>

-- 
A very happy Clouderan


Re: Automatic Failover to different Data Center.

2018-05-07 Thread Wei-Chiu Chuang
Distcp is a backup tool, not a synchronization tool.
At best, you get a point-in-time snapshot of the DC1. For example, a period
schedule of distcp every night at 12am. But in case of total failure, you
lose everything from that point in time.


On Mon, May 7, 2018 at 12:30 AM, akshay naidu 
wrote:

> Hello Hadoopers,
> I am planning for a Disaster Recovery(DR) project mainly for *hadoop
> clusters*.
> Infrastructure is in a DataCenter in West say DC1 . I Have created a
> backup hadoop-spark cluster in DataCenter in east say DC2. With Distcp will
> keep DC2  synchd with DC1 . This will work as DR .
>
> But what I want is that in case when DC1 went down completely, the
> automatic failover should happen and without any or very very less downtime
> DC2 is live.
>
> I have configured *hadoop high availability* and *Automatic Failover *in
> hadoop cluster in DC1 and it works fine. But that won't help in case whole
> DC1 goes down.
>
> Is there a solution where I can keep two hadoop clusters running in
> parallel, completely synchd, in two different DataCenters. In case hadoop
> cluster in DC1 goes down , Automatic failover occurs to DC2.
>
> Any hint would be of great help, any feedback, positive or negative, will
> be a great help.
>
> Thanks .
>



-- 
A very happy Clouderan


Re: Kerberos auth + user impersonation

2018-01-25 Thread Wei-Chiu Chuang
Hi Near,

Try setting proxyuser using with following doc:
https://www.cloudera.com/documentation/enterprise/latest/topics/admin_hdfs_proxy_users.html

A while ago I helped a customer of us to configure proxy user. If you have
at-rest encryption in the cluster, you'd also need to configure KMS
proxyuser as well.
https://www.cloudera.com/documentation/enterprise/latest/topics/cdh_sg_kms_security.html
It
isn't that obvious from CDH documentation nor Apache Hadoop doc.


On Thu, Jan 25, 2018 at 7:24 AM, Bear Giles  wrote:

> Hi, kerberos auth question here.
>
> We need to have Kerberos authentication with user impersonation. I know we
> had it working on one of our test clusters earlier but nobody can remember
> which one or how it was configured. :-(
>
> From my research I have the following items:
>
> 1. There is are Kerberos users alice@REALM and bob@REALM.
>
> 2. 'alice' is in the 'supergroup' group on the HDFS node I access.
>
> 3. The server has hadoop.proxyuser.alice.users = * set in core-site.xml.
> (see note)
>
> 4. I can connect using alice@REALM.
>
> 5. When I try to connect using UGI.createProxyUser("bob", alice) I get a
> "Client cannot authenticate via:[TOKEN, KERBEROS]" error.
>
> 6. I didn't have success with "bob@REALM" earlier but I've change the
> configuration since then so I might have missed something.
>
> Do I need to create an additional principal for alice? Something like
> alice/hdfs@REALM? alice/supergroup@REALM?
>
> Is there
>
> (note: We're using CDH and I'm setting this on the 'advanced configuration
> snippets' page. I saved the settings and restarted the servers but I'm not
> sure that the files are actually being updated. I've also changed the
> configuration files manually.)
>
> --
>
> Bear Giles
>
> Sr. Java Application Engineer
> bgi...@snaplogic.com
> Mobile: 720-749-7876 <(720)%20749-7876>
>
>
> 
>
>
>
> *SnapLogic Inc | 929 Pearl St #200
>  |
> 80303 CO 80302 | USA*
>
> *SnapLogic Inc | 2 W 5th Avenue 4th Floor | San Mateo CA 94402 | USA
> 
>   *
>
> This message is confidential. It may also be privileged or otherwise
> protected by work product immunity or other legal rules. If you have
> received it by mistake, please let us know by e-mail reply and delete it
> from your system; you may not copy this message or disclose its contents to
> anyone. The integrity and security of this message cannot be guaranteed on
> the Internet.
>



-- 
A very happy Hadoop contributor


Re: Aws EMR Hadoop Web Access

2018-01-09 Thread Wei-Chiu Chuang
There's a project called Apache Knox that seems to offers what you need.

https://hortonworks.com/apache/knox-gateway/


On Tue, Jan 9, 2018 at 2:20 PM, Jhon Anderson Cardenas Diaz <
jhonderson2...@gmail.com> wrote:

> According to aws documentation for EMR web access:
>
>
>
> *Setup Web Connection Hadoop, Ganglia, and other applications publish user
> interfaces as web sites hosted on the master node. For security reasons,
> these web sites are only available on the master node's local web server.To
> reach the web interfaces, you must establish an SSH tunnel with the master
> node using either dynamic or local port forwarding. If you establish an SSH
> tunnel using dynamic port forwarding, you must also configure a proxy
> server to view the web interfaces.*
>
> Are you planning in short or long term create a new feature that allows
> the web access to hadoop web resources without all this manual
> configuration?, maybe you managing the proxy redirections and all those
> things in such a way that user does not have to expose the EMR publically.
>
> Thanks.
>



-- 
A very happy Hadoop contributor


Re: UserGroupInformation and Kerberos

2018-01-02 Thread Wei-Chiu Chuang
Hi Jorge,

If you use Hadoop library as a client, and your first login using key is
via UserGroupInformation#loginUserFromKeytab(), the client automatically
relogins again using keytab when it gets an exception (see
o.a.h.ipc.Client#handleSaslConnectionFailure).

Note: using UserGroupInformation.loginUserFromSubject() won't do the same.
It is used when you have a valid tgt.

On Tue, Jan 2, 2018 at 11:40 AM, Jorge Machado  wrote:

> Hey everyone, I was working with UserGroupInformation Class and Kerberos.
>
>  Is there a proper example how to renew the Kerkebros Ticket from a keytab
> ?
>
> For Example:
>
> assuming that  I have the jaas.config set in the jvm I do:
>
> UserGroupInformation loginUser = UserGroupInformation.getLoginUser();
> This will login the user but not using a keytab.
>
> Using this code it will login with Kerberos:
> UserGroupInformation.setConfiguration(conf);
> Krb5LoginModule context = new Krb5LoginModule();
> Subject subject = new Subject();
> javax.security.auth.login.Configuration jconf = javax.security.auth.login.
> Configuration.getConfiguration();
> AppConfigurationEntry entries[] = jconf.getAppConfigurationEntry("
> Client");
> context.initialize(subject,null, new HashMap(),
> entries[0].getOptions());
> context.login();
> context.commit();
> UserGroupInformation.loginUserFromSubject(subject);
>
>
> How Do I make sure that my Keytab get’s renewed ? I think Hadoop Libraries
> should take of this. I can count  a lot of projects implementing their own
> TicketRewener…
>
> Any suggestions here ?
>
> Thanks
>
>
> Jorge Machado
>
>
>
>
>
>
>


-- 
A very happy Hadoop contributor


Re: Hive - Json Serde - ORC

2017-12-06 Thread Wei-Chiu Chuang
Hi I think you are better off asking this question at the hive mailing list.

Best

On Wed, Dec 6, 2017 at 6:43 AM, kaducangica .  wrote:

> Hi all,
>
> i have a very complex json that i need to insert in a hive table. A json
> example follws attached.
>
> First of all i read a json file with Spark to make some data processing
> and then i write to a stage table with no Serde and with no any kind of
> compression and format.
>
> Then i do an INSERT/SELECT into the "jsonTable" (create table attached)
> with no problems. This table use a json Serde 
> (org.openx.data.jsonserde.JsonSerDe)
> and a ORC format and is also particioned by date and timezone.
>
> The problem is that after all this process every time a try to make a
> simple "select * from jsonTable" query i got this error message:
>
> "Failed with exception java.io.IOException:java.io.IOException: Error
> reading file: hdfs://ip-xxx-xxx-xxx-xxx.sa-east-1.compute.int
> ernal:8020/user/hive/warehouse/jsonTable/data_posicao_short=
> 2017-12-02/veitimezone=America-Sao_Paulo/00_0"
>
> Actually i do not know if it is possible to use Serde, ORC and partition
> in the same table.
>
> Someone could help me?
>
> Thanks in advance.
> Best regards
>
> Carlos.
>
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: user-h...@hadoop.apache.org
>



-- 
A very happy Clouderan


Re: Sqoop and kerberos ldap hadoop authentication

2017-09-07 Thread Wei-Chiu Chuang
Hi,
The message "User xxx not found" feels more like group mapping error. Do
you have the relevant logs?

Integrating AD with Hadoop can be non-trivial, and Cloudera's general
recommendation is to use third party authentication integrator like SSSD or
Centrify, instead of using LdapGroupsMapping.

Hope that helps,
Wei-Chiu

On Thu, Sep 7, 2017 at 1:09 AM, dna_29a  wrote:

> Hi,
> I want to run sqoop jobs under kerberos authentication. If I have a ticket
> for local Kerberos user (local KDC and user exists as linux user on each
> host), sqoop works fine. Also, Kerberos uses cross-realm trust and accept
> Active Directory authentication. In this case, if I got ticket fot AD KDC
> user, sqoop jobs fails with message "User xxx not found". That means that
> AD user does not exist on each host of hadoop. After creating user on each
> host it work fine.
>
> In order to perform SSO principe and not to have a headache mantaining
> thouthand of users on hadoop hosts, is it possible to configure Sqoop to
> work with Active Directory KDC users?
>
> Thanks!
>
>
> Отправлено с устройства Samsung.
>



-- 
A very happy Clouderan


Re: SocketTimeoutException in DataXceiver

2016-12-20 Thread Wei-Chiu Chuang
This looks like a general issue, and there are multiple possible
explanations.
It could be either a flaky NIC, or flaky network switches.

On the other hand, if the DataNode is busy and all dataXceiver threads are
used (by default: 4096 threads), this error may also be seen at the client
side. Take a look at your DataNode log and see if you spot error messages
like "Xceiver count 4097  exceeds the limit of concurrent xcievers: 4096".
If this is the case, try to increase dfs.datanode.max.transfer.threads.
Depending on the incoming traffic and application, you might double or even
quadruple that number.

Your second error message is interesting. It might be corrupt blocks on
DataNodes (either hardware or software -- a few known bugs can lead to
this. I haven't checked if they are fixed in Hadoop 2.7.3. But there could
be other undiscovered bugs). It might due to a unresponsive DN (garbage
collection pause, kernel pause -- there are a few scenarios where kernel
could pause a process). You will need to look at datanode logs and kernel
dmesg log to understand why, and it often is time-consuming.


On Tue, Dec 20, 2016 at 8:11 AM, Joseph Naegele  wrote:

> Hi folks,
>
> I'm experiencing the exact symptoms of HDFS-770 (
> https://issues.apache.org/jira/browse/HDFS-770) using Spark and a basic
> HDFS deployment. Everything is running locally on a single machine. I'm
> using Hadoop 2.7.3. My HDFS deployment consists of a single 8 TB disk with
> replication disabled, otherwise everything is vanilla Hadoop 2.7.3. My
> Spark job uses a Hive ORC writer to write a  dataset to disk. The dataset
> itself is < 100 GB uncompressed, ~17 GB compressed.
>
> It does not appear to be a Spark issue. The datanode's logs show it
> receives the first ~500 packets for a block, then nothing for a minute,
> then the default channel read timeout of 6 ms causes the exception:
>
> 2016-12-19 18:36:50,632 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> opWriteBlock BP-1695049761-192.168.2.211-1479228275669:blk_1073957413_216632
> received exception java.net.SocketTimeoutException: 6 millis timeout
> while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected
> local=/127.0.0.1:50010 remote=/127.0.0.1:55866]
> 2016-12-19 18:36:50,632 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode:
> lamport.grierforensics.com:50010:DataXceiver error processing WRITE_BLOCK
> operation  src: /127.0.0.1:55866 dst: /127.0.0.1:50010
> java.net.SocketTimeoutException: 6 millis timeout while waiting for
> channel to be ready for read. ch : java.nio.channels.SocketChannel[connected
> local=/127.0.0.1:50010 remote=/127.0.0.1:55866]
> at org.apache.hadoop.net.SocketIOWithTimeout.doIO(
> SocketIOWithTimeout.java:164)
> at org.apache.hadoop.net.SocketInputStream.read(
> SocketInputStream.java:161)
> ...
>
> On the Spark side, all is well until the datanode's socket exception
> results in Spark experiencing a DFSOutputStream ResponseProcessor
> exception, followed by Spark aborting due to all datanodes being bad:
>
> 2016-12-19 18:36:59.014 WARN DFSClient: DFSOutputStream ResponseProcessor
> exception  for block BP-1695049761-192.168.2.211-
> 1479228275669:blk_1073957413_216632
> java.io.EOFException: Premature EOF: no length prefix available
> at org.apache.hadoop.hdfs.protocolPB.PBHelper.
> vintPrefixed(PBHelper.java:2203)
> at org.apache.hadoop.hdfs.protocol.datatransfer.
> PipelineAck.readFields(PipelineAck.java:176)
> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$
> ResponseProcessor.run(DFSOutputStream.java:867)
>
> ...
> Caused by: java.io.IOException: All datanodes 127.0.0.1:50010 are bad.
> Aborting...
> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.
> setupPipelineForAppendOrRecovery(DFSOutputStream.java:1206)
> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.
> processDatanodeError(DFSOutputStream.java:1004)
> at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.
> run(DFSOutputStream.java:548)
>
> I haven't tried adjusting the timeout yet for the same reason specified by
> the reporter of HDFS-770: I'm running everything locally, with no other
> tasks running on the system so why would I need a socket read timeout
> greater than 60 seconds? I haven't observed any CPU, memory or disk
> bottlenecks.
>
> Lowering the number of cores used by Spark does help alleviate the
> problem, but doesn't eliminate it, which led me to believe the issue may be
> disk contention (i.e. too many client writers?), but again, I haven't
> observed any disk IO bottlenecks at all.
>
> Does anyone else still experience HDFS-770 (https://issues.apache.org/
> jira/browse/HDFS-770) and is there a general approach/solution?
>
> Thanks
>
> ---
> Joe Naegele
> Grier Forensics
>
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@hadoop.apache.org
> For additional

Re: Encrypt a directory using some key (JAVA)

2016-12-14 Thread Wei-Chiu Chuang
Hi 
If you have access to Hadoop codebase, take a look at CryptoAdmin class, which 
implements these two commands.
Internally, the commands are implemented via 
DistributedFileSystem#createEncryptionZone and 
DistributedFileSystem#listEncryptionZones

Regards,
Wei-Chiu Chuang
A very happy Clouderan

> On Dec 14, 2016, at 5:39 AM, Aneela Saleem  wrote:
> 
> Hi,
> 
> I have successfully enables Hadoop with KMS and now I want to write some java 
> code to create key, get keys and encrypt a directory using a key. In other 
> words, I want to translate this command
> 
> hdfs hdfs crypto -createZone -keyName  -path /encryption_zone
> and 
> hdfs hdfs crypto -listZones
> 
> into java code. 
> 
> 
> Any suggestions will be appreciated.
> 
> Thanks



Re: Secure Hadoop - invalid Kerberos principal errors

2016-10-20 Thread Wei-Chiu Chuang
Instead of specifying host name of server principal,
have you tried to use hdfs/_h...@tnbsound.com?

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SecureMode.html#Kerberos_principals_for_Hadoop_Daemons
 
<http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SecureMode.html#Kerberos_principals_for_Hadoop_Daemons>

> dfs.journalnode.kerberos.principal
> hdfs/aw1hdnn001.tnbsound@tnbsound.com

Wei-Chiu Chuang
A very happy Clouderan

> On Oct 20, 2016, at 10:19 AM, Mark Selby  wrote:
> 
> We have an existing CDH 5.5.1 cluster with simple authentication and no 
> authorization. We are building out a new cluster and plan to move to CDH 
> 5.8.2 wiith Kerberos based authentication. We have an existing MIT Kerberos 
> infrastructure which we sucessfully use for a variety of services. 
> (ssh,apache,postfix)
> 
> I am very confident that out /etc/krb5.conf and name resolution is working. I 
> have even used HadoopDNSVerifier-1.0.jar to verify that java sees the same 
> name canonicalization that we see.
> 
> I have built and test cluster and closely followed the instructions on the 
> secure hadoop install doc from the clodera site making sure that all the conf 
> files are properly edited and all the Kerberos keytabs contain the correct 
> principals and have the correct permissions.
> 
> We are using HA namenodes with Quorm based journalmanagers
> 
> I am running into a persistent problem with many hadoop compents when they 
> need to talk securely to remote servers. The two example that I post here are 
> the namenode needing to talk to remote journalnodes and command line hdfs 
> client needing to speak to a remote namenode. Both give the same error
> 
> Server has invalid Kerberos principal: 
> hdfs/aw1hdnn002.tnbsound@tnbsound.com; Host Details : local host is: 
> "aw1hdnn001.tnbsound.com/10.132.8.19"; destination host is: 
> "aw1hdnn002.tnbsound.com":8020;
> 
> There is not much on the inter-webs about this and the error that is showing 
> up is leading me to belive that the issue is aroung the kerberos realm being 
> used in one place and not the other.
> 
> I just can not seem to figure out what is going on here as I know these are 
> vaild principals. I have added a snippet at the end where I have enabled 
> kerberos debugging to see if that helps at all
> 
> The weird part is that this error applies only to remote daemons. The local 
> namenode and journal node does not have the issue. We can “speak” locally but 
> not remotely.
> 
> All and Any help is greatly appreciated
> 
> #
> # This is me with hdfs kerberos credentials trying to run hdfs dfsadmin 
> -refreshServiceAcl
> #
> 
> hdfs@aw1hdnn001 /var/lib/hadoop-hdfs 53$ klist
> Ticket cache: FILE:/tmp/krb5cc_115
> Default principal: hdfs/aw1hdnn001.tnbsound@tnbsound.com
> Valid starting Expires Service principal
> 10/20/2016 15:34:49 10/21/2016 15:34:49 krbtgt/tnbsound@tnbsound.com
> renew until 10/27/2016 15:34:49
> 
> hdfs@aw1hdnn001 /var/lib/hadoop-hdfs 54$ hdfs dfsadmin -refreshServiceAcl
> Refresh service acl successful for aw1hdnn001.tnbsound.com/10.132.8.19:8020
> refreshServiceAcl: Failed on local exception: java.io.IOException: 
> java.lang.IllegalArgumentException: Server has invalid Kerberos principal: 
> hdfs/aw1hdnn002.tnbsound@tnbsound.com; Host Details : local host is: 
> "aw1hdnn001.tnbsound.com/10.132.8.19"; destination host is: 
> "aw1hdnn002.tnbsound.com":8020;
> 
> #
> # This is the namenode trying to start up and contant and off server 
> jornalnode
> #
> 2016-10-20 16:51:40,703 WARN org.apache.hadoop.security.UserGroupInformation: 
> PriviledgedActionException as:hdfs/aw1hdnn001.tnbsound@tnbsound.com 
> (auth:KERBEROS) cause:java.io.IOException: 
> java.lang.IllegalArgumentException: Server has invalid Kerberos principal: 
> hdfs/aw1hdrm001.tnbsound@tnbsound.com
> 10.132.8.21:8485: Failed on local exception: java.io.IOException: 
> java.lang.IllegalArgumentException: Server has invalid Kerberos principal: 
> hdfs/aw1hdrm001.tnbsound@tnbsound.com; Host Details : local host is: 
> "aw1hdnn001.tnbsound.com/10.132.8.19"; destination host is: 
> "aw1hdrm001.tnbsound.com":8485; 
> 
> #
> # This is me with hdfs kerberos credentials trying to run hdfs dfsadmin 
> -refreshServiceAcl with debug into
> #
> hdfs@aw1hdnn001 /var/lib/hadoop-hdfs 46$ 
> HADOOP_OPTS="-Dsun.security.krb5.debug=true" hdfs dfsadmin -refreshServiceAcl
> Java config name: null
> Native config name: /etc/krb5.conf
> Loaded from native config
> >>>KinitOptions cache name is /tmp/krb5cc_115
> >>>DEBUG  client pri

Re: Where does Hadoop get username and group mapping from for linux shell username and group mapping?

2016-10-14 Thread Wei-Chiu Chuang
If you want to drill down a bit, I recommend read this doc too: 
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/GroupsMapping.html
 
<http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/GroupsMapping.html>
This is for trunk Hadoop 3.0, but most of it applies to 2.7/2.8

Wei-Chiu Chuang
A very happy Clouderan

> On Oct 14, 2016, at 11:33 AM, Ravi Prakash  wrote:
> 
> Chen! 
> 
> It gets it from whatever is configured on the Namenode. 
> https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html#Group_Mapping
>  
> <https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html#Group_Mapping>
> 
> HTH
> Ravi
> 
> On Thu, Oct 13, 2016 at 7:43 PM, chen dong  <mailto:chendong...@gmail.com>> wrote:
> Hi, 
> 
> Currently I am working on a project to enhance the security for the Hadoop 
> cluster. Eventually I will use Kerberos and Sentry for authentication and 
> authorisation. And the username and group mapping will come from AD/LDAP (?), 
> I think so. 
> 
> But now I am just learning and trying. I have a question and I haven’t figure 
> it out is
> 
> where the username/group mapping information come from? 
> 
> As far as I know there is no username and group name for Hadoop and username 
> and group name come from the client wherever from local client machine or 
> Kerberos realm. But it is a little bit vague for me and can I get the 
> implementation details here? 
> 
> Is this information from the machine where HDFS client is located or from the 
> linux shell username and group on name node?  Or it depends on the context - 
> even related to data node? What if the data nodes and name nodes have 
> different users or user-group mapping in the local boxes. 
> 
> Regards,
> 
> Dong
> 
> 



Re: Authentication Failure talking to Ranger KMS

2016-10-11 Thread Wei-Chiu Chuang
Somes to me you encountered this bug? HDFS-10481 
<https://issues.apache.org/jira/browse/HDFS-10481>
If you’re using CDH, this is fixed in CDH5.5.5, CDH5.7.2 and CDH5.8.2

Wei-Chiu Chuang
A very happy Clouderan

> On Oct 11, 2016, at 8:38 AM, Benjamin Ross  wrote:
> 
> All,
> I'm trying to use httpfs to write to an encryption zone with security off.  I 
> can read from an encryption zone, but I can't write to one.
> 
> Here's the applicable namenode logs.  httpfs and root both have all possible 
> privileges in the KMS.  What am I missing?
> 
> 
> 2016-10-07 15:48:16,164 DEBUG ipc.Server 
> (Server.java:authorizeConnection(2095)) - Successfully authorized userInfo {
>   effectiveUser: "root"
>   realUser: "httpfs"
> }
> protocol: "org.apache.hadoop.hdfs.protocol.ClientProtocol"
> 
> 2016-10-07 15:48:16,164 DEBUG ipc.Server (Server.java:processOneRpc(1902)) -  
> got #2
> 2016-10-07 15:48:16,164 DEBUG ipc.Server (Server.java:run(2179)) - IPC Server 
> handler 9 on 8020: org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 
> 10.41.1.64:47622 Call#2 Retry#0 for RpcKind RPC_PROTOCOL_BUFFER
> 2016-10-07 15:48:16,165 DEBUG security.UserGroupInformation 
> (UserGroupInformation.java:logPrivilegedAction(1751)) - PrivilegedAction 
> as:root (auth:PROXY) via httpfs (auth:SIMPLE) 
> from:org.apache.hadoop.ipc.Server$Handler.run(Server.java:2205)
> 2016-10-07 15:48:16,166 DEBUG hdfs.StateChange 
> (NameNodeRpcServer.java:create(699)) - *DIR* NameNode.create: file 
> /tmp/cryptotest/hairyballs for DFSClient_NONMAPREDUCE_-1005188439_28 at 
> 10.41.1.64
> 2016-10-07 15:48:16,166 DEBUG hdfs.StateChange 
> (FSNamesystem.java:startFileInt(2411)) - DIR* NameSystem.startFile: 
> src=/tmp/cryptotest/hairyballs, holder=DFSClient_NONMAPREDUCE_-1005188439_28, 
> clientMachine=10.41.1.64, createParent=true, replication=3, createFlag=[CREATE
> , OVERWRITE], blockSize=134217728, 
> supportedVersions=[CryptoProtocolVersion{description='Encryption zones', 
> version=2, unknownValue=null}]
> 2016-10-07 15:48:16,167 DEBUG security.UserGroupInformation 
> (UserGroupInformation.java:logPrivilegedAction(1751)) - PrivilegedAction 
> as:hdfs (auth:SIMPLE) 
> from:org.apache.hadoop.crypto.key.kms.KMSClientProvider.createConnection(KMSClientProvider.java:484)
> 2016-10-07 15:48:16,171 DEBUG client.KerberosAuthenticator 
> (KerberosAuthenticator.java:authenticate(205)) - Using fallback authenticator 
> sequence.
> 2016-10-07 15:48:16,176 DEBUG security.UserGroupInformation 
> (UserGroupInformation.java:doAs(1728)) - PrivilegedActionException as:hdfs 
> (auth:SIMPLE) 
> cause:org.apache.hadoop.security.authentication.client.AuthenticationException:
>  Authentication failed, status: 403, messag
> e: Forbidden
> 2016-10-07 15:48:16,176 DEBUG ipc.Server (ProtobufRpcEngine.java:call(631)) - 
> Served: create queueTime= 2 procesingTime= 10 exception= IOException
> 2016-10-07 15:48:16,177 DEBUG security.UserGroupInformation 
> (UserGroupInformation.java:doAs(1728)) - PrivilegedActionException as:root 
> (auth:PROXY) via httpfs (auth:SIMPLE) cause:java.io.IOException: 
> java.util.concurrent.ExecutionException: java.io.IOException: org.apach
> e.hadoop.security.authentication.client.AuthenticationException: 
> Authentication failed, status: 403, message: Forbidden
> 2016-10-07 15:48:16,177 INFO  ipc.Server (Server.java:logException(2299)) - 
> IPC Server handler 9 on 8020, call 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.create from 10.41.1.64:47622 
> Call#2 Retry#0
> java.io.IOException: java.util.concurrent.ExecutionException: 
> java.io.IOException: 
> org.apache.hadoop.security.authentication.client.AuthenticationException: 
> Authentication failed, status: 403, message: Forbidden
> at 
> org.apache.hadoop.crypto.key.kms.KMSClientProvider.generateEncryptedKey(KMSClientProvider.java:750)
> at 
> org.apache.hadoop.crypto.key.KeyProviderCryptoExtension.generateEncryptedKey(KeyProviderCryptoExtension.java:371)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.generateEncryptedDataEncryptionKey(FSNamesystem.java:2352)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2478)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2377)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:716)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:405)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingM

Re: native snappy library not available: this version of libhadoop was built without snappy support.

2016-10-04 Thread Wei-Chiu Chuang
It seems to me this issue is the direct result of MAPREDUCE-6577 
<https://issues.apache.org/jira/browse/MAPREDUCE-6577>
Since you’re on a CDH cluster, I would suggest you to move up to CDH5.7.2 or 
above where this bug is fixed.

Best,
Wei-Chiu Chuang

> On Oct 4, 2016, at 1:26 PM, Wei-Chiu Chuang  wrote:
> 
> I see. Sorry for the confusion.
> 
> It seems to me the warning message a bit misleading. This message may also be 
> printed if libhadoop can not be loaded for any reason.
> Can you turn on debug log and see if the log contains either "Loaded the 
> native-hadoop library” or "Failed to load native-hadoop with error”?
> 
> 
> Wei-Chiu Chuang
> 
>> On Oct 4, 2016, at 1:12 PM, Uthayan Suthakar > <mailto:uthayan.sutha...@gmail.com>> wrote:
>> 
>> Hi Wei-Chiu,
>> 
>> My Hadoop version is Hadoop 2.6.0-cdh5.7.0.
>> 
>> But when I checked the native, it shows that it is installed:
>> 
>> hadoop checknative
>> 16/10/04 21:01:30 INFO bzip2.Bzip2Factory: Successfully loaded & initialized 
>> native-bzip2 library system-native
>> 16/10/04 21:01:30 INFO zlib.ZlibFactory: Successfully loaded & initialized 
>> native-zlib library
>> Native library checking:
>> hadoop:  true /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0
>> zlib:true /lib64/libz.so.1
>> snappy:  true /usr/lib/hadoop/lib/native/libsnappy.so.1
>> lz4: true revision:99
>> bzip2:   true /lib64/libbz2.so.1
>> openssl: true /usr/lib64/libcrypto.so
>> 
>> Thanks.
>> 
>> Uthay
>> 
>> 
>> On 4 October 2016 at 21:05, Wei-Chiu Chuang > <mailto:weic...@cloudera.com>> wrote:
>> Hi Uthayan,
>> what’s the version of Hadoop you have? Hadoop 2.7.3 binary does not ship 
>> with snappy precompiled. If this is the version you have you may have to 
>> rebuild Hadoop yourself to include it.
>> 
>> Wei-Chiu Chuang
>> 
>>> On Oct 4, 2016, at 12:59 PM, Uthayan Suthakar >> <mailto:uthayan.sutha...@gmail.com>> wrote:
>>> 
>>> Hello guys,
>>> 
>>> I have a job that reads compressed (Snappy) data but when I run the job, it 
>>> is throwing an error "native snappy library not available: this version of 
>>> libhadoop was built without snappy support".
>>> .  
>>> I followed this instruction but it did not resolve the issue:
>>> https://community.hortonworks.com/questions/18903/this-version-of-libhadoop-was-built-without-snappy.html
>>>  
>>> <https://community.hortonworks.com/questions/18903/this-version-of-libhadoop-was-built-without-snappy.html>
>>> 
>>> The check native command show that snappy is installed.
>>> hadoop checknative
>>> 16/10/04 21:01:30 INFO bzip2.Bzip2Factory: Successfully loaded & 
>>> initialized native-bzip2 library system-native
>>> 16/10/04 21:01:30 INFO zlib.ZlibFactory: Successfully loaded & initialized 
>>> native-zlib library
>>> Native library checking:
>>> hadoop:  true /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0
>>> zlib:true /lib64/libz.so.1
>>> snappy:  true /usr/lib/hadoop/lib/native/libsnappy.so.1
>>> lz4: true revision:99
>>> bzip2:   true /lib64/libbz2.so.1
>>> openssl: true /usr/lib64/libcrypto.so
>>> 
>>> I also have a code in the job to check whether native snappy is loaded, 
>>> which is returning true.
>>> 
>>> Now, I have no idea why I'm getting this error. Also, I had no issue 
>>> reading Snappy data using MapReduce job on the same cluster, Could anyone 
>>> tell me what is wrong?
>>> 
>>> 
>>> 
>>> Thank you.
>>> 
>>> Stack:
>>> 
>>> 
>>> java.lang.RuntimeException: native snappy library not available: this 
>>> version of libhadoop was built without snappy support.
>>> at 
>>> org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:65)
>>> at 
>>> org.apache.hadoop.io.compress.SnappyCodec.getDecompressorType(SnappyCodec.java:193)
>>> at 
>>> org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:178)
>>> at 
>>> org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:111)
>>> at 
>>> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>>> at 
>>> org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:237)
>>> at org.apache.spark.rdd.

Re: native snappy library not available: this version of libhadoop was built without snappy support.

2016-10-04 Thread Wei-Chiu Chuang
I see. Sorry for the confusion.

It seems to me the warning message a bit misleading. This message may also be 
printed if libhadoop can not be loaded for any reason.
Can you turn on debug log and see if the log contains either "Loaded the 
native-hadoop library” or "Failed to load native-hadoop with error”?


Wei-Chiu Chuang

> On Oct 4, 2016, at 1:12 PM, Uthayan Suthakar  
> wrote:
> 
> Hi Wei-Chiu,
> 
> My Hadoop version is Hadoop 2.6.0-cdh5.7.0.
> 
> But when I checked the native, it shows that it is installed:
> 
> hadoop checknative
> 16/10/04 21:01:30 INFO bzip2.Bzip2Factory: Successfully loaded & initialized 
> native-bzip2 library system-native
> 16/10/04 21:01:30 INFO zlib.ZlibFactory: Successfully loaded & initialized 
> native-zlib library
> Native library checking:
> hadoop:  true /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0
> zlib:true /lib64/libz.so.1
> snappy:  true /usr/lib/hadoop/lib/native/libsnappy.so.1
> lz4: true revision:99
> bzip2:   true /lib64/libbz2.so.1
> openssl: true /usr/lib64/libcrypto.so
> 
> Thanks.
> 
> Uthay
> 
> 
> On 4 October 2016 at 21:05, Wei-Chiu Chuang  <mailto:weic...@cloudera.com>> wrote:
> Hi Uthayan,
> what’s the version of Hadoop you have? Hadoop 2.7.3 binary does not ship with 
> snappy precompiled. If this is the version you have you may have to rebuild 
> Hadoop yourself to include it.
> 
> Wei-Chiu Chuang
> 
>> On Oct 4, 2016, at 12:59 PM, Uthayan Suthakar > <mailto:uthayan.sutha...@gmail.com>> wrote:
>> 
>> Hello guys,
>> 
>> I have a job that reads compressed (Snappy) data but when I run the job, it 
>> is throwing an error "native snappy library not available: this version of 
>> libhadoop was built without snappy support".
>> .  
>> I followed this instruction but it did not resolve the issue:
>> https://community.hortonworks.com/questions/18903/this-version-of-libhadoop-was-built-without-snappy.html
>>  
>> <https://community.hortonworks.com/questions/18903/this-version-of-libhadoop-was-built-without-snappy.html>
>> 
>> The check native command show that snappy is installed.
>> hadoop checknative
>> 16/10/04 21:01:30 INFO bzip2.Bzip2Factory: Successfully loaded & initialized 
>> native-bzip2 library system-native
>> 16/10/04 21:01:30 INFO zlib.ZlibFactory: Successfully loaded & initialized 
>> native-zlib library
>> Native library checking:
>> hadoop:  true /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0
>> zlib:true /lib64/libz.so.1
>> snappy:  true /usr/lib/hadoop/lib/native/libsnappy.so.1
>> lz4: true revision:99
>> bzip2:   true /lib64/libbz2.so.1
>> openssl: true /usr/lib64/libcrypto.so
>> 
>> I also have a code in the job to check whether native snappy is loaded, 
>> which is returning true.
>> 
>> Now, I have no idea why I'm getting this error. Also, I had no issue reading 
>> Snappy data using MapReduce job on the same cluster, Could anyone tell me 
>> what is wrong?
>> 
>> 
>> 
>> Thank you.
>> 
>> Stack:
>> 
>> 
>> java.lang.RuntimeException: native snappy library not available: this 
>> version of libhadoop was built without snappy support.
>> at 
>> org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:65)
>> at 
>> org.apache.hadoop.io.compress.SnappyCodec.getDecompressorType(SnappyCodec.java:193)
>> at 
>> org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:178)
>> at 
>> org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:111)
>> at 
>> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
>> at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:237)
>> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
>> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>> at 
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>>   

Re: native snappy library not available: this version of libhadoop was built without snappy support.

2016-10-04 Thread Wei-Chiu Chuang
Hi Uthayan,
what’s the version of Hadoop you have? Hadoop 2.7.3 binary does not ship with 
snappy precompiled. If this is the version you have you may have to rebuild 
Hadoop yourself to include it.

Wei-Chiu Chuang

> On Oct 4, 2016, at 12:59 PM, Uthayan Suthakar  
> wrote:
> 
> Hello guys,
> 
> I have a job that reads compressed (Snappy) data but when I run the job, it 
> is throwing an error "native snappy library not available: this version of 
> libhadoop was built without snappy support".
> .  
> I followed this instruction but it did not resolve the issue:
> https://community.hortonworks.com/questions/18903/this-version-of-libhadoop-was-built-without-snappy.html
>  
> <https://community.hortonworks.com/questions/18903/this-version-of-libhadoop-was-built-without-snappy.html>
> 
> The check native command show that snappy is installed.
> hadoop checknative
> 16/10/04 21:01:30 INFO bzip2.Bzip2Factory: Successfully loaded & initialized 
> native-bzip2 library system-native
> 16/10/04 21:01:30 INFO zlib.ZlibFactory: Successfully loaded & initialized 
> native-zlib library
> Native library checking:
> hadoop:  true /usr/lib/hadoop/lib/native/libhadoop.so.1.0.0
> zlib:true /lib64/libz.so.1
> snappy:  true /usr/lib/hadoop/lib/native/libsnappy.so.1
> lz4: true revision:99
> bzip2:   true /lib64/libbz2.so.1
> openssl: true /usr/lib64/libcrypto.so
> 
> I also have a code in the job to check whether native snappy is loaded, which 
> is returning true.
> 
> Now, I have no idea why I'm getting this error. Also, I had no issue reading 
> Snappy data using MapReduce job on the same cluster, Could anyone tell me 
> what is wrong?
> 
> 
> 
> Thank you.
> 
> Stack:
> 
> 
> java.lang.RuntimeException: native snappy library not available: this version 
> of libhadoop was built without snappy support.
> at 
> org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:65)
> at 
> org.apache.hadoop.io.compress.SnappyCodec.getDecompressorType(SnappyCodec.java:193)
> at 
> org.apache.hadoop.io.compress.CodecPool.getDecompressor(CodecPool.java:178)
> at 
> org.apache.hadoop.mapred.LineRecordReader.(LineRecordReader.java:111)
> at 
> org.apache.hadoop.mapred.TextInputFormat.getRecordReader(TextInputFormat.java:67)
> at org.apache.spark.rdd.HadoopRDD$$anon$1.(HadoopRDD.scala:237)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:208)
> at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:101)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)



Re: How to speed up the building process of Hadoop?

2016-09-30 Thread Wei-Chiu Chuang
One suggest: add -Dmaven.javadoc.skip=true
This parameter skips building javadocs. For me this reduces overall build time 
to about 2 minutes.


> On Sep 30, 2016, at 5:40 AM, Mohammed Q. Hussian  wrote:
> 
> Hi All. 
> 
> I'm building Hadoop from source using the following Maven command:
> mvn --offline package -Pdist -DskipTests"
> 
> Everything works fine but the problem is that the building process takes 
> time. I'm planing to modify Hadoop's source code and waiting about seven 
> minutes to compile the changes is not suitable. 
> 
> Is there anyway to speed the process up?
> 
> I tried some solutions that presented on the following two links and nothing 
> seems to work:
> 
> https://zeroturnaround.com/rebellabs/your-maven-build-is-slow-speed-it-up/ 
> 
> 
> http://blog.dblazejewski.com/2015/08/how-to-make-your-maven-build-fast-again/ 
> 
> 
> Regards.



Re: Hadoop KMS, security module

2016-09-26 Thread Wei-Chiu Chuang
Hi,
I'm not an expert in Hadoop KMS. But as far as I know Hadoop KMS itself
does not rely on particular hardware for the purpose.
The Hadoop KMS implementation is based on Java Provider API
https://docs.oracle.com/javase/7/docs/api/java/security/Provider.html

It looks like though there is ongoing effort to add HSM into Apache Ranger.



On Sat, Sep 24, 2016 at 6:55 PM, Ascot Moss  wrote:

> Hi,
>
> I am studying Hadoop KMS and encryption, I understand that Hadoop KMS is
> proxy of security module, have some questions and need help:
>
> Q1. Is there a reference list about Hardware Security Modules which
> support Hadoop KMS?
>
> Q2. Any suggestion about (open source) software security modules that can
> be used for evaluation and testing purposes on Hadoop KMS ?
>
> Regards
>
>


Re: hdfs2.7.3 kerberos can not startup

2016-09-20 Thread Wei-Chiu Chuang
You need to run kinit command to authenticate before running hdfs dfs -ls 
command.

Wei-Chiu Chuang

> On Sep 20, 2016, at 6:59 PM, kevin  wrote:
> 
> Thank you Brahma Reddy Battula.
> It's because of my problerm of the hdfs-site config file and https ca 
> configuration.
> now I can startup namenode and I can see the datanodes from the web.
> but When I try hdfs dfs -ls /:
> 
> [hadoop@dmp1 hadoop-2.7.3]$ hdfs dfs -ls /
> 16/09/20 07:56:48 WARN ipc.Client: Exception encountered while connecting to 
> the server : javax.security.sasl.SaslException: GSS initiate failed [Caused 
> by GSSException: No valid credentials provided (Mechanism level: Failed to 
> find any Kerberos tgt)]
> ls: Failed on local exception: java.io.IOException: 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]; Host Details : local host is: 
> "dmp1.example.com/192.168.249.129 <http://dmp1.example.com/192.168.249.129>"; 
> destination host is: "dmp1.example.com":9000; 
> 
> current user is hadoop which startup hdfs , and I have add addprinc hadoop 
> with commond :
> kadmin.local -q "addprinc hadoop" 
> 
> 
> 2016-09-20 17:33 GMT+08:00 Brahma Reddy Battula 
> mailto:brahmareddy.batt...@huawei.com>>:
> Seems to be property problem.. it should be principal ( “l” is missed).
> 
>  
> 
> 
> 
>   dfs.secondary.namenode.kerberos.principa
> 
>   hadoop/_h...@example.com <mailto:h...@example.com>
> 
> 
> 
>  
> 
>  
> 
> For namenode httpserver start fail, please check rakesh comments..
> 
>  
> 
> This is probably due to some missing configuration. 
> 
> Could you please re-check the ssl-server.xml, keystore and truststore 
> properties:
> 
>  
> 
> ssl.server.keystore.location
> 
> ssl.server.keystore.keypassword
> 
> ssl.client.truststore.location
> 
> ssl.client.truststore.password
> 
>  
> 
>  
> 
> --Brahma Reddy Battula
> 
>  
> 
> From: kevin [mailto:kiss.kevin...@gmail.com <mailto:kiss.kevin...@gmail.com>] 
> Sent: 20 September 2016 16:53
> To: Rakesh Radhakrishnan
> Cc: user.hadoop
> Subject: Re: hdfs2.7.3 kerberos can not startup
> 
>  
> 
> thanks, but my issue is name node could  Login successful,but second namenode 
> couldn't. and name node got a HttpServer.start() threw a non Bind IOException:
> 
>  
> 
> hdfs-site.xml:
> 
>  
> 
> 
> 
> dfs.webhdfs.enabled
> 
> true
> 
> 
> 
>  
> 
> 
> 
>   dfs.block.access.token.enable
> 
>   true
> 
> 
> 
>  
> 
> 
> 
> 
> 
>   dfs.namenode.kerberos.principal
> 
>   hadoop/_h...@example.com <mailto:h...@example.com>
> 
> 
> 
> 
> 
>   dfs.namenode.keytab.file
> 
>   /etc/hadoop/conf/hdfs.keytab
> 
> 
> 
> 
> 
>   dfs.https.port
> 
>   50470
> 
> 
> 
> 
> 
>   dfs.namenode.https-address
> 
>   dmp1.example.com:50470 <http://dmp1.example.com:50470/>
> 
> 
> 
> 
> 
>   dfs.namenode.kerberos.internal.spnego.principa
> 
>   HTTP/_h...@example.com <mailto:h...@example.com>
> 
> 
> 
> 
> 
>   dfs.web.authentication.kerberos.keytab
> 
>   /etc/hadoop/conf/hdfs.keytab
> 
> 
> 
> 
> 
>   dfs.http.policy
> 
>   HTTPS_ONLY
> 
> 
> 
> 
> 
>   dfs.https.enable
> 
>   true
> 
> 
> 
>  
> 
>  
> 
> 
> 
> 
> 
>   dfs.namenode.secondary.http-address
> 
>   dmp1.example.com:50090 <http://dmp1.example.com:50090/>
> 
> 
> 
> 
> 
>   dfs.secondary.namenode.keytab.file
> 
>   /etc/hadoop/conf/hdfs.keytab
> 
> 
> 
> 
> 
>   dfs.secondary.namenode.kerberos.principa
> 
>   hadoop/_h...@example.com <mailto:h...@example.com>
> 
>  
> 
> 
> 
>   dfs.secondary.namenode.kerberos.internal.spnego.principal
> 
>   HTTP/_h...@example.com <mailto:h...@example.com>
> 
> 
> 
> 
> 
>   dfs.namenode.secondary.https-port
> 
>   50470
> 
> 
> 
>  
> 
>  
> 
> 
> 
>  
> 
> 
> 
>   dfs.journalnode.keytab.file
> 
>   /etc/hadoop/conf/hdfs.keytab
> 
> 
> 
> 
> 
>   dfs.journalnode.kerberos.principa
> 
>   hadoop/_h...@example.com <mailto:h...@example.com>
> 
>  
> 
> 
> 
>   dfs.journalnode.kerberos.internal.spnego.principa
> 
>   HTTP/_h...@example.com <mailto:h...@example.com>
> 
> 
> 
> 
> 
>   dfs.web.authentication.kerberos.key

  1   2   >