[jira] [Created] (HDFS-14228) Incorrect getSnapshottableDirListing() javadoc

2019-01-24 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-14228:
--

 Summary: Incorrect getSnapshottableDirListing() javadoc
 Key: HDFS-14228
 URL: https://issues.apache.org/jira/browse/HDFS-14228
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: snapshots
Affects Versions: 2.1.0-beta
Reporter: Wei-Chiu Chuang


The Javadoc for {{DistributedFileSystem#getSnapshottableDirListing()}} is not 
consistent with {{FSNamesystem#getSnapshottableDirListing()}}

{code:title=ClientProtocol#getSnapshottableDirListing()}
  /**
   * Get listing of all the snapshottable directories.
   *
   * @return Information about all the current snapshottable directory
   * @throws IOException If an I/O error occurred
   */
  @Idempotent
  @ReadOnly(isCoordinated = true)
  SnapshottableDirectoryStatus[] getSnapshottableDirListing()
      throws IOException;
{code}

{code:title=DistributedFileSystem#getSnapshottableDirListing()}
  /**
   * @return All the snapshottable directories
   * @throws IOException
   */
  public SnapshottableDirectoryStatus[] getSnapshottableDirListing()
{code}

But the implementation at NameNode side is:
{code:title=FSNamesystem#getSnapshottableDirListing()}
  /**
   * Get the list of snapshottable directories that are owned
   * by the current user. Return all the snapshottable directories if the
   * current user is a super user.
   * @return The list of all the current snapshottable directories
   * @throws IOException
   */
  public SnapshottableDirectoryStatus[] getSnapshottableDirListing()
{code}

That is, if this method is called by a non-super user, it does not return all 
snapshottable directories. Filing this jira to get the javadoc corrected and 
avoid confusion.
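For illustration, a corrected javadoc for the DistributedFileSystem method could borrow the FSNamesystem wording. This is only a sketch of the intended fix, not a committed patch:

```java
  /**
   * Get the list of snapshottable directories that are owned by the
   * current user. Returns all the snapshottable directories if the
   * current user is a super user.
   *
   * @return The snapshottable directories visible to the caller
   * @throws IOException If an I/O error occurred
   */
  public SnapshottableDirectoryStatus[] getSnapshottableDirListing()
      throws IOException;
```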



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[jira] [Created] (HDFS-14194) Mention HDFS ACL incompatible changes more explicitly

2019-01-09 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-14194:
--

 Summary: Mention HDFS ACL incompatible changes more explicitly
 Key: HDFS-14194
 URL: https://issues.apache.org/jira/browse/HDFS-14194
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: documentation, namenode
Affects Versions: 3.0.0-beta1
Reporter: Wei-Chiu Chuang


HDFS-11957 enabled POSIX ACL inheritance by default, flipping the default of 
dfs.namenode.posix.acl.inheritance.enabled to true.

Even though the change was documented in the ACL doc, it is not explicit. Users 
upgrading to Hadoop 3.0 and beyond will be caught by surprise. The doc should be 
updated to make this clear, preferably with examples showing what to expect, so 
that search engines can hopefully surface the doc.
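As a sketch of what such an example could show, an operator who wants the pre-3.0 behavior back would presumably override the property in hdfs-site.xml (the property name is from HDFS-11957; verify the exact semantics against the ACL documentation):

```xml
<!-- hdfs-site.xml: opt back out of POSIX ACL inheritance (illustrative) -->
<property>
  <name>dfs.namenode.posix.acl.inheritance.enabled</name>
  <value>false</value>
</property>
```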






[jira] [Created] (HDFS-14176) Replace incorrect use of system property user.name

2018-12-27 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-14176:
--

 Summary: Replace incorrect use of system property user.name
 Key: HDFS-14176
 URL: https://issues.apache.org/jira/browse/HDFS-14176
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.3.0
 Environment: Kerberized
Reporter: Wei-Chiu Chuang


Looking at the Hadoop source code, there are a few places where the code 
assumes the user name can be acquired from Java's system property {{user.name}}.

For example,
{code:java|title=FileSystem}
  /** Return the current user's home directory in this FileSystem.
   * The default implementation returns {@code "/user/$USER/"}.
   */
  public Path getHomeDirectory() {
    return this.makeQualified(
        new Path(USER_HOME_PREFIX + "/" + System.getProperty("user.name")));
  }
{code}
This is incorrect: in a Kerberized environment, a user may log in as a user 
principal different from their system login account.

It would be better to use 
{{UserGroupInformation.getCurrentUser().getShortUserName()}}, similar to 
HDFS-12485.
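As a minimal, standalone illustration of the pitfall (pure JDK only; the UserGroupInformation call suggested above needs the Hadoop jars, so it appears only as a comment):

```java
public class UserNameDemo {
    public static void main(String[] args) {
        // user.name is set by the JVM from the OS login account. In a
        // Kerberized deployment the effective Hadoop identity may be a
        // different Kerberos principal, so this value can be wrong.
        String osUser = System.getProperty("user.name");
        System.out.println("user.name = " + osUser);

        // Preferred in Hadoop code (requires hadoop-common, not shown here):
        //   UserGroupInformation.getCurrentUser().getShortUserName()
    }
}
```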

Unfortunately, I am seeing this improper use in YARN, HDFS federation, 
SFTPFileSystem, and Ozone code (tests are ignored).

The impact should be small, since it only affects the case where the system is 
Kerberized and the user principal differs from the system login account.






Re:

2018-12-20 Thread Wei-Chiu Chuang
+Hdfs-dev 
Hi Shuubham,

I'd just like to clarify a bit. What's the purpose of this work? Is this for
the general block placement policy in HDFS, the
balancer/mover/diskbalancer, or decommissioning/recommissioning? Block
placement is determined by the NameNode. Do you intend to shorten the time to
decide where a block is placed? Do you want to reduce the time so that
re-replication takes less time?

I'm asking this because I don't think there has ever been a placementmonitor
or a blockmonitor class.

On Wed, Dec 19, 2018 at 10:36 PM Shuubham Ojha 
wrote:

> Hello All,
>
>I am Shuubham Ojha a graduate researcher with the
> University Of Melbourne. We have developed a block placement strategy which
> optimises delay associated with reconstruction. As a result of this
> optimisation problem, we get a placement matrix for blocks which tells us
> which block has to be placed at which node. We have been able to implement
> this strategy in Hadoop 2 by tweaking the file *placementmonitor.java*
> and *blockmover.java* where *placementmonitor.java* monitors the
> placement process and calls *blockmover.java* when the placement is not
> according to the strategy. However, I can't find any such file analogous to
> *placementmonitor.java* in Hadoop 3 although I think that the closest
> file which performs this function is *balancer.java* located in
> hadoop-hdfs-project. Can anybody please provide me more information on this
> front?
>
>
> Warm Regards,
>
> Shuubham Ojha
>
> University Of Melbourne,
>
> Victoria, Australia- 3010
>


Changing RPC SASL options without full cluster restart?

2018-12-14 Thread Wei-Chiu Chuang
Hi fellow Hadoop developers,

Do you know a way to change RPC SASL options without a full cluster restart
(that is, with a rolling restart)? For example, enabling RPC encryption.
Currently, if you try to do a rolling restart after enabling RPC encryption,
applications such as HBase would fail to connect to the NameNode, because the
two sides use different SASL configurations during the rolling restart.

Would HDFS-13566 (Add configurable additional RPC listener to NameNode) and
HDFS-13547 (Add ingress port based sasl resolver) help address this issue?
I imagine some hack can be developed along the line, but I don't know if
that use case is considered in the design.

Best,
Wei-Chiu


Re: [DISCUSS] Hadoop RPC encryption performance improvements

2018-12-05 Thread Wei-Chiu Chuang
Thanks Daryn for your work. I saw you filed an upstream jira, HADOOP-15977
<https://issues.apache.org/jira/browse/HADOOP-15977>, and uploaded some
patches for review.
I'm watching the jira and will review as fast as I can.

Best


On Wed, Oct 31, 2018 at 7:39 AM Daryn Sharp  wrote:

> Various KMS tasks have been delaying my RPC encryption work – which is 2nd
> on TODO list.  It's becoming a top priority for us so I'll try my best to
> get a preliminary netty server patch (sans TLS) up this week if that helps.
>
> The two cited jiras had some critical flaws.  Skimming my comments, both
> use blocking IO (obvious nonstarter).  HADOOP-10768 is a hand rolled
> TLS-like encryption which I don't feel is something the community can or
> should maintain from a security standpoint.
>
> Daryn
>
> On Wed, Oct 31, 2018 at 8:43 AM Wei-Chiu Chuang 
> wrote:
>
>> Ping. Any one? Cloudera is interested in moving forward with the RPC
>> encryption improvements, but I just like to get a consensus which approach
>> to go with.
>>
>> Otherwise I'll pick HADOOP-10768 since it's ready for commit, and I've
>> spent time on testing it.
>>
>> On Thu, Oct 25, 2018 at 11:04 AM Wei-Chiu Chuang 
>> wrote:
>>
>> > Folks,
>> >
>> > I would like to invite all to discuss the various Hadoop RPC encryption
>> > performance improvements. As you probably know, Hadoop RPC encryption
>> > currently relies on Java SASL, and have _really_ bad performance (in
>> terms
>> > of number of RPCs per second, around 15~20% of the one without SASL)
>> >
>> > There have been some attempts to address this, most notably,
>> HADOOP-10768
>> > <https://issues.apache.org/jira/browse/HADOOP-10768> (Optimize Hadoop
>> RPC
>> > encryption performance) and HADOOP-13836
>> > <https://issues.apache.org/jira/browse/HADOOP-13836> (Securing Hadoop
>> RPC
>> > using SSL). But it looks like both attempts have not been progressing.
>> >
>> > During the recent Hadoop contributor meetup, Daryn Sharp mentioned he's
>> > working on another approach that leverages Netty for its SSL encryption,
>> > and then integrate Netty with Hadoop RPC so that Hadoop RPC
>> automatically
>> > benefits from netty's SSL encryption performance.
>> >
>> > So there are at least 3 attempts to address this issue as I see it. Do
>> we
>> > have a consensus that:
>> > 1. this is an important problem
>> > 2. which approach we want to move forward with
>> >
>> > --
>> > A very happy Hadoop contributor
>> >
>>
>>
>> --
>> A very happy Hadoop contributor
>>
>
>
> --
>
> Daryn
>


[jira] [Created] (HDFS-14126) DataNode DirectoryScanner holding global lock for too long

2018-12-04 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-14126:
--

 Summary: DataNode DirectoryScanner holding global lock for too long
 Key: HDFS-14126
 URL: https://issues.apache.org/jira/browse/HDFS-14126
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0
Reporter: Wei-Chiu Chuang


I've got a Hadoop 3-based cluster set up, and this DN has just 434 thousand 
blocks.

And yet, DirectoryScanner holds the fsdataset lock for 2.7 seconds:

{quote}
2018-12-03 21:33:09,130 INFO 
org.apache.hadoop.hdfs.server.datanode.DirectoryScanner: BlockPool 
BP-4588049-10.17.XXX-XX-281857726 Total blocks: 434401, missing metadata 
files:0, missing block files:0, missing blocks in memory:0, mismatched blocks:0
2018-12-03 21:33:09,131 WARN 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Lock held 
time above threshold: lock identifier: 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl 
lockHeldTimeMs=2710 ms. Suppressed 0 lock warnings. The stack trace is: 
java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:1032)
org.apache.hadoop.util.InstrumentedLock.logWarning(InstrumentedLock.java:148)
org.apache.hadoop.util.InstrumentedLock.check(InstrumentedLock.java:186)
org.apache.hadoop.util.InstrumentedLock.unlock(InstrumentedLock.java:133)
org.apache.hadoop.util.AutoCloseableLock.release(AutoCloseableLock.java:84)
org.apache.hadoop.util.AutoCloseableLock.close(AutoCloseableLock.java:96)
org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.scan(DirectoryScanner.java:473)
org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.reconcile(DirectoryScanner.java:373)
org.apache.hadoop.hdfs.server.datanode.DirectoryScanner.run(DirectoryScanner.java:318)
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)
{quote}

Log messages like this repeat every several hours (six, to be exact). I am not 
sure if this is a performance regression, or just the fact that the lock 
information is printed in Hadoop 3. [~vagarychen] or [~templedf], do you know?

There's no log in DN to indicate any sort of JVM GC going on. Plus, the DN's 
heap size is set to several GB.
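The warning above comes from Hadoop's lock instrumentation. A generic, self-contained sketch of that pattern is below; the names are illustrative and do not match Hadoop's InstrumentedLock/AutoCloseableLock API:

```java
import java.util.concurrent.locks.ReentrantLock;

/** Sketch of an instrumented lock: time each hold, warn past a threshold. */
public class TimedLock implements AutoCloseable {
    private final ReentrantLock lock = new ReentrantLock();
    private final long thresholdMs;
    private long acquiredAtMs;
    private volatile long lastHeldMs = -1;

    public TimedLock(long thresholdMs) { this.thresholdMs = thresholdMs; }

    /** Blocks until the lock is held; returns this for try-with-resources. */
    public TimedLock acquire() {
        lock.lock();
        acquiredAtMs = System.currentTimeMillis();
        return this;
    }

    /** Releases the lock and logs a warning if it was held too long. */
    @Override
    public void close() {
        lastHeldMs = System.currentTimeMillis() - acquiredAtMs;
        lock.unlock();
        if (lastHeldMs > thresholdMs) {
            System.out.println("Lock held time above threshold: lockHeldTimeMs="
                    + lastHeldMs + " ms");
        }
    }

    public long lastHeldMs() { return lastHeldMs; }

    public static void main(String[] args) {
        TimedLock l = new TimedLock(1000);
        try (TimedLock held = l.acquire()) {
            // critical section; DirectoryScanner#scan would run here
        }
        System.out.println("held for " + l.lastHeldMs() + " ms");
    }
}
```

The try-with-resources shape mirrors how the stack trace above shows the warning being emitted from AutoCloseableLock.close().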






[jira] [Resolved] (HDFS-8533) Mismatch in displaying the "MissingBlock" count in fsck and in other metric reports

2018-12-03 Thread Wei-Chiu Chuang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-8533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-8533.
---
Resolution: Duplicate

I'm going to resolve this jira because HDFS-10213 fixed the bug 
unintentionally, and HDFS-13999 is fixing it for 2.7.x.

Will commit HDFS-13999 in a short bit.

> Mismatch in displaying the "MissingBlock" count in fsck and in other metric 
> reports
> ---
>
> Key: HDFS-8533
> URL: https://issues.apache.org/jira/browse/HDFS-8533
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: J.Andreina
>Assignee: J.Andreina
>Priority: Critical
>
> Number of DN = 2
> Step 1: Write a file with replication factor 3.
> Step 2: Corrupt a replica in DN1.
> Step 3: DN2 is down.
> Missing block count in the reports is as follows:
> Fsck report: *0*
> Jmx, "dfsadmin -report" , UI, logs : *1*
> In fsck, only blocks whose replicas are all missing and have not been 
> corrupted are counted.
> {code}
> if (totalReplicasPerBlock == 0 && !isCorrupt) {
> // If the block is corrupted, it means all its available replicas are
> // corrupted. We don't mark it as missing given these available 
> replicas
> // might still be accessible as the block might be incorrectly marked 
> as
> // corrupted by client machines.
> {code}
> While in other reports, even if all the replicas are corrupted, the block is 
> considered missing.
> Please provide your thoughts: can we make the missing block count consistent 
> across all the reports, the same as implemented for fsck?






[jira] [Created] (HDFS-14122) KMS Benchmark Tool

2018-12-03 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-14122:
--

 Summary: KMS Benchmark Tool
 Key: HDFS-14122
 URL: https://issues.apache.org/jira/browse/HDFS-14122
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Wei-Chiu Chuang


We've been working on several pieces of KMS improvement work. One thing that's 
missing so far is a good benchmark tool for KMS, similar to 
NNThroughputBenchmark.

Some requirements I have in mind:
# it should be a standalone benchmark tool, requiring only KMS and a benchmark 
client. No NameNode or DataNode should be involved.
# specify the type of KMS request sent by the client, e.g., generate_eek, 
decrypt_eek, reencrypt_eek
# optionally specify the number of threads sending KMS requests.

File this jira to gather more requirements. Thoughts? [~knanasi] [~xyao] 
[~daryn]






Re: RPC connect error when using kerberos Auth

2018-11-28 Thread Wei-Chiu Chuang
Not sure about integrity -- I've seen very few, if any, installations with the
integrity option enabled.
Regarding privacy -- have you made sure both client and server enabled SASL
privacy? Both sides must have consistent RPC settings for them to talk.
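As a sketch of the "consistent settings" point, both the client's and the server's core-site.xml need a matching (or at least overlapping) quality-of-protection value; hadoop.rpc.protection accepts a comma-separated list, which is what makes mixed-mode transitions possible. The values below are illustrative:

```xml
<!-- core-site.xml on BOTH client and server (illustrative) -->
<property>
  <name>hadoop.rpc.protection</name>
  <!-- one or more of: authentication, integrity, privacy -->
  <value>privacy,authentication</value>
</property>
```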

On Wed, Nov 28, 2018 at 12:58 AM ZongtianHou 
wrote:

> Hi,everyone
> I am using an HDFS client API to access a secured HDFS cluster.
> Kerberos has been set up successfully. When the configuration of
> hadoop.rpc.protection in core-site.xml is set to authentication, it works
> well. However, when it is set to integrity or privacy, the namenode cannot
> be connected, and the namenode log gives the following error. Does anyone
> know what the info means and what lib is needed for connections in
> integrity and privacy mode? Any hint will be much appreciated!
>
> 2018-11-27 17:14:05,270 WARN SecurityLogger.org.apache.hadoop.ipc.Server:
> Auth failed for 127.0.0.1:50769:null (Problem with callback handler)
> 2018-11-27 17:14:05,270 INFO org.apache.hadoop.ipc.Server: Socket Reader
> #1 for port 8020: readAndProcess from client 127.0.0.1 threw exception
> [javax.security.sasl.SaslException: Problem with callback handler [Caused
> by javax.security.sasl.SaslException: Client selected unsupported
> protection: 1]]
>
>
>
> -
> To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
>
>


Re: [DISCUSS] Hadoop RPC encryption performance improvements

2018-11-02 Thread Wei-Chiu Chuang
Thanks all for the inputs,

To offer additional information (while Daryn is working on his stuff),
optimizing RPC encryption opens up another possibility: migrating KMS
service to use Hadoop RPC.

Today's KMS uses HTTPS + a REST API, much like webhdfs. It has very 
undesirable performance (a few thousand ops per second) compared to the 
NameNode. Unfortunately, for each NameNode namespace operation you also need 
to access the KMS.

Migrating KMS to Hadoop RPC would greatly improve its performance (if
implemented correctly), and RPC encryption would be a prerequisite. So
please keep that in mind when discussing the Hadoop RPC encryption
improvements. Cloudera is very interested in helping with the Hadoop RPC
encryption project because a lot of our customers use at-rest
encryption, and some of them are starting to hit the KMS performance limit.

This whole "migrating KMS to Hadoop RPC" idea was Daryn's. I heard it at
the meetup and I am very thrilled to see this happening, because it is a
real issue bothering some of our customers, and I suspect it is the right
solution to address this tech debt.

On Fri, Nov 2, 2018 at 1:21 PM Todd Lipcon 
wrote:

> One possibility (which we use in Kudu) is to use SSL for encryption but
> with a self-signed certificate, maintaining the existing SASL/GSSAPI
> handshake for authentication. The one important bit here, security wise, is
> to implement channel binding (RFC 5056 and RFC 5929) to prevent against
> MITMs. The description of the Kudu protocol is here:
>
> https://github.com/apache/kudu/blob/master/docs/design-docs/rpc.md#wire-protocol
>
> If implemented correctly, this provides TLS encryption (with all of its
> performance and security benefits) without requiring the user to deploy a
> custom cert.
>
> -Todd
>
> On Thu, Nov 1, 2018 at 7:14 PM Konstantin Shvachko 
> wrote:
>
> > Hi Wei-Chiu,
> >
> > Thanks for starting the thread and summarizing the problem. Sorry for
> slow
> > response.
> > We've been looking at the encrypted performance as well and are
> interested
> > in this effort.
> > We ran some benchmarks locally. Our benchmarks also showed substantial
> > penalty for turning on wire encryption on rpc.
> > Although it was less drastic - more in the range of -40%. But we ran a
> > different benchmark NNThroughputBenchmark, and we ran it on 2.6 last
> year.
> > Could have published the results, but need to rerun on more recent
> > versions.
> >
> > Three points from me on this discussion:
> >
> > 1. We should settle on the benchmarking tools.
> > For development RPCCallBenchmark is good as it measures directly the
> > improvement on the RPC layer. But for external consumption it is more
> > important to know about e.g. NameNode RPCs performance. So we probably
> > should run both benchmarks.
> > 2. SASL vs SSL.
> > Since current implementation is based on SASL, I think it would make
> sense
> > to make improvements in this direction. I assume switching to SSL would
> > require changes in configuration. Not sure if it will be compatible,
> since
> > we don't have the details. At this point I would go with HADOOP-10768.
> > Given all (Daryn's) concerns are addressed.
> > 3. Performance improvement expectations.
> > Ideally we want to have < 10% penalty for encrypted communication.
> Anything
> > over 30% will probably have very limited usability. And there is the gray
> > area in between, which could be mitigated by allowing mixed encrypted and
> > un-encrypted RPCs on the single NameNode like in HDFS-13566.
> >
> > Thanks,
> > --Konstantin
> >
> > On Wed, Oct 31, 2018 at 7:39 AM Daryn Sharp 
> > wrote:
> >
> > > Various KMS tasks have been delaying my RPC encryption work – which is
> > 2nd
> > > on TODO list.  It's becoming a top priority for us so I'll try my best
> to
> > > get a preliminary netty server patch (sans TLS) up this week if that
> > helps.
> > >
> > > The two cited jiras had some critical flaws.  Skimming my comments,
> both
> > > use blocking IO (obvious nonstarter).  HADOOP-10768 is a hand rolled
> > > TLS-like encryption which I don't feel is something the community can
> or
> > > should maintain from a security standpoint.
> > >
> > > Daryn
> > >
> > > On Wed, Oct 31, 2018 at 8:43 AM Wei-Chiu Chuang 
> > > wrote:
> > >
> > > > Ping. Any one? Cloudera is interested in moving forward with the RPC
> > > > encryption improvements, but I just like to get a consensus which
> > > approach
> > > > to go with.
> > > >
> > > > Otherwise I'll

Re: [DISCUSS] Hadoop RPC encryption performance improvements

2018-10-31 Thread Wei-Chiu Chuang
Ping. Anyone? Cloudera is interested in moving forward with the RPC
encryption improvements, but I'd just like to get a consensus on which approach
to go with.

Otherwise I'll pick HADOOP-10768 since it's ready for commit, and I've
spent time on testing it.

On Thu, Oct 25, 2018 at 11:04 AM Wei-Chiu Chuang  wrote:

> Folks,
>
> I would like to invite all to discuss the various Hadoop RPC encryption
> performance improvements. As you probably know, Hadoop RPC encryption
> currently relies on Java SASL, and have _really_ bad performance (in terms
> of number of RPCs per second, around 15~20% of the one without SASL)
>
> There have been some attempts to address this, most notably, HADOOP-10768
> <https://issues.apache.org/jira/browse/HADOOP-10768> (Optimize Hadoop RPC
> encryption performance) and HADOOP-13836
> <https://issues.apache.org/jira/browse/HADOOP-13836> (Securing Hadoop RPC
> using SSL). But it looks like both attempts have not been progressing.
>
> During the recent Hadoop contributor meetup, Daryn Sharp mentioned he's
> working on another approach that leverages Netty for its SSL encryption,
> and then integrate Netty with Hadoop RPC so that Hadoop RPC automatically
> benefits from netty's SSL encryption performance.
>
> So there are at least 3 attempts to address this issue as I see it. Do we
> have a consensus that:
> 1. this is an important problem
> 2. which approach we want to move forward with
>
> --
> A very happy Hadoop contributor
>


-- 
A very happy Hadoop contributor


[DISCUSS] Hadoop RPC encryption performance improvements

2018-10-25 Thread Wei-Chiu Chuang
Folks,

I would like to invite all to discuss the various Hadoop RPC encryption
performance improvements. As you probably know, Hadoop RPC encryption
currently relies on Java SASL and has _really_ bad performance (in terms
of the number of RPCs per second, around 15~20% of the throughput without SASL).

There have been some attempts to address this, most notably, HADOOP-10768
 (Optimize Hadoop RPC
encryption performance) and HADOOP-13836
 (Securing Hadoop RPC
using SSL). But it looks like neither attempt has been progressing.

During the recent Hadoop contributor meetup, Daryn Sharp mentioned he's
working on another approach that leverages Netty for its SSL encryption,
integrating Netty with Hadoop RPC so that Hadoop RPC automatically
benefits from Netty's SSL encryption performance.

So there are at least 3 attempts to address this issue as I see it. Do we
have a consensus that:
1. this is an important problem
2. which approach we want to move forward with

-- 
A very happy Hadoop contributor


[jira] [Resolved] (HDFS-14018) Compilation fails in branch-3.0

2018-10-24 Thread Wei-Chiu Chuang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-14018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-14018.

Resolution: Done

> Compilation fails in branch-3.0
> ---
>
> Key: HDFS-14018
> URL: https://issues.apache.org/jira/browse/HDFS-14018
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 3.0.4
>Reporter: Rohith Sharma K S
>Priority: Blocker
>
> HDFS branch-3.0 compilation fails.
> {code}
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) 
> on project hadoop-hdfs: Compilation failure
> [ERROR] 
> /Users/rsharmaks/Repos/Apache/Commit_Repos/branch-3.0/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/security/token/block/BlockTokenSecretManager.java:[306,9]
>  cannot find symbol
> [ERROR]   symbol:   variable ArrayUtils
> [ERROR]   location: class 
> org.apache.hadoop.hdfs.security.token.block.BlockTokenSecretManager
> [ERROR]
> {code}






HDFS Native tests are broken

2018-10-23 Thread Wei-Chiu Chuang
Folks,
Looks like a bunch of native tests are failing consistently. Looking back,
the earliest failure I saw was from a precommit job for HDFS-1915 on
October 3rd. Some tests failed miserably with JVM crash errors.

@Pranay Singh is trying to figure out what's going
on there (HDFS-14022). Please shout out if you suspect anything that might
have broken the tests. This is impeding our ability to check in any native
code changes.

Thank you
-- 
A very happy Hadoop contributor


[jira] [Reopened] (HDFS-13941) make storageId in BlockPoolTokenSecretManager.checkAccess optional

2018-10-23 Thread Wei-Chiu Chuang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reopened HDFS-13941:


Hello [~ajayydv], thanks for your patch. It looks like branch-3.0 doesn't 
compile after this commit. I am going to revert the branch-3.0 commit to 
unblock other committers.

> make storageId in BlockPoolTokenSecretManager.checkAccess optional
> --
>
> Key: HDFS-13941
> URL: https://issues.apache.org/jira/browse/HDFS-13941
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Ajay Kumar
>Assignee: Ajay Kumar
>Priority: Major
> Fix For: 3.2.0, 3.0.4, 3.1.2, 3.3.0
>
> Attachments: HDFS-13941.00.patch, HDFS-13941.01.patch, 
> HDFS-13941.02.patch
>
>
> Change in {{BlockPoolTokenSecretManager.checkAccess}} by 
> [HDFS-9807|https://issues.apache.org/jira/browse/HDFS-9807] breaks backward 
> compatibility for applications using the private API (we've run into such 
> apps).
> Although there is no compatibility guarantee for the private interface, we 
> can restore the original version of checkAccess as an overload.






Re: Hadoop 3.2 Release Plan proposal

2018-10-19 Thread Wei-Chiu Chuang
Thanks Sunil G for driving the release.
I filed HADOOP-15866  for
a compat fix. If anyone has cycles, please review it, as I think it is
needed for 3.2.0.

On Thu, Oct 18, 2018 at 4:43 AM Sunil G  wrote:

> Hi Folks,
>
> As we previously communicated for 3.2.0 release, we have delayed due to few
> blockers in our gate.
>
> I just cut branch-3.2.0 for release purpose. branch-3.2 will be open for
> all bug fixes.
>
> - Sunil
>
>
> On Tue, Oct 16, 2018 at 8:59 AM Sunil G  wrote:
>
> > Hi Folks,
> >
> > We are now close to RC as other blocker issues are now merged to trunk
> and
> > branch-3.2. Last 2 critical issues are closer to merge and will be
> > committed in few hours.
> > With this, I will be creating 3.2.0 branch today and will go ahead with
> RC
> > related process.
> >
> > - Sunil
> >
> > On Mon, Oct 15, 2018 at 11:43 PM Jonathan Bender 
> > wrote:
> >
> >> Hello, were there any updates around the 3.2.0 RC timing? All I see in
> >> the current blockers is related to the new Submarine subproject; I wasn't
> >> sure if that is what is holding things up.
> >>
> >> Cheers,
> >> Jon
> >>
> >> On Tue, Oct 2, 2018 at 7:13 PM, Sunil G  wrote:
> >>
> >>> Thanks Robert and Haibo for quickly correcting same.
> >>> Sigh, I somehow missed one file while committing the change. Sorry for
> >>> the
> >>> trouble.
> >>>
> >>> - Sunil
> >>>
> >>> On Wed, Oct 3, 2018 at 5:22 AM Robert Kanter 
> >>> wrote:
> >>>
> >>> > Looks like there's two that weren't updated:
> >>> > >> [115] 16:32 : hadoop-common (trunk) :: grep "3.2.0-SNAPSHOT" . -r
> >>> > --include=pom.xml
> >>> > ./hadoop-project/pom.xml:
> >>> > 3.2.0-SNAPSHOT
> >>> > ./pom.xml:3.2.0-SNAPSHOT
> >>> >
> >>> > I've just pushed in an addendum commit to fix those.
> >>> > In the future, please make sure to do a sanity compile when updating
> >>> poms.
> >>> >
> >>> > thanks
> >>> > - Robert
> >>> >
> >>> > On Tue, Oct 2, 2018 at 11:44 AM Aaron Fabbri
> >>> 
> >>> > wrote:
> >>> >
> >>> >> Trunk is not building for me.. Did you miss a 3.2.0-SNAPSHOT in the
> >>> >> top-level pom.xml?
> >>> >>
> >>> >>
> >>> >> On Tue, Oct 2, 2018 at 10:16 AM Sunil G  wrote:
> >>> >>
> >>> >> > Hi All
> >>> >> >
> >>> >> > As mentioned in earlier mail, I have cut branch-3.2 and reset
> trunk
> >>> to
> >>> >> > 3.3.0-SNAPSHOT. I will share the RC details sooner once all
> >>> necessary
> >>> >> > patches are pulled into branch-3.2.
> >>> >> >
> >>> >> > Thank You
> >>> >> > - Sunil
> >>> >> >
> >>> >> >
> >>> >> > On Mon, Sep 24, 2018 at 2:00 PM Sunil G 
> wrote:
> >>> >> >
> >>> >> > > Hi All
> >>> >> > >
> >>> >> > > We are now down to the last Blocker and HADOOP-15407 is merged
> to
> >>> >> trunk.
> >>> >> > > Thanks for the support.
> >>> >> > >
> >>> >> > > *Plan for RC*
> >>> >> > > 3.2 branch cut and reset trunk : *25th Tuesday*
> >>> >> > > RC0 for 3.2: *28th Friday*
> >>> >> > >
> >>> >> > > Thank You
> >>> >> > > Sunil
> >>> >> > >
> >>> >> > >
> >>> >> > > On Mon, Sep 17, 2018 at 3:21 PM Sunil G 
> >>> wrote:
> >>> >> > >
> >>> >> > >> Hi All
> >>> >> > >>
> >>> >> > >> We are down to 3 Blockers and 4 Critical now. Thanks all of you
> >>> for
> >>> >> > >> helping in this. I am following up on these tickets, once its
> >>> closed
> >>> >> we
> >>> >> > >> will cut the 3.2 branch.
> >>> >> > >>
> >>> >> > >> Thanks
> >>> >> > >> Sunil Govindan
> >>> >> > >>
> >>> >> > >>
> >>> >> > >> On Wed, Sep 12, 2018 at 5:10 PM Sunil G 
> >>> wrote:
> >>> >> > >>
> >>> >> > >>> Hi All,
> >>> >> > >>>
> >>> >> > >>> Inline with the original 3.2 communication proposal dated 17th
> >>> July
> >>> >> > >>> 2018, I would like to provide more updates.
> >>> >> > >>>
> >>> >> > >>> We are approaching previously proposed code freeze date
> >>> (September
> >>> >> 14,
> >>> >> > >>> 2018). So I would like to cut 3.2 branch on 17th Sept and
> point
> >>> >> > existing
> >>> >> > >>> trunk to 3.3 if there are no issues.
> >>> >> > >>>
> >>> >> > >>> *Current Release Plan:*
> >>> >> > >>> Feature freeze date : all features to merge by September 7,
> >>> 2018.
> >>> >> > >>> Code freeze date : blockers/critical only, no improvements and
> >>> >> > >>> blocker/critical bug-fixes September 14, 2018.
> >>> >> > >>> Release date: September 28, 2018
> >>> >> > >>>
> >>> >> > >>> Any critical/blocker tickets targeted to 3.2.0 will need to be
> >>> >> > >>> backported to branch-3.2 after the branch cut.
> >>> >> > >>>
> >>> >> > >>> Here's an updated 3.2.0 feature status:
> >>> >> > >>>
> >>> >> > >>> 1. Merged & Completed features:
> >>> >> > >>>
> >>> >> > >>> - (Wangda) YARN-8561: Hadoop Submarine project for
> DeepLearning
> >>> >> > >>> workloads Initial cut.
> >>> >> > >>> - (Uma) HDFS-10285: HDFS Storage Policy Satisfier
> >>> >> > >>> - (Sunil) YARN-7494: Multi Node scheduling support in Capacity
> >>> >> > >>> Scheduler.
> >>> >> > >>> - (Chandni/Eric) YARN-7512: Support service upgrade via YARN
> >>> Service
> >>> >> > API

[jira] [Created] (HDFS-13999) Bogus missing block warning if the file is under construction when NN starts

2018-10-16 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13999:
--

 Summary: Bogus missing block warning if the file is under 
construction when NN starts
 Key: HDFS-13999
 URL: https://issues.apache.org/jira/browse/HDFS-13999
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.6.0
Reporter: Wei-Chiu Chuang
Assignee: Wei-Chiu Chuang
 Attachments: webui missing blocks.png

We found an interesting case where the web UI displays a few missing blocks but 
doesn't state which files are corrupt. What's more, fsck reports the file 
system as healthy. This bug is similar to HDFS-10827 and HDFS-8533. 

 (See the attachment for an example)

Using Dynamometer, I was able to reproduce the bug, and realized that the 
"missing" blocks are actually healthy, but somehow neededReplications doesn't 
get updated when the NN receives block reports. What's more interesting is that 
the files associated with the "missing" blocks are under construction when the 
NN starts, so after a while the NN prints file recovery logs.

Given that, I determined the following code is the source of bug:
{code:java|title=BlockManager#addStoredBlock}

// if file is under construction, then done for now
if (bc.isUnderConstruction()) {
  return storedBlock;
}
{code}
which is wrong, because a file may have multiple blocks, and the first block 
may already be complete. In that case, the neededReplications structure doesn't 
get updated for the first block, hence the missing block warning on the web UI. 
More appropriately, the code should check the state of the block itself, not 
the file.

Fortunately, it was unintentionally fixed via HDFS-9754:
{code:java}
// if block is still under construction, then done for now
if (!storedBlock.isCompleteOrCommitted()) {
  return storedBlock;
}
{code}
We should bring this fix into branch-2.7 too. That said, this is a harmless 
warning; it should go away after the under-construction files are recovered and 
the NN restarts (or after forcing full block reports).

Kudos to Dynamometer! It would be impossible to reproduce this bug without the 
tool. And thanks [~smeng] for helping with the reproduction.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org



[DISCUSS] Deprecate fuse-dfs

2018-10-01 Thread Wei-Chiu Chuang
Hi fellow Hadoop developers,

I want to start this thread to raise awareness of the quality of fuse-dfs. It
appears that this sub-component is not being developed or maintained, and that
not many people are using it.

In the past two years, there has been only one bug fixed (HDFS-13322).


https://issues.apache.org/jira/issues/?jql=project%20in%20(HADOOP%2C%20HDFS)%20AND%20text%20~%20fuse%20ORDER%20BY%20created%20DESC%2C%20updated%20DESC

It doesn't support keytab login, ACL permissions, rename, ... a number of
POSIX semantics. We also recently realized fuse-dfs doesn't work under
heavyweight workloads (think running SQL applications on it).

So what's the status now? Is anyone still using fuse-dfs in production?
Should we start the deprecation process? Or at least document that it is
not meant for anything beyond simple data transfer? IIRC vim would even
complain if you try to edit a file in a fuse-dfs directory.
-- 
A very happy Hadoop contributor


Re: NN run progressively slower

2018-09-25 Thread Wei-Chiu Chuang
Yiqun,
Is this related to HDFS-9260?
Note that HDFS-9260 has been backported into CDH 5.7 and above.

I'm interested to learn more. Did you observe clients failing to close file
due to insufficient number of block replicas? Did NN fail over?
Did you have gc logging enabled? Any chance to take a heap dump and analyze
what's in there?

There were quite a few NN scalability and GC improvements in the CDH 5.5 ~
CDH 5.8 time frame. We have customers at or beyond your scale on your version,
but I don't think I've heard of similar symptoms.

Regards

On Tue, Sep 25, 2018 at 2:04 AM Lin,Yiqun(vip.com) 
wrote:

> Hi hdfs developers:
>
> We hit a bad problem after rolling-upgrading our Hadoop version from
> 2.5.0-cdh5.3.2 to 2.6.0-cdh5.13.1. The problem is that the NN runs slow
> periodically (roughly weekly). Concretely, if we start the NN on Monday it
> runs fast, but by the weekend our cluster becomes very slow.
>
> In the beginning, we thought this might be caused by FSN lock contention,
> and we made some improvements for it, e.g. making the block removal interval
> configurable and printing the FSN lock elapsed time. After this, the problem
> still exists, :(. So we suspect this may not be an HDFS RPC problem.
>
> Finally we found a related phenomenon: every time the NN runs slow, its old
> gen reaches a high value, around 100GB. Actually, the NN's total metadata
> size is just around 40GB in our cluster. As a temporary solution, we reduced
> the heap space to trigger full GCs more frequently. It now looks better than
> before, but we haven't found the root cause. We are not sure whether this is
> a JVM tuning problem or an HDFS bug.
>
> Has anyone met a similar problem on this version? Why would the NN old gen
> space increase so greatly?
>
> Some information of our env:
> JDK1.8
> 500+ Nodes, 150 million blocks, around 40GB metadata size will be used.
>
> Appreciate if anyone who can share your comments.
>
> Thanks
> Yiqun.
>


-- 
A very happy Clouderan


[jira] [Resolved] (HDFS-13830) Backport HDFS-13141 to branch-3.0: WebHDFS: Add support for getting snasphottable directory list

2018-09-21 Thread Wei-Chiu Chuang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-13830.

   Resolution: Fixed
Fix Version/s: 3.0.4

Pushed to branch-3.0. Thanks [~smeng]!

> Backport HDFS-13141 to branch-3.0: WebHDFS: Add support for getting 
> snasphottable directory list
> 
>
> Key: HDFS-13830
> URL: https://issues.apache.org/jira/browse/HDFS-13830
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: webhdfs
>Affects Versions: 3.0.3
>Reporter: Siyao Meng
>Assignee: Siyao Meng
>Priority: Major
> Fix For: 3.0.4
>
> Attachments: HDFS-13830.branch-3.0.001.patch, 
> HDFS-13830.branch-3.0.002.patch, HDFS-13830.branch-3.0.003.patch, 
> HDFS-13830.branch-3.0.004.patch
>
>
> HDFS-13141 conflicts with 3.0.3 because of interface change in HdfsFileStatus.
> This Jira aims to backport the WebHDFS getSnapshottableDirListing() support 
> to branch-3.0.






[jira] [Reopened] (HDFS-13831) Make block increment deletion number configurable

2018-08-28 Thread Wei-Chiu Chuang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang reopened HDFS-13831:


Sorry to reopen. There are minor code conflicts in branch-3.0. Will attach 
branch-3.0 patch for recommit check.

> Make block increment deletion number configurable
> -
>
> Key: HDFS-13831
> URL: https://issues.apache.org/jira/browse/HDFS-13831
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Affects Versions: 3.1.0
>Reporter: Yiqun Lin
>Assignee: Ryan Wu
>Priority: Major
> Fix For: 2.10.0, 3.2.0, 3.1.2
>
> Attachments: HDFS-13831.001.patch, HDFS-13831.002.patch, 
> HDFS-13831.003.patch, HDFS-13831.004.patch
>
>
> When the NN deletes a large directory, it holds the write lock for a long 
> time. To improve this, we remove the blocks in batches, so that other 
> waiters have a chance to get the lock. But right now, the batch size is a 
> hard-coded value.
> {code}
>   static int BLOCK_DELETION_INCREMENT = 1000;
> {code}
> We can make this value configurable, so that we can control how often other 
> waiters get a chance to acquire the lock. 






[jira] [Resolved] (HDFS-3584) Blocks are getting marked as corrupt with append operation under high load.

2018-08-21 Thread Wei-Chiu Chuang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-3584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-3584.
---
  Resolution: Duplicate
Target Version/s:   (was: )

Closing this one. Thanks for the reminder, [~elgoiri].

> Blocks are getting marked as corrupt with append operation under high load.
> ---
>
> Key: HDFS-3584
> URL: https://issues.apache.org/jira/browse/HDFS-3584
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.0.0-alpha
>Reporter: Brahma Reddy Battula
>Priority: Major
>
> Scenario:
> = 
> 1. There are two clients, cli1 and cli2. cli1 writes a file F1 and does not 
> close it.
> 2. cli2 calls append on the unclosed file, which triggers a lease recovery.
> 3. cli1 closes the file.
> 4. Lease recovery completes with an updated GS on the DN; when the block 
> report arrives, the GS mismatch causes the block to be marked corrupt.
> 5. A subsequent CommitBlockSync also fails, since the file was already 
> closed by cli1 and its state in the NN is Finalized.






[jira] [Resolved] (HDFS-8718) Block replicating cannot work after upgrading to 2.7

2018-08-09 Thread Wei-Chiu Chuang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-8718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-8718.
---
Resolution: Duplicate

Thanks [~hexiaoqiao] for identifying the duplicates and [~jiangbinglover] for 
reporting this issue.

I'll go ahead and close this Jira as a dup of HDFS-10453.

> Block replicating cannot work after upgrading to 2.7 
> -
>
> Key: HDFS-8718
> URL: https://issues.apache.org/jira/browse/HDFS-8718
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Bing Jiang
>Priority: Major
>
> After decommissioning a datanode from Hadoop, HDFS calculates the correct 
> number of blocks to be replicated, as shown on the web UI. 
> {code}
> Decomissioning
> Node  Last contactUnder replicated blocks Blocks with no live replicas
> Under Replicated Blocks 
> In files under construction
> TS-BHTEST-03:50010 (172.22.49.3:50010)25641   0   0
> {code}
> From the NN's log, block replication cannot proceed due to an inconsistent 
> expected storage type.
> {code}
> Node /default/rack_02/172.22.49.5:50010 [
>   Storage 
> [DISK]DS-3915533b-4ae4-4806-bf83caf1446f1e2f:NORMAL:172.22.49.5:50010 is not 
> chosen since storage types do not match, where the required storage type is 
> ARCHIVE.
>   Storage 
> [DISK]DS-3e54c331-3eaf-4447-b5e4-9bf91bc71b17:NORMAL:172.22.49.5:50010 is not 
> chosen since storage types do not match, where the required storage type is 
> ARCHIVE.
>   Storage 
> [DISK]DS-d44fa611-aa73-4415-a2de-7e73c9c5ea68:NORMAL:172.22.49.5:50010 is not 
> chosen since storage types do not match, where the required storage type is 
> ARCHIVE.
>   Storage 
> [DISK]DS-cebbf410-06a0-4171-a9bd-d0db55dad6d3:NORMAL:172.22.49.5:50010 is not 
> chosen since storage types do not match, where the required storage type is 
> ARCHIVE.
>   Storage 
> [DISK]DS-4c50b1c7-eaad-4858-b476-99dec17d68b5:NORMAL:172.22.49.5:50010 is not 
> chosen since storage types do not match, where the required storage type is 
> ARCHIVE.
>   Storage 
> [DISK]DS-f6cf9123-4125-4234-8e21-34b12170e576:NORMAL:172.22.49.5:50010 is not 
> chosen since storage types do not match, where the required storage type is 
> ARCHIVE.
>   Storage 
> [DISK]DS-7601b634-1761-45cc-9ffd-73ee8687c2a7:NORMAL:172.22.49.5:50010 is not 
> chosen since storage types do not match, where the required storage type is 
> ARCHIVE.
>   Storage 
> [DISK]DS-1d4b91ab-fe2f-4d5f-bd0a-57e9a0714654:NORMAL:172.22.49.5:50010 is not 
> chosen since storage types do not match, where the required storage type is 
> ARCHIVE.
>   Storage 
> [DISK]DS-cd2279cf-9c5a-4380-8c41-7681fa688eaf:NORMAL:172.22.49.5:50010 is not 
> chosen since storage types do not match, where the required storage type is 
> ARCHIVE.
>   Storage 
> [DISK]DS-630c734f-334a-466d-9649-4818d6e91181:NORMAL:172.22.49.5:50010 is not 
> chosen since storage types do not match, where the required storage type is 
> ARCHIVE.
>   Storage 
> [DISK]DS-31cd0d68-5f7c-4a0a-91e6-afa53c4df820:NORMAL:172.22.49.5:50010 is not 
> chosen since storage types do not match, where the required storage type is 
> ARCHIVE.
> ]
> 2015-07-07 16:00:22,032 WARN 
> org.apache.hadoop.hdfs.protocol.BlockStoragePolicy: Failed to place enough 
> replicas: expected size is 1 but onl
> y 0 storage types can be selected (replication=3, selected=[], 
> unavailable=[DISK, ARCHIVE], removed=[DISK], policy=BlockStoragePolicy{HOT:7,
>  storageTypes=[DISK], creationFallbacks=[], replicationFallbacks=[ARCHIVE]})
> 2015-07-07 16:00:22,032 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to 
> place enough replicas, still in n
> eed of 1 to reach 3 (unavailableStorages=[DISK, ARCHIVE], 
> storagePolicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], 
> creationFallbacks=[],
>  replicationFallbacks=[ARCHIVE]}, newBlock=false) All required storage types 
> are unavailable:  unavailableStorages=[DISK, ARCHIVE], storageP
> olicy=BlockStoragePolicy{HOT:7, storageTypes=[DISK], creationFallbacks=[], 
> replicationFallbacks=[ARCHIVE]}
> {code}
> We previously upgraded the Hadoop cluster from 2.5 to 2.7.0. I believe the 
> ARCHIVE storage feature is now enforced, but what about the blocks' storage 
> types after upgrading?
> The default BlockStoragePolicy is HOT, and I guess those blocks do not carry 
> the correct BlockStoragePolicy information, so they cannot be handled well.
> After I shut down the datanode, the under-replicated blocks can be asked to 
> copy. So

[jira] [Created] (HDFS-13758) DatanodeManager should throw exception if it has BlockRecoveryCommand but the block is not under construction

2018-07-20 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13758:
--

 Summary: DatanodeManager should throw exception if it has 
BlockRecoveryCommand but the block is not under construction
 Key: HDFS-13758
 URL: https://issues.apache.org/jira/browse/HDFS-13758
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.0.0-alpha1
Reporter: Wei-Chiu Chuang


In Hadoop 3, HDFS-8909 added an assertion assuming that if a 
BlockRecoveryCommand exists for a block, the block is under construction.

 
{code:title=DatanodeManager#getBlockRecoveryCommand()}

  BlockRecoveryCommand brCommand = new BlockRecoveryCommand(blocks.length);
  for (BlockInfo b : blocks) {
BlockUnderConstructionFeature uc = b.getUnderConstructionFeature();
assert uc != null;
...
{code}
This assertion accidentally fixed one of the possible HDFS-10240 data 
corruption scenarios, where a recoverLease() is immediately followed by a 
close(), before DataNodes have the chance to heartbeat.

In a unit test you'll get:
{noformat}
2018-07-19 09:43:41,331 [IPC Server handler 9 on 57890] WARN  ipc.Server 
(Server.java:logException(2724)) - IPC Server handler 9 on 57890, call Call#41 
Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeProtocol.sendHeartbeat 
from 127.0.0.1:57903
java.lang.AssertionError
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.getBlockRecoveryCommand(DatanodeManager.java:1551)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleHeartbeat(DatanodeManager.java:1661)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleHeartbeat(FSNamesystem.java:3865)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendHeartbeat(NameNodeRpcServer.java:1504)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.sendHeartbeat(DatanodeProtocolServerSideTranslatorPB.java:119)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:31660)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1689)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
{noformat}

I propose to change this assertion even though it addresses the data 
corruption, because:
# We should throw a more meaningful exception than an NPE.
# On a production cluster, the assert is ignored, and you'll get a more 
noticeable NPE. Future HDFS developers might "fix" this NPE, causing a 
regression. An NPE is typically not caught and handled, so there's a chance it 
results in internal state inconsistency.
# It doesn't address all possible scenarios of HDFS-10240. A proper fix should 
reject close() if the block is being recovered.
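As an illustrative sketch only (hypothetical names, not a Hadoop patch), the proposed guard amounts to refusing to complete a file while its last block is still being recovered, so the close has to retry instead of racing the recovery:

```java
// Sketch of the proposed fix: completeFile() rejects the close while the last
// block is under recovery, rather than asserting or throwing an NPE.
class CloseGuard {
  enum BlockState { COMPLETE, UNDER_CONSTRUCTION, UNDER_RECOVERY }

  static boolean tryCompleteFile(BlockState lastBlockState) {
    if (lastBlockState == BlockState.UNDER_RECOVERY) {
      // Caller should retry later: completing now could commit a block whose
      // generation stamp the ongoing recovery is about to bump.
      return false;
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println(tryCompleteFile(BlockState.UNDER_RECOVERY)); // false
    System.out.println(tryCompleteFile(BlockState.COMPLETE));       // true
  }
}
```

In the real NameNode, the rejection would surface as a retriable exception to the client rather than a boolean, but the state check is the essential part.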






[jira] [Created] (HDFS-13757) After HDFS-12886, close() can throw AssertionError "Negative replicas!"

2018-07-20 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13757:
--

 Summary: After HDFS-12886, close() can throw AssertionError 
"Negative replicas!"
 Key: HDFS-13757
 URL: https://issues.apache.org/jira/browse/HDFS-13757
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.0.3, 2.9.1, 3.1.0, 2.10.0, 3.2.0
Reporter: Wei-Chiu Chuang


While investigating a data corruption bug caused by concurrent recoverLease() 
and close(), I found HDFS-12886 may cause close() to throw AssertionError in a 
corner case where the block has zero live replicas and the client calls 
recoverLease() immediately followed by close().
{noformat}
org.apache.hadoop.ipc.RemoteException(java.lang.AssertionError): Negative 
replicas!
at 
org.apache.hadoop.hdfs.server.blockmanagement.LowRedundancyBlocks.getPriority(LowRedundancyBlocks.java:197)
at 
org.apache.hadoop.hdfs.server.blockmanagement.LowRedundancyBlocks.update(LowRedundancyBlocks.java:422)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.updateNeededReconstructions(BlockManager.java:4274)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.commitOrCompleteLastBlock(BlockManager.java:1001)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitOrCompleteLastBlock(FSNamesystem.java:3471)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFileInternal(FSDirWriteFileOp.java:713)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.completeFile(FSDirWriteFileOp.java:671)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.completeFile(FSNamesystem.java:2854)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.complete(NameNodeRpcServer.java:928)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.complete(ClientNamenodeProtocolServerSideTranslatorPB.java:607)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1689)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
{noformat}
I have a test case to reproduce it.

[~lukmajercak] [~elgoiri] would you please take a look at it? I think we should 
add a check to reject completeFile() if the block is under recovery, similar to 
what's proposed in HDFS-10240.






[jira] [Created] (HDFS-13738) fsck -list-corruptfileblocks has infinite loop if user is not privileged.

2018-07-16 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13738:
--

 Summary: fsck -list-corruptfileblocks has infinite loop if user is 
not privileged.
 Key: HDFS-13738
 URL: https://issues.apache.org/jira/browse/HDFS-13738
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: tools
Affects Versions: 3.0.0, 2.6.0
 Environment: Kerberized Hadoop cluster
Reporter: Wei-Chiu Chuang


Execute the following commands as any non-privileged user:
{noformat}
# create an empty directory
$ hdfs dfs -mkdir /tmp/fsck_test
# run fsck
$ hdfs fsck /tmp/fsck_test -list-corruptfileblocks
{noformat}

{noformat}
FSCK ended at Mon Jul 16 15:14:03 PDT 2018 in 1 milliseconds
Access denied for user systest. Superuser privilege is required
Fsck on path '/tmp' FAILED
FSCK ended at Mon Jul 16 15:14:03 PDT 2018 in 0 milliseconds
Access denied for user systest. Superuser privilege is required
Fsck on path '/tmp' FAILED
FSCK ended at Mon Jul 16 15:14:03 PDT 2018 in 1 milliseconds
Access denied for user systest. Superuser privilege is required
Fsck on path '/tmp' FAILED
{noformat}

Reproducible on Hadoop 3.0.0 as well as 2.6.0






[jira] [Created] (HDFS-13730) BlockReaderRemote.sendReadResult throws NPE

2018-07-11 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13730:
--

 Summary: BlockReaderRemote.sendReadResult throws NPE
 Key: HDFS-13730
 URL: https://issues.apache.org/jira/browse/HDFS-13730
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs-client
 Environment: Hadoop 3.0.0, HBase 2.0.0 + HBASE-20403.
Reporter: Wei-Chiu Chuang


Found the following exception thrown in an HBase RegionServer log (Hadoop 3.0.0 
+ HBase 2.0.0; the HBase prefetch bug HBASE-20403 was fixed on this cluster, 
but I am not sure if that's related at all):
{noformat}
2018-07-11 11:10:44,462 WARN org.apache.hadoop.hbase.io.hfile.HFileReaderImpl: 
Stream moved/closed or prefetch 
cancelled?path=hdfs://ns1/hbase/data/default/IntegrationTestBigLinkedList_20180711003954/449fa9bf5a7483295493258b5af50abc/meta/e9de0683f8a9413a94183c752bea0ca5,
 offset=216505135,
end=2309991906
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.net.NioInetPeer.getRemoteAddressString(NioInetPeer.java:99)
at 
org.apache.hadoop.hdfs.net.EncryptedPeer.getRemoteAddressString(EncryptedPeer.java:105)
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.sendReadResult(BlockReaderRemote.java:330)
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.readNextPacket(BlockReaderRemote.java:233)
at 
org.apache.hadoop.hdfs.client.impl.BlockReaderRemote.read(BlockReaderRemote.java:165)
at 
org.apache.hadoop.hdfs.DFSInputStream.actualGetFromOneDataNode(DFSInputStream.java:1050)
at 
org.apache.hadoop.hdfs.DFSInputStream.fetchBlockByteRange(DFSInputStream.java:992)
at org.apache.hadoop.hdfs.DFSInputStream.pread(DFSInputStream.java:1348)
at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:1312)
at org.apache.hadoop.crypto.CryptoInputStream.read(CryptoInputStream.java:331)
at org.apache.hadoop.fs.FSDataInputStream.read(FSDataInputStream.java:92)
at 
org.apache.hadoop.hbase.io.hfile.HFileBlock.positionalReadWithExtra(HFileBlock.java:805)
at 
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readAtOffset(HFileBlock.java:1565)
at 
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockDataInternal(HFileBlock.java:1769)
at 
org.apache.hadoop.hbase.io.hfile.HFileBlock$FSReaderImpl.readBlockData(HFileBlock.java:1594)
at 
org.apache.hadoop.hbase.io.hfile.HFileReaderImpl.readBlock(HFileReaderImpl.java:1488)
at 
org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$1.run(HFileReaderImpl.java:278)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){noformat}
The relevant Hadoop code:
{code:java|title=BlockReaderRemote#sendReadResult}
void sendReadResult(Status statusCode) {
  assert !sentStatusCode : "already sent status code to " + peer;
  try {
writeReadResult(peer.getOutputStream(), statusCode);
sentStatusCode = true;
  } catch (IOException e) {
// It's ok not to be able to send this. But something is probably wrong.
LOG.info("Could not send read status (" + statusCode + ") to datanode " +
peer.getRemoteAddressString() + ": " + e.getMessage());
  }
}
{code}
So the NPE was thrown within an exception handler. A possible explanation could 
be that the socket was closed so the client couldn't write, and 
Socket#getRemoteSocketAddress() returns null when the socket is closed.

Suggest checking for null and returning an empty string in
{noformat}
NioInetPeer.getRemoteAddressString
{noformat}
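A hedged sketch of that suggestion (the method below mirrors the role of NioInetPeer#getRemoteAddressString, but this is illustrative code, not the actual Hadoop patch): fall back to an empty string when the socket has no remote address, instead of dereferencing null.

```java
import java.net.Socket;
import java.net.SocketAddress;

// Null-safe remote-address lookup: returns "" instead of throwing an NPE
// when Socket#getRemoteSocketAddress() yields null.
class RemoteAddress {
  static String getRemoteAddressString(Socket socket) {
    SocketAddress addr = socket.getRemoteSocketAddress();
    return addr == null ? "" : addr.toString();
  }

  public static void main(String[] args) {
    // An unconnected socket has no remote address, modeling the failure case.
    System.out.println("[" + getRemoteAddressString(new Socket()) + "]"); // prints "[]"
  }
}
```

Since sendReadResult only logs the address inside an IOException handler, an empty string is a harmless fallback there.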






Re: [VOTE] reset/force push to clean up inadvertent merge commit pushed to trunk

2018-07-05 Thread Wei-Chiu Chuang
I'm sorry I come to this thread late.
Anu commented on INFRA-16727 saying he reverted the commit. Do we still
need the vote?

Thanks

On Thu, Jul 5, 2018 at 2:47 PM Rohith Sharma K S 
wrote:

> +1
>
> On 5 July 2018 at 14:37, Subru Krishnan  wrote:
>
> > Folks,
> >
> > There was a merge commit accidentally pushed to trunk, you can find the
> > details in the mail thread [1].
> >
> > I have raised an INFRA ticket [2] to reset/force push to clean up trunk.
> >
> > Can we have a quick vote for INFRA sign-off to proceed as this is
> blocking
> > all commits?
> >
> > Thanks,
> > Subru
> >
> > [1]
> > http://mail-archives.apache.org/mod_mbox/hadoop-yarn-dev/201807.mbox/%
> > 3CCAHqguubKBqwfUMwhtJuSD7X1Bgfro_P6FV%2BhhFhMMYRaxFsF9Q%
> > 40mail.gmail.com%3E
> > [2] https://issues.apache.org/jira/browse/INFRA-16727
> >
>
> --
> A very happy Hadoop contributor
>


[jira] [Created] (HDFS-13672) clearCorruptLazyPersistFiles could crash NameNode

2018-06-12 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13672:
--

 Summary: clearCorruptLazyPersistFiles could crash NameNode
 Key: HDFS-13672
 URL: https://issues.apache.org/jira/browse/HDFS-13672
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Wei-Chiu Chuang


I started a NameNode with a pretty large fsimage. Since the NameNode was 
started without any DataNodes, all blocks (100 million) are "corrupt".

Afterwards I observed that FSNamesystem#clearCorruptLazyPersistFiles() held the 
write lock for a long time:

{noformat}
18/06/12 12:37:03 INFO namenode.FSNamesystem: FSNamesystem write lock held for 
46024 ms via
java.lang.Thread.getStackTrace(Thread.java:1559)
org.apache.hadoop.util.StringUtils.getStackTrace(StringUtils.java:945)
org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:198)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.writeUnlock(FSNamesystem.java:1689)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$LazyPersistFileScrubber.clearCorruptLazyPersistFiles(FSNamesystem.java:5532)
org.apache.hadoop.hdfs.server.namenode.FSNamesystem$LazyPersistFileScrubber.run(FSNamesystem.java:5543)
java.lang.Thread.run(Thread.java:748)
Number of suppressed write-lock reports: 0
Longest write-lock held interval: 46024
{noformat}

Here's the relevant code:

{code}
  writeLock();

  try {
final Iterator<Block> it =
    blockManager.getCorruptReplicaBlockIterator();

while (it.hasNext()) {
  Block b = it.next();
  BlockInfo blockInfo = blockManager.getStoredBlock(b);
  if (blockInfo.getBlockCollection().getStoragePolicyID() == 
lpPolicy.getId()) {
filesToDelete.add(blockInfo.getBlockCollection());
  }
}

for (BlockCollection bc : filesToDelete) {
  LOG.warn("Removing lazyPersist file " + bc.getName() + " with no 
replicas.");
  changed |= deleteInternal(bc.getName(), false, false, false);
}
  } finally {
writeUnlock();
  }
{code}
In essence, the iteration over the corrupt replica list should be broken down 
into smaller iterations to avoid a single long wait.

Since this operation holds the NameNode write lock for more than 45 seconds, 
the default ZKFC connection timeout, an extreme case like this (100 million 
corrupt blocks) could lead to a NameNode failover.
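A sketch of the mitigation under stated assumptions (hypothetical names, not the FSNamesystem implementation): walk the corrupt-replica iterator in bounded chunks, releasing and re-acquiring the write lock between chunks so no single lock hold can approach the ZKFC timeout.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Bounded lock holds: process at most MAX_BLOCKS_PER_LOCK_HOLD entries per
// write-lock acquisition, then release so waiters (e.g. RPC handlers) proceed.
class ChunkedScrubber {
  private static final int MAX_BLOCKS_PER_LOCK_HOLD = 1000;
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

  /** Scans all blocks; returns how many times the write lock was acquired. */
  int scrub(Iterator<Long> corruptBlocks) {
    int lockAcquisitions = 0;
    while (corruptBlocks.hasNext()) {
      lock.writeLock().lock();
      lockAcquisitions++;
      try {
        for (int i = 0; i < MAX_BLOCKS_PER_LOCK_HOLD && corruptBlocks.hasNext(); i++) {
          corruptBlocks.next(); // check policy, queue lazyPersist file for deletion (elided)
        }
      } finally {
        lock.writeLock().unlock(); // bounded hold; waiters can run between chunks
      }
    }
    return lockAcquisitions;
  }

  public static void main(String[] args) {
    List<Long> blocks = new ArrayList<>();
    for (long i = 0; i < 2500; i++) {
      blocks.add(i);
    }
    // 2500 corrupt blocks at 1000 per lock hold -> 3 acquisitions.
    System.out.println(new ChunkedScrubber().scrub(blocks.iterator()));
  }
}
```

One caveat the real code would have to handle: the underlying corrupt-replica collection can change while the lock is released, so the iterator would need to be re-obtained (or tolerate concurrent modification) between chunks.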






[jira] [Created] (HDFS-13667) Typo: Marking all "datandoes" as stale

2018-06-08 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13667:
--

 Summary: Typo: Marking all "datandoes" as stale
 Key: HDFS-13667
 URL: https://issues.apache.org/jira/browse/HDFS-13667
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Wei-Chiu Chuang


{code:title=DatanodeManager#markAllDatanodesStale}
  public void markAllDatanodesStale() {
LOG.info("Marking all datandoes as stale");
synchronized (this) {
  for (DatanodeDescriptor dn : datanodeMap.values()) {
for(DatanodeStorageInfo storage : dn.getStorageInfos()) {
  storage.markStaleAfterFailover();
}
  }
}
  }

{code}






[jira] [Resolved] (HDFS-13638) DataNode Can't replicate block because NameNode thinks the length is 9223372036854775807

2018-06-07 Thread Wei-Chiu Chuang (JIRA)


 [ 
https://issues.apache.org/jira/browse/HDFS-13638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-13638.

Resolution: Duplicate

Okay, I think this is fixed by HDFS-10453. Resolving this jira. Thanks 
[~hexiaoqiao]!

> DataNode Can't replicate block because NameNode thinks the length is 
> 9223372036854775807
> 
>
> Key: HDFS-13638
> URL: https://issues.apache.org/jira/browse/HDFS-13638
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>    Reporter: Wei-Chiu Chuang
>Priority: Major
>
> I occasionally find the following warning in CDH clusters, but haven't 
> figured out why. I thought I'd better raise the issue anyway.
> {quote}
> 2018-05-29 09:15:58,092 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Can't replicate block 
> BP-725378529-10.0.0.8-1410027444173:blk_13276745777_1112363330268 because 
> on-disk length 175085 is shorter than NameNode recorded length 
> 9223372036854775807
> {quote}
> In fact, 9223372036854775807 = Long.MAX_VALUE.
> I chased this in the HDFS codebase but couldn't find where this length comes from.






[jira] [Created] (HDFS-13662) TestBlockReaderLocal#testStatisticsForErasureCodingRead is flaky

2018-06-07 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13662:
--

 Summary: TestBlockReaderLocal#testStatisticsForErasureCodingRead 
is flaky
 Key: HDFS-13662
 URL: https://issues.apache.org/jira/browse/HDFS-13662
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: erasure-coding, test
Reporter: Wei-Chiu Chuang


The test failed in this precommit for a patch that only modifies an unrelated 
test.
https://builds.apache.org/job/PreCommit-HDFS-Build/24401/testReport/org.apache.hadoop.hdfs.client.impl/TestBlockReaderLocal/testStatisticsForErasureCodingRead/

This test also failed occasionally in our internal testing.

{noformat}
Stacktrace
java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at 
org.apache.hadoop.hdfs.client.impl.TestBlockReaderLocal.testStatisticsForErasureCodingRead(TestBlockReaderLocal.java:842)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at 
org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
{noformat}






[jira] [Created] (HDFS-13659) Add more test coverage for contentSummary for snapshottable path

2018-06-06 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13659:
--

 Summary: Add more test coverage for contentSummary for 
snapshottable path 
 Key: HDFS-13659
 URL: https://issues.apache.org/jira/browse/HDFS-13659
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 2.8.0
Reporter: Wei-Chiu Chuang
Assignee: Wei-Chiu Chuang


HDFS-9063 corrected the behavior of contentSummary for snapshots. This jira 
proposes adding more tests to cover more scenarios:
# create a file, create snapshot, and then update the file
# after snapshot is created, delete a file






[jira] [Created] (HDFS-13638) DataNode Can't replicate block because NameNode thinks the length is 9223372036854775807

2018-05-29 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13638:
--

 Summary: DataNode Can't replicate block because NameNode thinks 
the length is 9223372036854775807
 Key: HDFS-13638
 URL: https://issues.apache.org/jira/browse/HDFS-13638
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Reporter: Wei-Chiu Chuang


I occasionally find the following warning in CDH clusters, but haven't figured 
out why. I thought I'd better raise the issue anyway.
{noformat}
2018-05-29 09:15:58,092 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
Can't replicate block 
BP-725378529-10.0.0.8-1410027444173:blk_13276745777_1112363330268 because 
on-disk length 175085 is shorter than NameNode recorded length 
9223372036854775807{noformat}
In fact, 9223372036854775807 = Long.MAX_VALUE.

I chased this in the HDFS codebase but couldn't find where this length comes from.






[jira] [Created] (HDFS-13635) Incorrect message when block is not found

2018-05-29 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13635:
--

 Summary: Incorrect message when block is not found
 Key: HDFS-13635
 URL: https://issues.apache.org/jira/browse/HDFS-13635
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Reporter: Wei-Chiu Chuang


When a client opens a file, it asks the DataNode to check the block's visible 
length. If the block is somehow not on the DN, the DN throws a "Cannot append to 
a non-existent replica" message, which is incorrect: getReplicaVisibleLength() 
is called for purposes other than appending to a block.
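One way to correct the message is to thread the calling operation into the replica lookup so the exception text matches what the caller was doing. The sketch below is hypothetical (the enum, method, and message strings are made up for illustration, not from the HDFS code):

```java
/**
 * Illustrative sketch: pick a ReplicaNotFound-style message based on
 * the operation that triggered the lookup, instead of always blaming
 * an append.
 */
public class ReplicaErrors {
  public enum Op { APPEND, GET_VISIBLE_LENGTH, RECOVER }

  public static String nonExistentReplicaMessage(Op op, String blockId) {
    switch (op) {
      case APPEND:
        return "Cannot append to a non-existent replica " + blockId;
      case GET_VISIBLE_LENGTH:
        return "Replica not found while getting visible length: " + blockId;
      default:
        return "Replica not found: " + blockId;
    }
  }
}
```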

The following stacktrace comes from CDH 5.13, but it looks like the same 
warning exists in Apache Hadoop trunk.
{noformat}
2018-05-29 09:23:41,966 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 
on 50020, call 
org.apache.hadoop.hdfs.protocol.ClientDatanodeProtocol.getReplicaVisibleLength 
from 10.0.0.14:53217 Call#38334117 Retry#0
org.apache.hadoop.hdfs.server.datanode.ReplicaNotFoundException: Cannot append 
to a non-existent replica BP-725378529-10.236.236.8-1410027444173:13276792346
 at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getReplicaInfo(FsDatasetImpl.java:792)
 at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.getReplicaVisibleLength(FsDatasetImpl.java:2588)
 at 
org.apache.hadoop.hdfs.server.datanode.DataNode.getReplicaVisibleLength(DataNode.java:2756)
 at 
org.apache.hadoop.hdfs.protocolPB.ClientDatanodeProtocolServerSideTranslatorPB.getReplicaVisibleLength(ClientDatanodeProtocolServerSideTranslatorPB.java:107)
 at 
org.apache.hadoop.hdfs.protocol.proto.ClientDatanodeProtocolProtos$ClientDatanodeProtocolService$2.callBlockingMethod(ClientDatanodeProtocolProtos.java:17873)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2217)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2213)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2211){noformat}






[jira] [Created] (HDFS-13613) RegionServer log is flooded with "Execution rejected, Executing in current thread"

2018-05-23 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13613:
--

 Summary: RegionServer log is flooded with "Execution rejected, 
Executing in current thread"
 Key: HDFS-13613
 URL: https://issues.apache.org/jira/browse/HDFS-13613
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.4.0
 Environment: CDH 5.13, HBase RegionServer, Kerberized, hedged read
Reporter: Wei-Chiu Chuang


In the log of an HBase RegionServer with hedged reads enabled, we saw the 
following message flooding the log file.
{noformat}
2018-05-19 17:22:55,691 INFO org.apache.hadoop.hdfs.DFSClient: Execution 
rejected, Executing in current thread
2018-05-19 17:22:55,692 INFO org.apache.hadoop.hdfs.DFSClient: Execution 
rejected, Executing in current thread
2018-05-19 17:22:55,695 INFO org.apache.hadoop.hdfs.DFSClient: Execution 
rejected, Executing in current thread
2018-05-19 17:22:55,696 INFO org.apache.hadoop.hdfs.DFSClient: Execution 
rejected, Executing in current thread
2018-05-19 17:22:55,696 INFO org.apache.hadoop.hdfs.DFSClient: Execution 
rejected, Executing in current thread

{noformat}
Sometimes the RS spits out tens of thousands of lines of this message in a minute. 
We should stop this message from flooding the log file, and also make it more 
actionable. Per a discussion with [~huaxiang], this message can appear when there 
are stale DataNodes.

I believe this issue existed since HDFS-5776.
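A self-contained sketch of one way to stop the flood (this is not the actual DFSClient change; the class and method names are made up): log the first occurrence in each time window, count the rest, and report the count when the next window opens. The timestamp is passed in as a parameter to keep the sketch testable.

```java
/**
 * Illustrative log-suppression helper: at most one emitted line per
 * time window, with a suppressed-message count carried forward.
 */
public class SuppressingLogger {
  private final long windowMillis;
  private long windowStart;
  private boolean loggedThisWindow = false;
  private long suppressed = 0;

  public SuppressingLogger(long windowMillis) {
    this.windowMillis = windowMillis;
  }

  /** Returns the line to emit, or null if the message was suppressed. */
  public synchronized String log(String msg, long nowMillis) {
    if (!loggedThisWindow || nowMillis - windowStart >= windowMillis) {
      long dropped = suppressed;      // how many we swallowed last window
      suppressed = 0;
      windowStart = nowMillis;
      loggedThisWindow = true;
      return dropped > 0
          ? msg + " (" + dropped + " similar messages suppressed)"
          : msg;
    }
    suppressed++;                     // same window: swallow and count
    return null;
  }
}
```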






[jira] [Created] (HDFS-13612) Short-circuit read: unknown response code ERROR

2018-05-23 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13612:
--

 Summary: Short-circuit read: unknown response code ERROR
 Key: HDFS-13612
 URL: https://issues.apache.org/jira/browse/HDFS-13612
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.6.0
 Environment: CDH5.13.3, Kerberized, HBase, hedged reader, short 
circuit read.
Reporter: Wei-Chiu Chuang


Found the following warning in an HBase RS log:

{noformat}
2018-05-19 13:13:00,310 WARN org.apache.hadoop.hdfs.BlockReaderFactory: 
BlockReaderFactory(fileName=, 
block=BP-297993939-10.0.0.1-1402080332426:blk_1816287870_744594480): unknown 
response code ERROR while attempting to set up short-circuit access. Block 
BP-297993939-10.0.0.1-1402080332426:blk_1816287870_744594480 is not valid
{noformat}

Checking the code: if the request fails because of an invalid block access 
token, BlockReaderFactory#requestFileDescriptors expects the error code 
ERROR_ACCESS_TOKEN. However, on the DataNode side, 
DataXceiver#requestShortCircuitFds emits ERROR whenever an IOException is thrown.

ERROR is therefore an expected response, and 
BlockReaderFactory#requestFileDescriptors should handle it better.
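A sketch of the improved handling (the status names mirror HDFS's data-transfer status codes, but the handler, its enums, and the action names are hypothetical): a plain ERROR reply becomes a recognized outcome that triggers fallback to a remote read, instead of being logged as an "unknown response code".

```java
/**
 * Illustrative mapping from a DataNode's short-circuit setup reply to
 * a client-side action; ERROR is treated as a known failure.
 */
public class ShortCircuitReply {
  public enum Status { SUCCESS, ERROR, ERROR_ACCESS_TOKEN, ERROR_UNSUPPORTED }
  public enum Action { USE_SHORT_CIRCUIT, REFETCH_TOKEN, FALLBACK_REMOTE }

  public static Action handle(Status s) {
    switch (s) {
      case SUCCESS:
        return Action.USE_SHORT_CIRCUIT;
      case ERROR_ACCESS_TOKEN:
        return Action.REFETCH_TOKEN;     // retry with a fresh block token
      case ERROR:                        // DN hit an IOException
      case ERROR_UNSUPPORTED:
        return Action.FALLBACK_REMOTE;   // read over TCP instead
      default:
        return Action.FALLBACK_REMOTE;   // truly unknown codes degrade safely
    }
  }
}
```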






[jira] [Resolved] (HDFS-6359) WebHdfs NN servlet issues redirects in safemode or standby

2018-05-15 Thread Wei-Chiu Chuang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-6359.
---
Resolution: Duplicate

I believe this is a dup of HDFS-5122

> WebHdfs NN servlet issues redirects in safemode or standby
> --
>
> Key: HDFS-6359
> URL: https://issues.apache.org/jira/browse/HDFS-6359
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: webhdfs
>Affects Versions: 2.0.0-alpha, 3.0.0-alpha1
>Reporter: Daryn Sharp
>Assignee: Daryn Sharp
>Priority: Critical
>
> Webhdfs does not check for safemode or standby during issuing a redirect for 
> open/create/checksum calls.






[jira] [Resolved] (HDFS-6371) In HA setup, the standby NN should redirect WebHDFS write requests to the active NN

2018-05-15 Thread Wei-Chiu Chuang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-6371.
---
Resolution: Duplicate

I believe this is a dup of HDFS-5122

> In HA setup, the standby NN should redirect WebHDFS write requests to the 
> active NN
> ---
>
> Key: HDFS-6371
> URL: https://issues.apache.org/jira/browse/HDFS-6371
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: namenode, webhdfs
>Reporter: Tsz Wo Nicholas Sze
>Assignee: Tsz Wo Nicholas Sze
>Priority: Major
>
> The current WebHDFS implementation in the namenode does not check its HA state -- 
> it does the same thing whether it is active or standby.
> Suppose an HTTP client talks to the standby NN via WebHDFS.  For the read 
> operations, there is no problem.  For the write operations, if the operation 
> requires an http redirect (e.g. creating a file), it will work since the standby 
> NN will also redirect the client to a DN.  When the client connects to the DN, 
> the DN will fulfill the request with the active NN.  However, for the write 
> operations not requiring an http redirect (e.g. mkdir), the operation will fail 
> with StandbyException since it will be executed on the standby NN.
> There are two solutions:
> # The http client could catch StandbyException and then retries with the 
> other NN in this case.
> # The standby NN redirects the request to the active NN.
> The second solution seems better since the client does not need to know both 
> active NN and standby NN.
> Note that WebHdfsFileSystem is already able to handle HA failover.  The JIRA 
> is for other http clients.






[jira] [Resolved] (HDFS-4702) remove namesystem lock from DatanodeManager#fetchDatanodes

2018-05-15 Thread Wei-Chiu Chuang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-4702?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-4702.
---
  Resolution: Duplicate
Target Version/s:   (was: )

Looking at code, this lock is removed as part of HDFS-5693.

> remove namesystem lock from DatanodeManager#fetchDatanodes
> --
>
> Key: HDFS-4702
> URL: https://issues.apache.org/jira/browse/HDFS-4702
> Project: Hadoop HDFS
>  Issue Type: Sub-task
>  Components: namenode
>Affects Versions: 3.0.0-alpha1
>Reporter: Chris Nauroth
>Assignee: Chris Nauroth
>Priority: Major
>
> {{DatanodeManager#fetchDatanodes}} currently holds the namesystem read lock 
> while iterating through data nodes.  This method is called from the namenode 
> web UI.  HDFS-3990 reported a performance problem in this code path.  This is 
> a follow-up jira to investigate whether or not we can remove the lock.






[jira] [Resolved] (HDFS-5123) Hftp should support namenode logical service names in URI

2018-05-15 Thread Wei-Chiu Chuang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-5123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-5123.
---
Resolution: Won't Fix

hftp is deprecated and was removed in Hadoop 3. Resolving as Won't Fix.

> Hftp should support namenode logical service names in URI
> -
>
> Key: HDFS-5123
> URL: https://issues.apache.org/jira/browse/HDFS-5123
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ha
>Affects Versions: 2.1.0-beta
>Reporter: Arpit Gupta
>Priority: Major
>
> For example if the dfs.nameservices is set to arpit
> {code}
> hdfs dfs -ls hftp://arpit:50070/tmp
> or 
> hdfs dfs -ls hftp://arpit/tmp
> {code}
> does not work
> You have to provide the exact active namenode hostname. On an HA cluster 
> using dfs client one should not need to provide the active nn hostname






[jira] [Resolved] (HDFS-8406) Lease recovery continually failed

2018-05-12 Thread Wei-Chiu Chuang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-8406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-8406.
---
Resolution: Duplicate

> Lease recovery continually failed
> -
>
> Key: HDFS-8406
> URL: https://issues.apache.org/jira/browse/HDFS-8406
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Keith Turner
>Priority: Major
>  Labels: Accumulo, HBase, SolrCloud
>
> While testing Accumulo on a cluster and killing processes, I ran into a 
> situation where the lease on an accumulo write ahead log in HDFS could not be 
> recovered.   Even restarting HDFS and Accumulo would not fix the problem.
> The following message was seen in an Accumulo tablet server log immediately 
> before the tablet server was killed.
> {noformat}
> 2015-05-14 17:12:37,466 [hdfs.DFSClient] WARN : DFSOutputStream 
> ResponseProcessor exception  for block 
> BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060
> java.io.IOException: Bad response ERROR for block 
> BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060 from datanode 
> 10.1.5.9:50010
> at 
> org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:897)
> 2015-05-14 17:12:37,466 [hdfs.DFSClient] WARN : Error Recovery for block 
> BP-802741494-10.1.5.6-1431557089849:blk_1073932823_192060 in pipeline 
> 10.1.5.55:50010, 10.1.5.9:5
> {noformat}
> Before recovering data from a write ahead log, the Accumulo master attempts 
> to recover the lease.   This repeatedly failed with messages like the 
> following.
> {noformat}
> 2015-05-14 17:14:54,301 [recovery.HadoopLogCloser] WARN : Error recovering 
> lease on 
> hdfs://10.1.5.6:1/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException):
>  failed to create file 
> /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for 
> DFSClient_NONMAPREDUCE_950713214_16 for client 10.1.5.158 because 
> pendingCreates is non-null but no leases found.
> {noformat}
> Below is some info from the NN logs for the problematic file.
> {noformat}
> [ec2-user@leader2 logs]$ grep 3a731759-3594-4535-8086-245 
> hadoop-ec2-user-namenode-leader2.log 
> 2015-05-14 17:10:46,299 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> allocateBlock: 
> /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2. 
> BP-802741494-10.1.5.6-1431557089849 
> blk_1073932823_192060{blockUCState=UNDER_CONSTRUCTION, primaryNodeIndex=-1, 
> replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW],
>  
> ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RBW],
>  
> ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}
> 2015-05-14 17:10:46,628 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
> fsync: /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 for 
> DFSClient_NONMAPREDUCE_-1128465883_16
> 2015-05-14 17:14:49,288 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: recoverLease: [Lease.  
> Holder: DFSClient_NONMAPREDUCE_-1128465883_16, pendingcreates: 1], 
> src=/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 from 
> client DFSClient_NONMAPREDUCE_-1128465883_16
> 2015-05-14 17:14:49,288 INFO 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Recovering [Lease.  
> Holder: DFSClient_NONMAPREDUCE_-1128465883_16, pendingcreates: 1], 
> src=/accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2
> 2015-05-14 17:14:49,289 WARN org.apache.hadoop.hdfs.StateChange: DIR* 
> NameSystem.internalReleaseLease: File 
> /accumulo/wal/worker11+9997/3a731759-3594-4535-8086-245eed7cd4c2 has not been 
> closed. Lease recovery is in progress. RecoveryId = 192257 for block 
> blk_1073932823_192060{blockUCState=UNDER_RECOVERY, primaryNodeIndex=2, 
> replicas=[ReplicaUnderConstruction[[DISK]DS-ffe07d7d-0e68-45b8-b3d5-c976f1716481:NORMAL:10.1.5.55:50010|RBW],
>  
> ReplicaUnderConstruction[[DISK]DS-6efec702-3f1f-4ec0-a31f-de947e7e6097:NORMAL:10.1.5.9:50010|RBW],
>  
> ReplicaUnderConstruction[[DISK]DS-5e27df17-abf8-47df-b4bc-c38d0cd426ea:NORMAL:10.1.5.45:50010|RBW]]}
> java.lang.IllegalStateException: Failed to finalize INodeFile 
> 3a731759-3594-4535-8086-245eed7cd4c2 since blocks[0] is non-complete, where 
> blocks=[blk_1073932823_192257{blockUCState=COMMITTED, primaryNodeIndex=2, 
> replicas=[ReplicaUnderConstruction[[DISK]

Re: Apache Hadoop 3.0.3 Release plan

2018-05-08 Thread Wei-Chiu Chuang
Thanks Yongjun for driving 3.0.3 release!

IMHO, could we consider adding YARN-7190 into the list?
I understand that it is listed as an incompatible change; however, because
of this bug, HBase considers the entire Hadoop 3.0.x line not production-ready.
I feel there's not much point in releasing more 3.0.x releases if
downstream projects can't pick them up, given that HBase is one of
the most important projects around Hadoop.

On Mon, May 7, 2018 at 1:19 PM, Yongjun Zhang  wrote:

> Hi Eric,
>
> Thanks for the feedback, good point. I will try to clean up things, then
> cut branch before the release production and vote.
>
> Best,
>
> --Yongjun
>
> On Mon, May 7, 2018 at 8:39 AM, Eric Payne  invalid
> > wrote:
>
> > >  We plan to cut branch-3.0.3 by the coming Wednesday (May 9th) and vote
> > for RC on May 30th
> > I much prefer to wait to cut the branch until just before the production
> > of the release and the vote. With so many branches, we sometimes miss
> > putting critical bug fixes in unreleased branches if the branch is cut
> too
> > early.
> >
> > My 2 cents...
> > Thanks,
> > -Eric Payne
> >
> >
> >
> >
> >
> > On Monday, May 7, 2018, 12:09:00 AM CDT, Yongjun Zhang <
> > yjzhan...@apache.org> wrote:
> >
> >
> >
> >
> >
> > Hi All,
> >
> > >
> > We have released Apache Hadoop 3.0.2 in April of this year [1]. Since
> then,
> > there are quite some commits done to branch-3.0. To further improve the
> > quality of release, we plan to do 3.0.3 release now. The focus of 3.0.3
> > will be fixing blockers (3), critical bugs (17) and bug fixes (~130), see
> > [2].
> >
> > Usually no new feature should be included for maintenance releases, I
> > noticed we have https://issues.apache.org/jira/browse/HADOOP-13055 in
> the
> > branch classified as new feature. I will talk with the developers to see
> if
> > we should include it in 3.0.3.
> >
> > I also noticed that there are more commits in the branch than can be
> found
> > by query [2], also some commits committed to 3.0.3 do not have their jira
> > target release field filled in accordingly. I will go through them to
> > update the jira.
> >
> > >
> > We plan to cut branch-3.0.3 by the coming Wednesday (May 9th) and vote
> for
> > RC on May 30th, targeting for Jun 8th release.
> >
> > >
> > Your insights are welcome.
> >
> > >
> > [1] https://www.mail-archive.com/general@hadoop.apache.org/msg07790.html
> >
> > > [2] https://issues.apache.org/jira/issues/?filter=12343874  See Note
> > below
> > Note: seems I need some admin change so that I can make the filter in [2]
> > public, I'm working on that. For now, you can use jquery
> > (project = hadoop OR project = "Hadoop HDFS" OR project = "Hadoop YARN"
> OR
> > project = "Hadoop Map/Reduce") AND fixVersion in (3.0.3) ORDER BY
> priority
> > DESC
> >
> > Thanks and best regards,
> >
> > --Yongjun
> >
> > -
> > To unsubscribe, e-mail: mapreduce-dev-unsubscr...@hadoop.apache.org
> > For additional commands, e-mail: mapreduce-dev-h...@hadoop.apache.org
> >
> >
>



-- 
A very happy Hadoop contributor


Re: [DISCUSSION] Create a branch to work on non-blocking access to HDFS

2018-05-03 Thread Wei-Chiu Chuang
Given that HBase 2 uses async output by default, the way that code is
maintained today in HBase is not sustainable. That piece of code should be
maintained in HDFS. I am +1 as a participant in both communities.

On Thu, May 3, 2018 at 9:14 AM, Stack  wrote:

> Ok with you lot if a few of us open a branch to work on a non-blocking HDFS
> client?
>
> Intent is to finish up the old issue "HDFS-9924 [umbrella] Nonblocking HDFS
> Access". On the foot of this umbrella JIRA is a proposal by the
> heavy-lifter, Duo Zhang. Over in HBase, we have a limited async DFS client
> (written by Duo) that we use making Write-Ahead Logs. We call it
> AsyncFSWAL. It was shipped as the default WAL writer in hbase-2.0.0.
>
> Let me quote Duo from his proposal at the base of HDFS-9924:
>
> We use lots of internal APIs of HDFS to implement the AsyncFSWAL, so it
> is expected that things like HBASE-20244
> 
> ["NoSuchMethodException
> when retrieving private method decryptEncryptedDataEncryptionKey from
> DFSClient"] will happen again and again.
>
> To make life easier, we need to move the async output related code into
> HDFS. The POC [attached as patch on HDFS-9924] shows that option 3 [1] can
> work, so I would like to create a feature branch to implement the async dfs
> client. In general I think there are 4 steps:
>
> 1. Implement an async rpc client with option 3 [1] described above.
> 2. Implement the filesystem APIs which only need to connect to NN, such as
> 'mkdirs'.
> 3. Implement async file read. The problem is the API. For pread I think a
> CompletableFuture is enough, the problem is for the streaming read. Need to
> discuss later.
> 4. Implement async file write. The API will also be a problem, but a more
> important problem is that, if we want to support fan-out, the current logic
> at DN side will make the semantic broken as we can read uncommitted data
> very easily. In HBase it is solved by HBASE-14004
>  but I do not think we
> should keep the broken behavior in HDFS. We need to find a way to deal with
> it.
>
> Comments welcome.
>
> Intent is to make a branch named HDFS-9924 (or should we just do a new
> JIRA?) and to add Duo as a feature branch committer. If all goes well,
> we'll call for a merge VOTE.
>
> Thanks,
> St.Ack
>
> 1.Option 3:  "Use the old protobuf rpc interface and implement a new rpc
> framework. The benefit is that we also do not need port unification service
> at server side and do not need to maintain two implementations at server
> side. And one more thing is that we do not need to upgrade protobuf to
> 3.x."
>



-- 
A very happy Hadoop contributor


[jira] [Created] (HDFS-13524) Occasional "All datanodes are bad" error in TestLargeBlock#testLargeBlockSize

2018-05-02 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13524:
--

 Summary: Occasional "All datanodes are bad" error in 
TestLargeBlock#testLargeBlockSize
 Key: HDFS-13524
 URL: https://issues.apache.org/jira/browse/HDFS-13524
 Project: Hadoop HDFS
  Issue Type: Bug
 Environment: TestLargeBlock#testLargeBlockSize may fail with error:
{quote}
All datanodes 
[DatanodeInfoWithStorage[127.0.0.1:44968,DS-acddd79e-cdf1-4ac5-aac5-e804a2e61600,DISK]]
 are bad. Aborting...
{quote}

Tracing back, the error is due to the stress applied to the host sending a 2GB 
block, causing a write-pipeline ack read timeout:
{quote}
2017-09-10 22:16:07,285 [DataXceiver for client 
DFSClient_NONMAPREDUCE_998779779_9 at /127.0.0.1:57794 [Receiving block 
BP-682118952-172.26.15.143-1505106964162:blk_1073741825_1001]] INFO  
datanode.DataNode (DataXceiver.java:writeBlock(742)) - Receiving 
BP-682118952-172.26.15.143-1505106964162:blk_1073741825_1001 src: 
/127.0.0.1:57794 dest: /127.0.0.1:44968
2017-09-10 22:16:50,402 [DataXceiver for client 
DFSClient_NONMAPREDUCE_998779779_9 at /127.0.0.1:57794 [Receiving block 
BP-682118952-172.26.15.143-1505106964162:blk_1073741825_1001]] WARN  
datanode.DataNode (BlockReceiver.java:flushOrSync(434)) - Slow flushOrSync took 
5383ms (threshold=300ms), isSync:false, flushTotalNanos=5383638982ns, 
volume=file:/tmp/tmp.1oS3ZfDCwq/src/hadoop/hadoop-hdfs-project/hadoop-hdfs/target/test/data/dfs/data/data1/
2017-09-10 22:17:54,427 [ResponseProcessor for block 
BP-682118952-172.26.15.143-1505106964162:blk_1073741825_1001] WARN  
hdfs.DataStreamer (DataStreamer.java:run(1214)) - Exception for 
BP-682118952-172.26.15.143-1505106964162:blk_1073741825_1001
java.net.SocketTimeoutException: 65000 millis timeout while waiting for channel 
to be ready for read. ch : java.nio.channels.SocketChannel[connected 
local=/127.0.0.1:57794 remote=/127.0.0.1:44968]
at 
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
at 
org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at 
org.apache.hadoop.hdfs.protocolPB.PBHelperClient.vintPrefixed(PBHelperClient.java:434)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:213)
at 
org.apache.hadoop.hdfs.DataStreamer$ResponseProcessor.run(DataStreamer.java:1104)
2017-09-10 22:17:54,432 [DataXceiver for client 
DFSClient_NONMAPREDUCE_998779779_9 at /127.0.0.1:57794 [Receiving block 
BP-682118952-172.26.15.143-1505106964162:blk_1073741825_1001]] INFO  
datanode.DataNode (BlockReceiver.java:receiveBlock(1000)) - Exception for 
BP-682118952-172.26.15.143-1505106964162:blk_1073741825_1001
java.io.IOException: Connection reset by peer
{quote}

Instead of raising the read timeout, I suggest increasing the cluster size from 
the default of 1 to 3, so that the client has the opportunity to choose a 
different DN and resend.

I suspect this started failing after HDFS-13103 (Hadoop 2.8/3.0.0-alpha1), which 
introduced the client acknowledgement read timeout.
Reporter: Wei-Chiu Chuang









[jira] [Resolved] (HDFS-6589) TestDistributedFileSystem.testAllWithNoXmlDefaults failed intermittently

2018-05-02 Thread Wei-Chiu Chuang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-6589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-6589.
---
Resolution: Cannot Reproduce

Resolving as Cannot Reproduce. The last time I saw this bug was two years ago. 
Most likely it was a real bug that was fixed later.

> TestDistributedFileSystem.testAllWithNoXmlDefaults failed intermittently
> 
>
> Key: HDFS-6589
> URL: https://issues.apache.org/jira/browse/HDFS-6589
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs-client
>Affects Versions: 2.5.0
>Reporter: Yongjun Zhang
>Assignee: Wei-Chiu Chuang
>Priority: Major
>  Labels: flaky-test
>
> https://builds.apache.org/job/PreCommit-HDFS-Build/7207 is clean
> https://builds.apache.org/job/PreCommit-HDFS-Build/7208 has the following 
> failure. The code is essentially the same.
> Running the same test locally doesn't reproduce the failure, so this looks like a flaky test.
> {code}
> Stacktrace
> java.lang.AssertionError: null
>   at org.junit.Assert.fail(Assert.java:86)
>   at org.junit.Assert.assertTrue(Assert.java:41)
>   at org.junit.Assert.assertFalse(Assert.java:64)
>   at org.junit.Assert.assertFalse(Assert.java:74)
>   at 
> org.apache.hadoop.hdfs.TestDistributedFileSystem.testDFSClient(TestDistributedFileSystem.java:263)
>   at 
> org.apache.hadoop.hdfs.TestDistributedFileSystem.testAllWithNoXmlDefaults(TestDistributedFileSystem.java:651)
> {code}






[jira] [Created] (HDFS-13521) NFS Gateway should support impersonation

2018-05-02 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13521:
--

 Summary: NFS Gateway should support impersonation
 Key: HDFS-13521
 URL: https://issues.apache.org/jira/browse/HDFS-13521
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Wei-Chiu Chuang


Similar to HDFS-10481, the NFS gateway and httpfs are independent processes that 
accept client connections.
The NFS Gateway currently solves the file permission/ownership problem by running 
as the HDFS super user and then calling setOwner() to change the file owner.

This is not desirable:
# it adds additional RPC load to the NameNode.
# it does not support encryption at rest, because by design the HDFS super user 
cannot access the KMS.

This is yet another problem around KMS ACLs. [~xiaochen] [~rushabh.shah] 
thoughts?
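For illustration, impersonation via HDFS proxy-user settings (the HDFS-10481 style 
approach) would look roughly like the fragment below. The service user name 
"nfsserver" and the gateway hostname are assumptions for the sketch, not values 
from this issue.

```xml
<!-- Hypothetical core-site.xml fragment: let the gateway's service user
     (assumed name: nfsserver) impersonate end users instead of running as
     the HDFS super user. -->
<property>
  <name>hadoop.proxyuser.nfsserver.groups</name>
  <!-- groups whose members the gateway may impersonate -->
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.nfsserver.hosts</name>
  <!-- hosts from which impersonated requests are accepted -->
  <value>nfs-gateway.example.com</value>
</property>
```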






[jira] [Created] (HDFS-13520) fuse_dfs to support keytab based login

2018-05-02 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13520:
--

 Summary: fuse_dfs to support keytab based login
 Key: HDFS-13520
 URL: https://issues.apache.org/jira/browse/HDFS-13520
 Project: Hadoop HDFS
  Issue Type: Improvement
Affects Versions: 2.6.0
 Environment: Hadoop 2.6/3.0, Kerberized, fuse_dfs
Reporter: Wei-Chiu Chuang


It looks like the current fuse_dfs implementation only supports login using the 
current Kerberos credential cache. If the TGT expires, it fails with the following 
error:
{noformat}
hdfsBuilderConnect(forceNewInstance=1, nn=hdfs://ns1, port=0, 
kerbTicketCachePath=/tmp/krb5cc_2000, userName=systest) error:
LoginException: Unable to obtain Principal Name for authentication 
org.apache.hadoop.security.KerberosAuthException: failure to login: for user: 
systest using ticket cache file: /tmp/krb5cc_2000 
javax.security.auth.login.LoginException: Unable to obtain Principal Name for 
authentication
at 
org.apache.hadoop.security.UserGroupInformation.getUGIFromTicketCache(UserGroupInformation.java:807)
at 
org.apache.hadoop.security.UserGroupInformation.getBestUGI(UserGroupInformation.java:742)
at org.apache.hadoop.fs.FileSystem.newInstance(FileSystem.java:404)
Caused by: javax.security.auth.login.LoginException: Unable to obtain Principal 
Name for authentication
at 
com.sun.security.auth.module.Krb5LoginModule.promptForName(Krb5LoginModule.java:841)
at 
com.sun.security.auth.module.Krb5LoginModule.attemptAuthentication(Krb5LoginModule.java:704)
at com.sun.security.auth.module.Krb5LoginModule.login(Krb5LoginModule.java:617)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at javax.security.auth.login.LoginContext.invoke(LoginContext.java:755)
at javax.security.auth.login.LoginContext.access$000(LoginContext.java:195)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:682)
at javax.security.auth.login.LoginContext$4.run(LoginContext.java:680)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.login.LoginContext.invokePriv(LoginContext.java:680)
at javax.security.auth.login.LoginContext.login(LoginContext.java:587)
at 
org.apache.hadoop.security.UserGroupInformation.getUGIFromTicketCache(UserGroupInformation.java:788)
... 2 more

{noformat}
This is easily reproducible in a test cluster with an extremely short ticket 
lifetime (e.g. 1 minute).

Note: HDFS-3608 addresses a similar issue, but in this case, since the ticket 
cache file itself does not change, fuse cannot detect and pick up the update.

It looks like fuse_dfs should call UserGroupInformation#loginUserFromKeytab() at 
startup, similar to how the balancer supports keytab-based login (HDFS-9804). 
Thanks [~xiaochen] for the idea.

Alternatively, a background thread could periodically re-login from the keytab.
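As a rough illustration of the background-relogin alternative, here is a 
standalone sketch of a periodic relogin thread. `checkTgtAndRelogin` is a stub 
standing in for Hadoop's `UserGroupInformation.checkTGTAndReloginFromKeytab()`, 
and the 60-second period is an assumption, not a value from this issue.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class KeytabRelogin {
    static final AtomicInteger reloginChecks = new AtomicInteger();

    // Stub for UserGroupInformation.checkTGTAndReloginFromKeytab(); the real
    // call lives in hadoop-common and re-acquires the TGT near expiry.
    static void checkTgtAndRelogin() {
        reloginChecks.incrementAndGet();
    }

    // Run a relogin check periodically, well inside the ticket lifetime.
    static ScheduledExecutorService startReloginThread(long periodSeconds) {
        ScheduledExecutorService ses =
                Executors.newSingleThreadScheduledExecutor();
        ses.scheduleAtFixedRate(KeytabRelogin::checkTgtAndRelogin,
                0, periodSeconds, TimeUnit.SECONDS);
        return ses;
    }

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService ses = startReloginThread(60);
        Thread.sleep(200);  // initialDelay = 0, so the first check has fired
        ses.shutdownNow();
        System.out.println("relogin checks so far: " + reloginChecks.get());
    }
}
```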






[jira] [Created] (HDFS-13492) Limit httpfs binds to certain IP addresses in branch-2

2018-04-23 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13492:
--

 Summary: Limit httpfs binds to certain IP addresses in branch-2
 Key: HDFS-13492
 URL: https://issues.apache.org/jira/browse/HDFS-13492
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: httpfs
Affects Versions: 2.6.0
Reporter: Wei-Chiu Chuang
Assignee: Wei-Chiu Chuang


Currently httpfs binds to all IP addresses of the host by default. Some 
operators want to limit httpfs to accepting only local connections.

We should provide that option; it is quite doable in Hadoop 2.x.

Note that the underlying httpfs implementation changed in Hadoop 3, and I believe 
the Jetty-based implementation already supports this.






[jira] [Created] (HDFS-13487) Backport HDFS-11915 to branch-2.8, branch-2.7

2018-04-19 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13487:
--

 Summary: Backport HDFS-11915 to branch-2.8, branch-2.7
 Key: HDFS-13487
 URL: https://issues.apache.org/jira/browse/HDFS-13487
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Wei-Chiu Chuang
Assignee: Wei-Chiu Chuang


HDFS-11915 (Sync rbw dir on the first hsync() to avoid file loss on power 
failure) is a nice fix to have. Since the related fix HDFS-5042 was backported to 
branch-2.8 and branch-2.7, HDFS-11915 should be backported as well.






[jira] [Created] (HDFS-13486) Backport HDFS-11817 to branch-2.7

2018-04-19 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13486:
--

 Summary: Backport HDFS-11817 to branch-2.7
 Key: HDFS-13486
 URL: https://issues.apache.org/jira/browse/HDFS-13486
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Wei-Chiu Chuang
Assignee: Wei-Chiu Chuang


HDFS-11817 is a good fix to have in branch-2.7.

I'm taking a stab at it now.






[jira] [Created] (HDFS-13485) DataNode WebUI throws NPE

2018-04-19 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13485:
--

 Summary: DataNode WebUI throws NPE
 Key: HDFS-13485
 URL: https://issues.apache.org/jira/browse/HDFS-13485
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode, webhdfs
Affects Versions: 3.0.0
 Environment: Kerberized. Hadoop 3.0.0, WebHDFS.
Reporter: Wei-Chiu Chuang


curl -k -i --negotiate -u : "https://hadoop3-4.example.com:20004/webhdfs/v1"

The DataNode web UI should do better error checking and handling.

{noformat}
2018-04-19 10:07:49,338 WARN 
org.apache.hadoop.hdfs.server.datanode.web.webhdfs.WebHdfsHandler: 
INTERNAL_SERVER_ERROR
java.lang.NullPointerException
at org.apache.hadoop.security.token.Token.decodeWritable(Token.java:364)
at 
org.apache.hadoop.security.token.Token.decodeFromUrlString(Token.java:383)
at 
org.apache.hadoop.hdfs.server.datanode.web.webhdfs.ParameterParser.delegationToken(ParameterParser.java:128)
at 
org.apache.hadoop.hdfs.server.datanode.web.webhdfs.DataNodeUGIProvider.ugi(DataNodeUGIProvider.java:76)
at 
org.apache.hadoop.hdfs.server.datanode.web.webhdfs.WebHdfsHandler.channelRead0(WebHdfsHandler.java:129)
at 
org.apache.hadoop.hdfs.server.datanode.web.URLDispatcher.channelRead0(URLDispatcher.java:51)
at 
org.apache.hadoop.hdfs.server.datanode.web.URLDispatcher.channelRead0(URLDispatcher.java:31)
at 
com.cloudera.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at 
com.cloudera.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
com.cloudera.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
com.cloudera.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at 
com.cloudera.io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
at 
com.cloudera.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
com.cloudera.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
com.cloudera.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at 
com.cloudera.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
at 
com.cloudera.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284)
at 
com.cloudera.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
com.cloudera.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
com.cloudera.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at 
com.cloudera.io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1379)
at 
com.cloudera.io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1158)
at 
com.cloudera.io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1193)
at 
com.cloudera.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:489)
at 
com.cloudera.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:428)
at 
com.cloudera.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
at 
com.cloudera.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
com.cloudera.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
com.cloudera.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
at 
com.cloudera.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
at 
com.cloudera.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
at 
com.cloudera.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
at 
com.cloudera.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
at 
com.cloudera.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:138)
at 
com.cloudera.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
at 
com.cloudera.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.jav

[jira] [Created] (HDFS-13440) Log HDFS file name when client fails to connect

2018-04-12 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13440:
--

 Summary: Log HDFS file name when client fails to connect
 Key: HDFS-13440
 URL: https://issues.apache.org/jira/browse/HDFS-13440
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Wei-Chiu Chuang


HDFS-11993 added the block name to the log message when a DFSClient fails to 
connect, which is good.

As a follow-on, it could also log the HDFS file name, e.g. in 
DFSInputStream#actualGetFromOneDataNode.






Re: Apache Hadoop 2.9.1 Release Plan

2018-04-06 Thread Wei-Chiu Chuang
Done.

Both HDFS-11915 and HDFS-13364 are in branch-2.9 and branch-2.9.1.

Thank you Sammi for driving the 2.9.1 release.

On Fri, Apr 6, 2018 at 10:14 AM, Wei-Chiu Chuang <weic...@cloudera.com>
wrote:

>
> Looks like HDFS-13347 <https://issues.apache.org/jira/browse/HDFS-13347> breaks
> compilation, which is later fixed by HDFS-13364
> <https://issues.apache.org/jira/browse/HDFS-13364>. But HDFS-13364 is
> only available in 2.9.2 (branch-2.9).
>
> Before I cherry-pick HDFS-11915, I will cherry-pick HDFS-13364 first
> into branch-2.9.1.
>
>
> On Fri, Apr 6, 2018 at 9:46 AM, Chen, Sammi <sammi.c...@intel.com> wrote:
>
>> Hi Wei-Chiu,
>>
>> HDFS-11915 improves the data reliability when there is power failure.
>> I'm happy to include it in 2.9.1 release.
>>
>> Would you help to cherry-pick it to both branch-2.9 and branch-2.9.1?
>>
>>
>> Thanks,
>> Sammi
>>
>> -Original Message-
>> From: Wei-Chiu Chuang [mailto:weic...@cloudera.com]
>> Sent: Thursday, April 5, 2018 1:30 AM
>> To: Chen, Sammi <sammi.c...@intel.com>
>> Cc: hdfs-dev <hdfs-dev@hadoop.apache.org>; mapreduce-...@hadoop.apache.or
>> g; common-...@hadoop.apache.org; yarn-...@hadoop.apache.org
>> Subject: Re: Apache Hadoop 2.9.1 Release Plan
>>
>> Sorry Sammi I was late to this thread.
>> Please considering incorporating HDFS-11915. Sync rbw dir on the first
>> hsync() to avoid file lost on power failure.
>> I thought it was already in 2.9.1 but turns out it didn't land. The
>> cherry pick to branch-2.9 is conflict free.
>>
>> On Mon, Apr 2, 2018 at 4:34 AM, Chen, Sammi <sammi.c...@intel.com> wrote:
>>
>> > Hi All,
>> >
>> > Today I have created branch-2.9.1 from branch-2.9 and started creating
>> the
>> > RC0  based on branch-2.9.1.   But due to the corporate network
>> conditions
>> > and my not full privileges on Hadoop,   it will take a while for RC0 to
>> > come out.
>> >
>> > If you have anything want to commit to branch-2.9.1,  please let me
>> know.
>> >
>> > Also I will update fix version of all  2.9.1 JIRAs and moved all
>> > unresolved JIRA with target version = 2.9.1 to 2.9.2.
>> >
>> >
>> >
>> > Bests,
>> > Sammi Chen
>> >
>> > -Original Message-
>> > From: Chen, Sammi [mailto:sammi.c...@intel.com]
>> > Sent: Friday, March 30, 2018 3:55 PM
>> > To: hdfs-dev <hdfs-dev@hadoop.apache.org>;
>> > mapreduce-...@hadoop.apache.org; common-...@hadoop.apache.org;
>> > yarn-...@hadoop.apache.org
>> > Subject: Apache Hadoop 2.9.1 Release Plan
>> >
>> > Hi All,
>> >
>> > We have 47 changes on 2.9 branch since last release on Nov. 2017.
>>  There
>> > are 7 blockers, 5 critical issues and rest are normal bug fixes and
>> > feature improvements.
>> >
>> >
>> >
>> >
>> >
>> > Here are current tasks targeting for 2.9.1.  No critical and blockers
>> > so far.
>> >
>> > https://issues.apache.org/jira/issues/?jql=%22Target+
>> > Version%2Fs%22+%3D+2.9.1+AND+%28project+%3D+hadoop+OR+
>> > project+%3D+%22Hadoop+HDFS%22+OR+project+%3D+%22Hadoop+YARN%
>> > 22+OR+project+%3D+%22Hadoop+Map%2FReduce%22+OR+project+%
>> > 3D+%22Hadoop+Common%22%29+AND+status+%21%3D+resolved+ORDER+
>> > BY+priority+DESC
>> >
>> >
>> > I plan to cut the 2.9.1 branch today, and try to deliver the RC0  ASAP.
>> >  Please let me know if you have any objections or suggestions.
>> >
>> >
>> >
>> >
>> >
>> >
>> > Bests,
>> >
>> > Sammi
>> >
>> >
>> >
>> >
>> > -
>> > To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
>> > For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
>> >
>> >
>>
>>
>> --
>> A very happy Clouderan
>>
>
>
>
> --
> A very happy Clouderan
>



-- 
A very happy Hadoop contributor


Re: Apache Hadoop 2.9.1 Release Plan

2018-04-06 Thread Wei-Chiu Chuang
Looks like HDFS-13347 <https://issues.apache.org/jira/browse/HDFS-13347> breaks
compilation, which is later fixed by HDFS-13364
<https://issues.apache.org/jira/browse/HDFS-13364>. But HDFS-13364 is
only available in 2.9.2 (branch-2.9).

Before I cherry-pick HDFS-11915, I will cherry-pick HDFS-13364 first into
branch-2.9.1.


On Fri, Apr 6, 2018 at 9:46 AM, Chen, Sammi <sammi.c...@intel.com> wrote:

> Hi Wei-Chiu,
>
> HDFS-11915 improves the data reliability when there is power failure.  I'm
> happy to include it in 2.9.1 release.
>
> Would you help to cherry-pick it to both branch-2.9 and branch-2.9.1?
>
>
> Thanks,
> Sammi
>
> -Original Message-
> From: Wei-Chiu Chuang [mailto:weic...@cloudera.com]
> Sent: Thursday, April 5, 2018 1:30 AM
> To: Chen, Sammi <sammi.c...@intel.com>
> Cc: hdfs-dev <hdfs-dev@hadoop.apache.org>; mapreduce-...@hadoop.apache.org;
> common-...@hadoop.apache.org; yarn-...@hadoop.apache.org
> Subject: Re: Apache Hadoop 2.9.1 Release Plan
>
> Sorry Sammi I was late to this thread.
> Please considering incorporating HDFS-11915. Sync rbw dir on the first
> hsync() to avoid file lost on power failure.
> I thought it was already in 2.9.1 but turns out it didn't land. The cherry
> pick to branch-2.9 is conflict free.
>
> On Mon, Apr 2, 2018 at 4:34 AM, Chen, Sammi <sammi.c...@intel.com> wrote:
>
> > Hi All,
> >
> > Today I have created branch-2.9.1 from branch-2.9 and started creating
> the
> > RC0  based on branch-2.9.1.   But due to the corporate network conditions
> > and my not full privileges on Hadoop,   it will take a while for RC0 to
> > come out.
> >
> > If you have anything want to commit to branch-2.9.1,  please let me know.
> >
> > Also I will update fix version of all  2.9.1 JIRAs and moved all
> > unresolved JIRA with target version = 2.9.1 to 2.9.2.
> >
> >
> >
> > Bests,
> > Sammi Chen
> >
> > -Original Message-
> > From: Chen, Sammi [mailto:sammi.c...@intel.com]
> > Sent: Friday, March 30, 2018 3:55 PM
> > To: hdfs-dev <hdfs-dev@hadoop.apache.org>;
> > mapreduce-...@hadoop.apache.org; common-...@hadoop.apache.org;
> > yarn-...@hadoop.apache.org
> > Subject: Apache Hadoop 2.9.1 Release Plan
> >
> > Hi All,
> >
> > We have 47 changes on 2.9 branch since last release on Nov. 2017.   There
> > are 7 blockers, 5 critical issues and rest are normal bug fixes and
> > feature improvements.
> >
> >
> >
> >
> >
> > Here are current tasks targeting for 2.9.1.  No critical and blockers
> > so far.
> >
> > https://issues.apache.org/jira/issues/?jql=%22Target+
> > Version%2Fs%22+%3D+2.9.1+AND+%28project+%3D+hadoop+OR+
> > project+%3D+%22Hadoop+HDFS%22+OR+project+%3D+%22Hadoop+YARN%
> > 22+OR+project+%3D+%22Hadoop+Map%2FReduce%22+OR+project+%
> > 3D+%22Hadoop+Common%22%29+AND+status+%21%3D+resolved+ORDER+
> > BY+priority+DESC
> >
> >
> > I plan to cut the 2.9.1 branch today, and try to deliver the RC0  ASAP.
> >  Please let me know if you have any objections or suggestions.
> >
> >
> >
> >
> >
> >
> > Bests,
> >
> > Sammi
> >
> >
> >
> >
> > -
> > To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
> > For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
> >
> >
>
>
> --
> A very happy Clouderan
>



-- 
A very happy Clouderan


Re: Apache Hadoop 2.9.1 Release Plan

2018-04-04 Thread Wei-Chiu Chuang
Sorry Sammi, I was late to this thread.
Please consider incorporating HDFS-11915 (Sync rbw dir on the first
hsync() to avoid file loss on power failure).
I thought it was already in 2.9.1, but it turns out it didn't land.
The cherry-pick to branch-2.9 is conflict-free.

On Mon, Apr 2, 2018 at 4:34 AM, Chen, Sammi  wrote:

> Hi All,
>
> Today I have created branch-2.9.1 from branch-2.9 and started creating the
> RC0  based on branch-2.9.1.   But due to the corporate network conditions
> and my not full privileges on Hadoop,   it will take a while for RC0 to
> come out.
>
> If you have anything want to commit to branch-2.9.1,  please let me know.
>
> Also I will update fix version of all  2.9.1 JIRAs and moved all
> unresolved JIRA with target version = 2.9.1 to 2.9.2.
>
>
>
> Bests,
> Sammi Chen
>
> -Original Message-
> From: Chen, Sammi [mailto:sammi.c...@intel.com]
> Sent: Friday, March 30, 2018 3:55 PM
> To: hdfs-dev ; mapreduce-...@hadoop.apache.org;
> common-...@hadoop.apache.org; yarn-...@hadoop.apache.org
> Subject: Apache Hadoop 2.9.1 Release Plan
>
> Hi All,
>
> We have 47 changes on 2.9 branch since last release on Nov. 2017.   There
> are 7 blockers, 5 critical issues and rest are normal bug fixes and feature
> improvements.
>
>
>
>
>
> Here are current tasks targeting for 2.9.1.  No critical and blockers so
> far.
>
> https://issues.apache.org/jira/issues/?jql=%22Target+
> Version%2Fs%22+%3D+2.9.1+AND+%28project+%3D+hadoop+OR+
> project+%3D+%22Hadoop+HDFS%22+OR+project+%3D+%22Hadoop+YARN%
> 22+OR+project+%3D+%22Hadoop+Map%2FReduce%22+OR+project+%
> 3D+%22Hadoop+Common%22%29+AND+status+%21%3D+resolved+ORDER+
> BY+priority+DESC
>
>
> I plan to cut the 2.9.1 branch today, and try to deliver the RC0  ASAP.
>  Please let me know if you have any objections or suggestions.
>
>
>
>
>
>
> Bests,
>
> Sammi
>
>
>
>
> -
> To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
>
>


-- 
A very happy Clouderan


[jira] [Created] (HDFS-13393) Improve OOM logging

2018-04-03 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13393:
--

 Summary: Improve OOM logging
 Key: HDFS-13393
 URL: https://issues.apache.org/jira/browse/HDFS-13393
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: balancer  mover, datanode
Reporter: Wei-Chiu Chuang


It is not uncommon to find "java.lang.OutOfMemoryError: unable to create new 
native thread" errors in an HDFS cluster. Most often this happens when the 
DataNode creates DataXceiver threads, or when the balancer creates threads for 
moving blocks around.

In most cases, the "OOM" is a symptom of the number of threads reaching a system 
limit, rather than of actually running out of memory.

How about capturing the OOM and, if it is due to "unable to create new native 
thread", printing a more helpful message like "bump your ulimit" or "take a 
jstack of the process"?
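A minimal standalone sketch of that idea; the method name `adviceFor` and the 
advice strings are illustrative, not existing Hadoop APIs:

```java
public class OomHint {
    // Translate the common "unable to create new native thread" OOM into
    // actionable advice instead of a bare stack trace.
    static String adviceFor(OutOfMemoryError e) {
        String msg = e.getMessage();
        if (msg != null && msg.contains("unable to create new native thread")) {
            return "Likely a thread/process limit, not heap exhaustion: "
                 + "bump your ulimit (nproc) or take a jstack of the process.";
        }
        return "Heap exhausted: consider raising -Xmx or capturing a heap dump.";
    }

    public static void main(String[] args) {
        OutOfMemoryError e =
                new OutOfMemoryError("unable to create new native thread");
        System.out.println(adviceFor(e));
    }
}
```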






[jira] [Resolved] (HDFS-12165) getSnapshotDiffReport throws NegativeArraySizeException for very large snapshot diff summary

2018-03-29 Thread Wei-Chiu Chuang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-12165.

Resolution: Duplicate

> getSnapshotDiffReport throws NegativeArraySizeException for very large 
> snapshot diff summary
> 
>
> Key: HDFS-12165
> URL: https://issues.apache.org/jira/browse/HDFS-12165
> Project: Hadoop HDFS
>  Issue Type: Bug
>        Reporter: Wei-Chiu Chuang
>Priority: Major
>
> For a really large snapshot diff, getSnapshotDiffReport throws 
> NegativeArraySizeException
> {noformat}
> 2017-07-19 11:14:16,415 WARN org.apache.hadoop.ipc.Server: Error serializing 
> call response for call 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.getSnapshotDiffReport
>  from 10.17.211.10:58223 Call#0 Retry#0
> java.lang.NegativeArraySizeException
> at 
> com.google.protobuf.CodedOutputStream.newInstance(CodedOutputStream.java:105)
> at 
> com.google.protobuf.AbstractMessageLite.writeDelimitedTo(AbstractMessageLite.java:87)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$RpcResponseWrapper.write(ProtobufRpcEngine.java:468)
> at org.apache.hadoop.ipc.Server.setupResponse(Server.java:2410)
> at org.apache.hadoop.ipc.Server.access$500(Server.java:134)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2182)
> {noformat}
> This particular snapshot diff contains more than 25 million different file 
> system objects, which means the serialized response can be more than 2 GB, 
> overflowing protobuf's length calculation.
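The root cause can be mimicked without Hadoop or protobuf: a serialized size 
above Integer.MAX_VALUE cannot fit in a Java int, so a length computed in int 
arithmetic goes negative, and allocating a buffer with it throws exactly this 
exception. A standalone sketch (not the actual RPC-layer code):

```java
public class ProtoSizeOverflow {
    // A response size computed in int arithmetic silently overflows once the
    // true size exceeds Integer.MAX_VALUE (2 GiB - 1 byte).
    static int serializedSize(long responseBytes) {
        return (int) responseBytes;
    }

    public static void main(String[] args) {
        long twoAndHalfGiB = 5L * 1024 * 1024 * 1024 / 2;
        int size = serializedSize(twoAndHalfGiB);
        System.out.println("computed size: " + size);  // negative
        try {
            byte[] buffer = new byte[size];  // allocation with a negative length
        } catch (NegativeArraySizeException e) {
            System.out.println("NegativeArraySizeException, as in the report");
        }
    }
}
```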






[jira] [Created] (HDFS-13363) Record file path when FSDirAclOp throws AclException

2018-03-28 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13363:
--

 Summary: Record file path when FSDirAclOp throws AclException
 Key: HDFS-13363
 URL: https://issues.apache.org/jira/browse/HDFS-13363
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Wei-Chiu Chuang


When an AclTransformation method throws an AclException, it does not record the 
file path that triggered the exception. These AclTransformation methods are 
invoked from FSDirAclOp methods, which do know the file path. As a result, even 
when an exception is thrown, we never learn which file has the invalid ACLs.

These FSDirAclOp methods could catch the AclException and add the file path to 
the error message.
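A standalone sketch of the proposed catch-and-rethrow; the types here are 
simplified stand-ins, not the real FSDirAclOp/AclTransformation classes:

```java
public class AclPathExample {
    static class AclException extends Exception {
        AclException(String msg) { super(msg); }
        AclException(String msg, Throwable cause) { super(msg, cause); }
    }

    // Stand-in for an AclTransformation method, which knows nothing about paths.
    static void transformAcl() throws AclException {
        throw new AclException(
                "Invalid ACL: only directories may have a default ACL.");
    }

    // Stand-in for an FSDirAclOp method, which does know the path: catch the
    // exception and rethrow it with the file path prepended.
    static void setAcl(String src) throws AclException {
        try {
            transformAcl();
        } catch (AclException e) {
            throw new AclException("Path " + src + ": " + e.getMessage(), e);
        }
    }

    public static void main(String[] args) {
        try {
            setAcl("/user/hive/warehouse/t1");
        } catch (AclException e) {
            System.out.println(e.getMessage());
        }
    }
}
```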






[jira] [Created] (HDFS-13357) Improve AclException message "Invalid ACL: only directories may have a default ACL."

2018-03-27 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13357:
--

 Summary: Improve AclException message "Invalid ACL: only 
directories may have a default ACL."
 Key: HDFS-13357
 URL: https://issues.apache.org/jira/browse/HDFS-13357
 Project: Hadoop HDFS
  Issue Type: Improvement
 Environment: CDH 5.10.1, Kerberos, KMS, encryption at rest, Sentry, 
Hive
Reporter: Wei-Chiu Chuang


I found this warning message in an HDFS cluster:
{noformat}
2018-03-27 19:15:28,841 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
90 on 8020, call org.apache.hadoop.hdfs.protocol.ClientProtocol.setAcl from 
10.0.0.1:39508 Call#79376996
Retry#0: org.apache.hadoop.hdfs.protocol.AclException: Invalid ACL: only 
directories may have a default ACL.
2018-03-27 19:15:28,841 WARN org.apache.hadoop.security.UserGroupInformation: 
PriviledgedActionException as:hive/host1.example@example.com (auth:KERBE
ROS) cause:org.apache.hadoop.hdfs.protocol.AclException: Invalid ACL: only 
directories may have a default ACL.
{noformat}
However, it doesn't tell me which file had the invalid ACL.

This cluster has Sentry enabled, so it is possible the invalid ACL doesn't 
come from HDFS but from Sentry.

Filing this Jira to improve the message by adding the file name to it.






Re: [VOTE] Adopt HDSL as a new Hadoop subproject

2018-03-23 Thread Wei-Chiu Chuang
+1 (binding)

Happy to see the community converge on a proposal.

On Fri, Mar 23, 2018 at 11:18 AM, Andrew Wang 
wrote:

> +1
>
> If this VOTE is to gather consensus about establishing a new subproject,
> let's definitely proceed with that.
>
> It sounds like we're already discussing changes to the details of how the
> project will be run, and releasing from the branch vs. maven profile is not
> a blocker for me. I raised it since I thought it would reduce the amount of
> additional infra/build work, but it's fine if the preference is to just do
> the work. Sorry if my earlier reply sounded like bikeshedding.
>
> Cheers,
> Andrew
>
> On Fri, Mar 23, 2018 at 10:00 AM, Brahma Reddy Battula 
> wrote:
>
> > +1 ( binding)
> >
> >
> >
> > On Tue, Mar 20, 2018 at 11:50 PM, Owen O'Malley 
> > wrote:
> >
> > > All,
> > >
> > > Following our discussions on the previous thread (Merging branch
> > HDFS-7240
> > > to trunk), I'd like to propose the following:
> > >
> > > * HDSL become a subproject of Hadoop.
> > > * HDSL will release separately from Hadoop. Hadoop releases will not
> > > contain HDSL and vice versa.
> > > * HDSL will get its own jira instance so that the release tags stay
> > > separate.
> > > * On trunk (as opposed to release branches) HDSL will be a separate
> > module
> > > in Hadoop's source tree. This will enable the HDSL to work on their
> trunk
> > > and the Hadoop trunk without making releases for every change.
> > > * Hadoop's trunk will only build HDSL if a non-default profile is
> > enabled.
> > > * When Hadoop creates a release branch, the RM will delete the HDSL
> > module
> > > from the branch.
> > > * HDSL will have their own Yetus checks and won't cause failures in the
> > > Hadoop patch check.
> > >
> > > I think this accomplishes most of the goals of encouraging HDSL
> > development
> > > while minimizing the potential for disruption of HDFS development.
> > >
> > > The vote will run the standard 7 days and requires a lazy 2/3 vote. PMC
> > > votes are binding, but everyone is encouraged to vote.
> > >
> > > +1 (binding)
> > >
> > > .. Owen
> > >
> >
> >
> >
> > --
> >
> >
> >
> > --Brahma Reddy Battula
> >
>



-- 
A very happy Hadoop contributor


[jira] [Created] (HDFS-13330) Clean up dead code

2018-03-22 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13330:
--

 Summary: Clean up dead code
 Key: HDFS-13330
 URL: https://issues.apache.org/jira/browse/HDFS-13330
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Wei-Chiu Chuang


The following do..while(false) loop seems useless to me.
{code:java}
ShortCircuitReplicaInfo info = null;
do {
  if (closed) {
LOG.trace("{}: can't fethchOrCreate {} because the cache is closed.",
this, key);
return null;
  }
  Waitable<ShortCircuitReplicaInfo> waitable = replicaInfoMap.get(key);
  if (waitable != null) {
try {
  info = fetch(key, waitable);
} catch (RetriableException e) {
  LOG.debug("{}: retrying {}", this, e.getMessage());
}
  }
} while (false);{code}






[jira] [Resolved] (HDFS-10992) file is under construction but no leases found

2018-03-05 Thread Wei-Chiu Chuang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10992?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-10992.

Resolution: Duplicate

> file is under construction but no leases found
> --
>
> Key: HDFS-10992
> URL: https://issues.apache.org/jira/browse/HDFS-10992
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.7.1
> Environment: hortonworks 2.3 build 2557. 10 Datanodes , 2 NameNode in 
> auto failover
>Reporter: Chernishev Aleksandr
>Priority: Major
>
> On HDFS, after writing a fairly small number of files (at least 1000) of size 
> 150 MB - 1.6 GB, 13 damaged files with an incomplete last block were found.
> hadoop fsck /hadoop/files/load_tarifer-zf-4_20160902165521521.csv 
> -openforwrite -files -blocks -locations
> DEPRECATED: Use of this script to execute hdfs command is deprecated.
> Instead use the hdfs command for it.
> Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8
> Connecting to namenode via 
> http://hadoop-hdfs:50070/fsck?ugi=hdfs=1=1=1=1=%2Fstaging%2Flanding%2Fstream%2Fitc_dwh%2Ffiles%2Fload_tarifer-zf-4_20160902165521521.csv
> FSCK started by hdfs (auth:SIMPLE) from /10.0.0.178 for path 
> /hadoop/files/load_tarifer-zf-4_20160902165521521.csv at Mon Oct 10 17:12:25 
> MSK 2016
> /hadoop/files/load_tarifer-zf-4_20160902165521521.csv 920596121 bytes, 7 
> block(s), OPENFORWRITE:  MISSING 1 blocks of total size 115289753 B
> 0. BP-1552885336-10.0.0.178-1446159880991:blk_1084952841_17798971 
> len=134217728 repl=4 
> [DatanodeInfoWithStorage[10.0.0.188:50010,DS-9ba44a76-113a-43ac-87dc-46aa97ba3267,DISK],
>  
> DatanodeInfoWithStorage[10.0.0.183:50010,DS-eccd375a-ea32-491b-a4a3-5ea3faca4171,DISK],
>  
> DatanodeInfoWithStorage[10.0.0.184:50010,DS-ec462491-6766-490a-a92f-38e9bb3be5ce,DISK],
>  
> DatanodeInfoWithStorage[10.0.0.182:50010,DS-cef46399-bb70-4f1a-ac55-d71c7e820c29,DISK]]
> 1. BP-1552885336-10.0.0.178-1446159880991:blk_1084952850_17799207 
> len=134217728 repl=3 
> [DatanodeInfoWithStorage[10.0.0.184:50010,DS-412769e0-0ec2-48d3-b644-b08a516b1c2c,DISK],
>  
> DatanodeInfoWithStorage[10.0.0.181:50010,DS-97388b2f-c542-417d-ab06-c8d81b94fa9d,DISK],
>  
> DatanodeInfoWithStorage[10.0.0.187:50010,DS-e7a11951-4315-4425-a88b-a9f6429cc058,DISK]]
> 2. BP-1552885336-10.0.0.178-1446159880991:blk_1084952857_17799489 
> len=134217728 repl=3 
> [DatanodeInfoWithStorage[10.0.0.184:50010,DS-7a08c597-b0f4-46eb-9916-f028efac66d7,DISK],
>  
> DatanodeInfoWithStorage[10.0.0.180:50010,DS-fa6a4630-1626-43d8-9988-955a86ac3736,DISK],
>  
> DatanodeInfoWithStorage[10.0.0.182:50010,DS-8670e77d-c4db-4323-bb01-e0e64bd5b78e,DISK]]
> 3. BP-1552885336-10.0.0.178-1446159880991:blk_1084952866_17799725 
> len=134217728 repl=3 
> [DatanodeInfoWithStorage[10.0.0.185:50010,DS-b5ff8ba0-275e-4846-b5a4-deda35aa0ad8,DISK],
>  
> DatanodeInfoWithStorage[10.0.0.180:50010,DS-9cb6cade-9395-4f3a-ab7b-7fabd400b7f2,DISK],
>  
> DatanodeInfoWithStorage[10.0.0.183:50010,DS-e277dcf3-1bce-4efd-a668-cd6fb2e10588,DISK]]
> 4. BP-1552885336-10.0.0.178-1446159880991:blk_1084952872_17799891 
> len=134217728 repl=4 
> [DatanodeInfoWithStorage[10.0.0.184:50010,DS-e1d8f278-1a22-4294-ac7e-e12d554aef7f,DISK],
>  
> DatanodeInfoWithStorage[10.0.0.186:50010,DS-5d9aeb2b-e677-41cd-844e-4b36b3c84092,DISK],
>  
> DatanodeInfoWithStorage[10.0.0.183:50010,DS-eccd375a-ea32-491b-a4a3-5ea3faca4171,DISK],
>  
> DatanodeInfoWithStorage[10.0.0.182:50010,DS-8670e77d-c4db-4323-bb01-e0e64bd5b78e,DISK]]
> 5. BP-1552885336-10.0.0.178-1446159880991:blk_1084952880_17800120 
> len=134217728 repl=3 
> [DatanodeInfoWithStorage[10.0.0.181:50010,DS-79185b75-1938-4c91-a6d0-bb6687ca7e56,DISK],
>  
> DatanodeInfoWithStorage[10.0.0.184:50010,DS-dcbd20aa-0334-49e0-b807-d2489f5923c6,DISK],
>  
> DatanodeInfoWithStorage[10.0.0.183:50010,DS-f1d77328-f3af-483e-82e9-66ab0723a52c,DISK]]
> 6. 
> BP-1552885336-10.0.0.178-1446159880991:blk_1084952887_17800316{UCState=COMMITTED,
>  truncateBlock=null, primaryNodeIndex=-1, 
> replicas=[ReplicaUC[[DISK]DS-5f3eac72-eb55-4df7-bcaa-a6fa35c166a0:NORMAL:10.0.0.188:50010|RBW],
>  
> ReplicaUC[[DISK]DS-a2a0d8f0-772e-419f-b4ff-10b4966c57ca:NORMAL:10.0.0.184:50010|RBW],
>  
> ReplicaUC[[DISK]DS-52984aa0-598e-4fff-acfa-8904ca7b585c:NORMAL:10.0.0.185:50010|RBW]]}
>  len=115289753 MISSING!
> Status: CORRUPT
>  Total size:  920596121 B
>  Total dirs:  0
>  Total files: 1
>  Total symlinks:  0
>  Total blocks (validated):7 (avg. block size 131513731 B)
>   
>   UNDER MIN REPL'D BLOCKS:1

[jira] [Created] (HDFS-13103) HDFS Client write acknowledgement timeout should not depend on read timeout

2018-02-02 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13103:
--

 Summary: HDFS Client write acknowledgement timeout should not 
depend on read timeout
 Key: HDFS-13103
 URL: https://issues.apache.org/jira/browse/HDFS-13103
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode, hdfs-client
Affects Versions: 3.0.0-alpha1, 2.8.0
 Environment: CDH5.7.0 and above. HBase Region Server.
Reporter: Wei-Chiu Chuang


HDFS-8311 added a timeout for client write acknowledgement for both
 # transferring blocks
 # writing to a DataNode.

The timeout shares the same configuration as client read timeout 
(dfs.client.socket-timeout).

While I agree having a timeout is good, it does not make sense for the write 
acknowledgement timeout to depend on the read timeout. We saw a case where a 
cluster admin wanted to reduce the HBase RegionServer read timeout so as to 
detect DataNode crashes quickly, but did not realize it also affects the write 
acknowledgement timeout.

In the end, the effective DataNode write timeout is shorter than the effective 
client write acknowledgement timeout. If the last two DataNodes in the write 
pipeline crash, the client would think the first DataNode is faulty (the DN 
appears unresponsive because it is still waiting for the ack from downstream 
DNs), dropping it, and then the HBase RS would crash because it is unable to 
write to any good DataNode. This scenario is possible during a rack failure.

This problem is even worse for Cloudera Manager-managed cluster. By default, 
CM-managed HBase RegionServer sets 
{{dfs.client.block.write.replace-datanode-on-failure.enable = true}}. Even one 
unresponsive DataNode could crash HBase RegionServer.

I am raising this Jira to discuss two possible solutions:
 # add a new config for write acknowledgement timeout. Do not depend on read 
timeout
 # or, update the description of dfs.client.socket-timeout in core-default.xml 
so that admin is aware write acknowledgement timeout depends on this 
configuration.
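Until either solution lands, the coupling lives in a single shared key. A hypothetical hdfs-site.xml fragment (the key name is the real one discussed above; the value is only an illustrative example) showing where both timeouts are tuned today:

```xml
<configuration>
  <property>
    <!-- Milliseconds. Currently shared by the client read timeout AND the
         write acknowledgement timeout: lowering it to detect DataNode
         crashes faster also shortens the write-ack timeout, the coupling
         this JIRA proposes to break. -->
    <name>dfs.client.socket-timeout</name>
    <value>60000</value>
  </property>
</configuration>
```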






[jira] [Created] (HDFS-13040) Kerberized inotify client fails despite kinit properly

2018-01-19 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-13040:
--

 Summary: Kerberized inotify client fails despite kinit properly
 Key: HDFS-13040
 URL: https://issues.apache.org/jira/browse/HDFS-13040
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 2.6.0
 Environment: Kerberized, HA cluster, iNotify client, CDH5.10.2
Reporter: Wei-Chiu Chuang
Assignee: Wei-Chiu Chuang


This issue is similar to HDFS-10799.

HDFS-10799 turned out to be a client-side issue where the client is responsible 
for actively renewing its Kerberos ticket.

However, we found that in a slightly different setup, even if the client has 
valid Kerberos credentials, inotify still fails.

Suppose the client uses principal h...@example.com, 
namenode 1 uses server principal hdfs/nn1.example@example.com, and 
namenode 2 uses server principal hdfs/nn2.example@example.com.

*After the NameNodes have been running for longer than the Kerberos ticket 
lifetime*, the client fails with the following error:

{noformat}
18/01/19 11:23:02 WARN security.UserGroupInformation: 
PriviledgedActionException as:h...@gce.cloudera.com (auth:KERBEROS) 
cause:org.apache.hadoop.ipc.RemoteException(java.io.IOException): We 
encountered an error reading 
https://nn2.example.com:8481/getJournal?jid=ns1&segmentTxId=8662&storageInfo=-60%3A353531113%3A0%3Acluster3,
 
https://nn1.example.com:8481/getJournal?jid=ns1&segmentTxId=8662&storageInfo=-60%3A353531113%3A0%3Acluster3.
  During automatic edit log failover, we noticed that all of the remaining edit 
log streams are shorter than the current one!  The best remaining edit log ends 
at transaction 8683, but we thought we could read up to transaction 8684.  If 
you continue, metadata will be lost forever!
at 
org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:213)
at 
org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.readOp(NameNodeRpcServer.java:1701)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getEditsFromTxid(NameNodeRpcServer.java:1763)
at 
org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getEditsFromTxid(AuthorizationProviderProxyClientProtocol.java:1011)
at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getEditsFromTxid(ClientNamenodeProtocolServerSideTranslatorPB.java:1490)
at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2216)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2212)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2210)
{noformat}

Typically, if the NameNode has an expired Kerberos ticket, the error handling 
for ordinary edit log tailing lets the NameNode re-login with its own Kerberos 
principal. However, when inotify uses the same code path to retrieve edits, the 
current user is the inotify client's principal, so unless the client uses the 
same principal as the NameNode, the NameNode can't re-login on behalf of the 
client.

Therefore, a more appropriate approach is to use a proxy user so that the 
NameNode can retrieve edits on behalf of the client.

I will attach a patch to fix it. This patch has been verified to work on a 
CDH5.10.2 cluster; however, it seems impossible to craft a unit test for this 
fix because of the way Hadoop UGI is handled (I can't have a single process 
that logs in as two Kerberos principals simultaneously and lets them establish 
a connection).






Re: [VOTE] Release Apache Hadoop 3.0.0 RC1

2017-12-12 Thread Wei-Chiu Chuang
Hi Andrew, thanks for the tremendous effort.
I found an empty "patchprocess" directory in the source tarball that is
not there if you clone from GitHub. Any chance you might have some leftover
trash when you made the tarball?
Not to nitpick, but you might want to double-check so we don't
ship anything private to you in public :)



On Tue, Dec 12, 2017 at 7:48 AM, Ajay Kumar 
wrote:

> +1 (non-binding)
> Thanks for driving this, Andrew Wang!!
>
> - downloaded the src tarball and verified md5 checksum
> - built from source with jdk 1.8.0_111-b14
> - brought up a pseudo distributed cluster
> - did basic file system operations (mkdir, list, put, cat) and
> confirmed that everything was working
> - Run word count, pi and DFSIOTest
> - run hdfs and yarn, confirmed that the NN, RM web UI worked
>
> Cheers,
> Ajay
>
> On 12/11/17, 9:35 PM, "Xiao Chen"  wrote:
>
> +1 (binding)
>
> - downloaded src tarball, verified md5
> - built from source with jdk1.8.0_112
> - started a pseudo cluster with hdfs and kms
> - sanity checked encryption related operations working
> - sanity checked webui and logs.
>
> -Xiao
>
> On Mon, Dec 11, 2017 at 6:10 PM, Aaron T. Myers 
> wrote:
>
> > +1 (binding)
> >
> > - downloaded the src tarball and built the source (-Pdist -Pnative)
> > - verified the checksum
> > - brought up a secure pseudo distributed cluster
> > - did some basic file system operations (mkdir, list, put, cat) and
> > confirmed that everything was working
> > - confirmed that the web UI worked
> >
> > Best,
> > Aaron
> >
> > On Fri, Dec 8, 2017 at 12:31 PM, Andrew Wang <
> andrew.w...@cloudera.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > Let me start, as always, by thanking the efforts of all the
> contributors
> > > who contributed to this release, especially those who jumped on the
> > issues
> > > found in RC0.
> > >
> > > I've prepared RC1 for Apache Hadoop 3.0.0. This release
> incorporates 302
> > > fixed JIRAs since the previous 3.0.0-beta1 release.
> > >
> > > You can find the artifacts here:
> > >
> > > http://home.apache.org/~wang/3.0.0-RC1/
> > >
> > > I've done the traditional testing of building from the source
> tarball and
> > > running a Pi job on a single node cluster. I also verified that the
> > shaded
> > > jars are not empty.
> > >
> > > Found one issue that create-release (probably due to the mvn deploy
> > change)
> > > didn't sign the artifacts, but I fixed that by calling mvn one
> more time.
> > > Available here:
> > >
> > > https://repository.apache.org/content/repositories/orgapache
> hadoop-1075/
> > >
> > > This release will run the standard 5 days, closing on Dec 13th at
> 12:31pm
> > > Pacific. My +1 to start.
> > >
> > > Best,
> > > Andrew
> > >
> >
>
>
>
>
>
>
>
> -
> To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: common-dev-h...@hadoop.apache.org
>


[jira] [Created] (HDFS-12915) Fix findbugs warning in INodeFile$HeaderFormat.getBlockLayoutRedundancy

2017-12-11 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-12915:
--

 Summary: Fix findbugs warning in 
INodeFile$HeaderFormat.getBlockLayoutRedundancy
 Key: HDFS-12915
 URL: https://issues.apache.org/jira/browse/HDFS-12915
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.0.0
Reporter: Wei-Chiu Chuang


It seems HDFS-12840 creates a new findbugs warning.

Possible null pointer dereference of replication in 
org.apache.hadoop.hdfs.server.namenode.INodeFile$HeaderFormat.getBlockLayoutRedundancy(BlockType,
 Short, Byte)
Bug type NP_NULL_ON_SOME_PATH (click for details) 
In class org.apache.hadoop.hdfs.server.namenode.INodeFile$HeaderFormat
In method 
org.apache.hadoop.hdfs.server.namenode.INodeFile$HeaderFormat.getBlockLayoutRedundancy(BlockType,
 Short, Byte)
Value loaded from replication
Dereferenced at INodeFile.java:[line 210]
Known null at INodeFile.java:[line 207]

From a quick look at the patch, it seems bogus though. [~eddyxu] [~Sammi] would 
you please double check?
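For context, NP_NULL_ON_SOME_PATH typically fires when a possibly-null boxed value is auto-unboxed. A self-contained sketch (hypothetical, not the actual INodeFile code) of the pattern findbugs is complaining about:

```java
public class UnboxNullDemo {
    // Auto-unboxing a null Short throws NullPointerException at runtime;
    // findbugs flags the dereference as NP_NULL_ON_SOME_PATH when at least
    // one code path can reach it with null.
    static long header(Short replication) {
        return replication; // implicit replication.shortValue() -- NPE if null
    }

    public static void main(String[] args) {
        System.out.println(header((short) 3)); // fine: value unboxes cleanly
        try {
            header(null);
            System.out.println("no exception");
        } catch (NullPointerException e) {
            System.out.println("NPE on null unboxing");
        }
    }
}
```

Whether the warning is bogus here depends on whether any caller can actually pass null on the flagged path, which is exactly what the reporter asks to be double-checked.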






Re: [VOTE] Release Apache Hadoop 3.0.0 RC0

2017-11-20 Thread Wei-Chiu Chuang
@vinod
I followed your command but I could not reproduce your problem.

[weichiu@storage-1 hadoop-3.0.0-src]$ ls -al hadoop-common-project/hadoop-c
ommon/target/hadoop-common-3.0.0.tar.gz
-rw-rw-r-- 1 weichiu weichiu 37052439 Nov 20 21:59
hadoop-common-project/hadoop-common/target/hadoop-common-3.0.0.tar.gz
[weichiu@storage-1 hadoop-3.0.0-src]$ ls -al hadoop-hdfs-project/hadoop-hdf
s/target/hadoop-hdfs-3.0.0.tar.gz
-rw-rw-r-- 1 weichiu weichiu 29044067 Nov 20 22:00
hadoop-hdfs-project/hadoop-hdfs/target/hadoop-hdfs-3.0.0.tar.gz

During compilation I found the following error with a Java 1.8.0_5 JDK:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven
-compiler-plugin:3.1:testCompile (default-testCompile) on project
hadoop-aws: Compilation failure: Compilation failure:
[ERROR] /home/weichiu/hadoop-3.0.0-src/hadoop-tools/hadoop-aws/src/
test/java/org/apache/hadoop/fs/s3a/ITestS3AEncryptionAlgorithmValidation.java:[45,5]
reference to intercept is ambiguous
[ERROR]   both method intercept(java.lang.Class,java.lang.String,org.apache.hadoop.test.LambdaTestUtils.VoidCallable) in
org.apache.hadoop.test.LambdaTestUtils and method
intercept(java.lang.Class,java.lang.String,java.util.concurrent.Callable)
in org.apache.hadoop.test.LambdaTestUtils match
[ERROR] /home/weichiu/hadoop-3.0.0-src/hadoop-tools/hadoop-aws/src/
test/java/org/apache/hadoop/fs/s3a/ITestS3AEncryptionAlgorithmValidation.java:[69,5]
reference to intercept is ambiguous
[ERROR]   both method intercept(java.lang.Class,java.lang.String,org.apache.hadoop.test.LambdaTestUtils.VoidCallable) in
org.apache.hadoop.test.LambdaTestUtils and method
intercept(java.lang.Class,java.lang.String,java.util.concurrent.Callable)
in org.apache.hadoop.test.LambdaTestUtils match
[ERROR] /home/weichiu/hadoop-3.0.0-src/hadoop-tools/hadoop-aws/src/
test/java/org/apache/hadoop/fs/s3a/ITestS3AEncryptionAlgorithmValidation.java:[94,5]
reference to intercept is ambiguous
[ERROR]   both method intercept(java.lang.Class,java.lang.String,org.apache.hadoop.test.LambdaTestUtils.VoidCallable) in
org.apache.hadoop.test.LambdaTestUtils and method
intercept(java.lang.Class,java.lang.String,java.util.concurrent.Callable)
in org.apache.hadoop.test.LambdaTestUtils match
[ERROR] /home/weichiu/hadoop-3.0.0-src/hadoop-tools/hadoop-aws/src/
test/java/org/apache/hadoop/fs/s3a/ITestS3AEncryptionAlgorithmValidation.java:[120,5]
reference to intercept is ambiguous
[ERROR]   both method intercept(java.lang.Class,java.lang.String,org.apache.hadoop.test.LambdaTestUtils.VoidCallable) in
org.apache.hadoop.test.LambdaTestUtils and method
intercept(java.lang.Class,java.lang.String,java.util.concurrent.Callable)
in org.apache.hadoop.test.LambdaTestUtils match

And then I realized Ray filed HADOOP-14900 for the same
issue. This problem is not reproducible with a more recent JDK 8, such as
1.8.0_151.
Maybe it would be a good idea to name a list of JDKs that are known to be
buggy. Can we get this documented somewhere? I don't consider it a blocker,
so a release note in a later release or a wiki entry should be good enough.

On Mon, Nov 20, 2017 at 12:58 PM, Vinod Kumar Vavilapalli <
vino...@apache.org> wrote:

> Quick question.
>
> I used to be able (in 2.x line) to create dist tarballs (mvn clean install
> -Pdist -Dtar -DskipTests -Dmaven.javadoc.skip=true) from the source being
> voted on (hadoop-3.0.0-src.tar.gz).
>
> The idea is to install HDFS, YARN, MR separately in separate
> root-directories from the generated individual dist tarballs.
>
> But now I see that HDFS and common dist tarballs are empty
> -rw-r--r--  1 vinodkv  staff 45 Nov 20 12:39
> ./hadoop-common-project/hadoop-common/target/hadoop-common-3.0.0.tar.gz -
> -rw-r--r--  1 vinodkv  staff 45 Nov 20 12:40
> ./hadoop-hdfs-project/hadoop-hdfs/target/hadoop-hdfs-3.0.0.tar.gz
>
> But YARN and MR are fine
> -rw-r--r--  1 vinodkv  staff   64474187 Nov 20 12:41
> ./hadoop-yarn-project/target/hadoop-yarn-project-3.0.0.tar.gz
> -rw-r--r--  1 vinodkv  staff   21674457 Nov 20 12:41
> ./hadoop-mapreduce-project/target/hadoop-mapreduce-3.0.0.tar.gz
>
> Is it just me? Or is this broken?
>
> Thanks
> +Vinod
>
> > On Nov 14, 2017, at 1:34 PM, Andrew Wang 
> wrote:
> >
> > Hi folks,
> >
> > Thanks as always to the many, many contributors who helped with this
> > release. I've created RC0 for Apache Hadoop 3.0.0. The artifacts are
> > available here:
> >
> > http://people.apache.org/~wang/3.0.0-RC0/
> >
> > This vote will run 5 days, ending on Nov 19th at 1:30pm Pacific.
> >
> > 3.0.0 GA contains 291 fixed JIRA issues since 3.0.0-beta1. Notable
> > additions include the merge of YARN resource types, API-based
> configuration
> > of the CapacityScheduler, and HDFS router-based federation.
> >
> > I've done my traditional testing with a pseudo cluster and a Pi job. My
> +1
> > to start.
> >
> > Best,
> > Andrew
>
>
> 

Re: [DISCUSS] Apache Hadoop 2.7.5 Release Plan

2017-11-17 Thread Wei-Chiu Chuang
Hi Konstantin,
Thanks for initiating the release effort.

I am marking HDFS-12641 as
a blocker for Hadoop 2.7.5, because during our internal testing for CDH we
found a regression in HDFS-11445 that was fixed by HDFS-11755 (technically
not a real regression, since HDFS-11755 was committed before HDFS-11445).
The regression results in bogus corrupt block reports. It is not clear to
me whether the same behavior exists in Apache Hadoop, but since the latter
(HDFS-11755) is currently only in Hadoop 2.8.x and above, I would rather be
cautious about it.

On Thu, Nov 16, 2017 at 5:20 PM, Konstantin Shvachko 
wrote:

> Hi developers,
>
> We have accumulated about 30 commits on branch-2.7. Those are mostly
> valuable bug fixes, minor optimizations and test corrections. I would like
> to propose to make a quick maintenance release 2.7.5.
>
> If there are no objections I'll start preparations.
>
> Thanks,
> --Konstantin
>



-- 
A very happy Clouderan


[jira] [Resolved] (HDFS-12820) Decommissioned datanode is counted in service cause datanode allcating failure

2017-11-17 Thread Wei-Chiu Chuang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-12820.

Resolution: Duplicate

Thanks for reporting the issue, [~xiegang112].
Hadoop 2.4.0 is an old release and no longer supported. The issue reported in 
this jira is fixed by HDFS-9279.

I am going to resolve this jira as a dup of HDFS-9279. Please reopen if this is 
not the case.

> Decommissioned datanode is counted in service cause datanode allcating failure
> --
>
> Key: HDFS-12820
> URL: https://issues.apache.org/jira/browse/HDFS-12820
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: block placement
>Affects Versions: 2.4.0
>Reporter: Gang Xie
>
> When allocating a datanode for a DFSClient write with considerLoad enabled, 
> the NameNode checks whether a datanode is overloaded by calculating the 
> average xceiver count across all in-service datanodes. But if a datanode is 
> decommissioned and becomes dead, it is still treated as in service, which 
> makes the average load much higher than the real one, especially when the 
> number of decommissioned datanodes is large. In our cluster of 180 datanodes, 
> 100 of them decommissioned, the average load is 17. This failed all datanode 
> allocations. 
> private void subtract(final DatanodeDescriptor node) {
>   capacityUsed -= node.getDfsUsed();
>   blockPoolUsed -= node.getBlockPoolUsed();
>   xceiverCount -= node.getXceiverCount();
> {color:red}  if (!(node.isDecommissionInProgress() || 
> node.isDecommissioned())) {{color}
> nodesInService--;
> nodesInServiceXceiverCount -= node.getXceiverCount();
> capacityTotal -= node.getCapacity();
> capacityRemaining -= node.getRemaining();
>   } else {
> capacityTotal -= node.getDfsUsed();
>   }
>   cacheCapacity -= node.getCacheCapacity();
>   cacheUsed -= node.getCacheUsed();
> }
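To illustrate why stale "in service" statistics matter: with dfs.namenode.replication.considerLoad enabled, block placement rejects a candidate whose xceiver count exceeds roughly twice the in-service average. A toy sketch (purely illustrative numbers, not Hadoop code; which direction the skew takes depends on which counters go stale) showing how mis-counting decommissioned dead nodes can flip the decision:

```java
public class ConsiderLoadDemo {
    // A candidate is rejected when its load exceeds 2x the in-service average.
    static boolean rejected(int nodeXceivers, int inServiceXceiverTotal,
                            int nodesInService) {
        double avg = (double) inServiceXceiverTotal / nodesInService;
        return nodeXceivers > 2.0 * avg;
    }

    public static void main(String[] args) {
        // Consistent stats: 80 live nodes, 800 xceivers total, avg 10.
        // A node carrying 15 xceivers is under the 2x threshold: accepted.
        System.out.println(rejected(15, 800, 80));   // false

        // Inconsistent stats: 100 dead decommissioned nodes still counted
        // "in service" drag the average down to 800/180 ~= 4.4, so the same
        // node now looks overloaded and every allocation can fail.
        System.out.println(rejected(15, 800, 180));  // true
    }
}
```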






[jira] [Created] (HDFS-12737) Thousands of sockets lingering in TIME_WAIT state due to frequent file open operations

2017-10-27 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-12737:
--

 Summary: Thousands of sockets lingering in TIME_WAIT state due to 
frequent file open operations
 Key: HDFS-12737
 URL: https://issues.apache.org/jira/browse/HDFS-12737
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: ipc
 Environment: CDH5.10.2, HBase Multi-WAL=2, 250 replication peers
Reporter: Wei-Chiu Chuang
Assignee: Wei-Chiu Chuang


On a HBase cluster we found HBase RegionServers have thousands of sockets in 
TIME_WAIT state. It depleted system resources and caused other services to fail.

After months of troubleshooting, we found the cause: the cluster has hundreds 
of replication peers and multi-WAL = 2. That creates hundreds of replication 
threads in the HBase RS, and each thread opens the WAL file *every second*.

We found that the IPC client closes socket right away, and does not reuse 
socket connection. Since each closed socket stays in TIME_WAIT state for 60 
seconds in Linux by default, that generates thousands of TIME_WAIT sockets.

{code:title=ClientDatanodeProtocolTranslatorPB:createClientDatanodeProtocolProxy}
// Since we're creating a new UserGroupInformation here, we know that no
// future RPC proxies will be able to re-use the same connection. And
// usages of this proxy tend to be one-off calls.
//
// This is a temporary fix: callers should really achieve this by using
// RPC.stopProxy() on the resulting object, but this is currently not
// working in trunk. See the discussion on HDFS-1965.
Configuration confWithNoIpcIdle = new Configuration(conf);
confWithNoIpcIdle.setInt(CommonConfigurationKeysPublic
.IPC_CLIENT_CONNECTION_MAXIDLETIME_KEY, 0);
{code}
Unfortunately, given HBase's usage pattern, this hack creates the problem.

Ignoring the fact that having hundreds of HBase replication peers is a bad 
practice (I'll probably file an HBASE jira to fix that), the fact that the 
Hadoop IPC client does not reuse sockets seems wrong. The relevant code is 
historical and deep in the stack, so I'd like to invite comments. I have a 
patch, but it's pretty hacky.
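A quick way to confirm this symptom on an affected host (commands assume Linux with iproute2's `ss`; the port below is the default DataNode transfer port and may differ in your deployment):

```shell
# Total sockets currently lingering in TIME_WAIT on this host.
ss -tan state time-wait | wc -l

# Only those headed to the default DataNode transfer port (50010), to
# attribute the leak to HDFS traffic specifically.
ss -tan state time-wait '( dport = :50010 )' | wc -l
```

On a healthy client the first count is small and stable; on an affected RegionServer it climbs into the thousands and churns every minute as 60-second TIME_WAIT entries expire and are replaced.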






[jira] [Resolved] (HDFS-12676) when blocks has corrupted replicas,throws Exception

2017-10-18 Thread Wei-Chiu Chuang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-12676.

Resolution: Duplicate

Resolving it as a dup. Thanks @lynn for reporting it.

> when blocks has corrupted replicas,throws Exception
> ---
>
> Key: HDFS-12676
> URL: https://issues.apache.org/jira/browse/HDFS-12676
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Reporter: lynn
>
> When a block has corrupt replicas, HDFS throws exceptions as follows:
> Exception 1:
> 2017-10-18 15:24:55,858 WARN  blockmanagement.BlockManager 
> (BlockManager.java:createLocatedBlock(938)) - Inconsistent number of corrupt 
> replicas for blk_1073750384_504374 blockMap has 0 but corrupt replicas map 
> has 1
> 2017-10-18 15:24:55,859 WARN  ipc.Server (Server.java:logException(2433)) - 
> IPC Server handler 116 on 8020, call 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 
> 10.43.160.18:56313 Call#2 Retry#-1
> java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:972)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:911)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlockList(BlockManager.java:884)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlocks(BlockManager.java:1011)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2010)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1960)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1873)
>   at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:693)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:373)
>   at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
>   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1865)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2345)
> Exception 2:
> 2017-10-12 16:59:36,591 INFO  blockmanagement.BlockManager 
> (BlockManager.java:computeReplicationWorkForBlocks(1649)) - Blocks chosen but 
> could not be replicated = 4; of which 0 have no target, 4 have no source, 0 
> are UC, 0 are abandoned, 0 already have enough replicas.
> 2017-10-12 16:59:36,809 WARN  blockmanagement.BlockManager 
> (BlockManager.java:createLocatedBlock(938)) - Inconsistent number of corrupt 
> replicas for blk_1073789106_2278702 blockMap has 0 but corrupt replicas map 
> has 2
> 2017-10-12 16:59:36,810 WARN  ipc.Server (Server.java:logException(2433)) - 
> IPC Server handler 123 on 8020, call 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.getBlockLocations from 
> 10.46.230.12:47974 Call#2 Retry#-1
> java.lang.NegativeArraySizeException
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:946)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlock(BlockManager.java:911)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlockList(BlockManager.java:884)
>   at 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.createLocatedBlocks(BlockManager.java:997)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2010)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1960)
>   at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1873)
>   at 
> org.apache.hadoop.hdfs.s

[jira] [Created] (HDFS-12644) Offer a non-privileged listEncryptionZone operation

2017-10-12 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-12644:
--

 Summary: Offer a non-privileged listEncryptionZone operation
 Key: HDFS-12644
 URL: https://issues.apache.org/jira/browse/HDFS-12644
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: encryption, namenode
Affects Versions: 3.0.0-alpha1, 2.8.0
Reporter: Wei-Chiu Chuang
Assignee: Wei-Chiu Chuang


As discussed in HDFS-12484, we can consider adding a non-privileged 
listEncryptionZone for better user experience.






[jira] [Resolved] (HDFS-11797) BlockManager#createLocatedBlocks() can throw ArrayIndexOutofBoundsException when corrupt replicas are inconsistent

2017-10-12 Thread Wei-Chiu Chuang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-11797.

Resolution: Duplicate

I'm going to close it as a dup of HDFS-11445. Feel free to reopen if this is 
not the case. Thanks [~kshukla]!

> BlockManager#createLocatedBlocks() can throw ArrayIndexOutofBoundsException 
> when corrupt replicas are inconsistent
> --
>
> Key: HDFS-11797
> URL: https://issues.apache.org/jira/browse/HDFS-11797
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kuhu Shukla
>Assignee: Kuhu Shukla
>Priority: Critical
> Attachments: HDFS-11797.001.patch
>
>
> The calculation for {{numMachines}} can be too less (causing 
> ArrayIndexOutOfBoundsException) or too many (causing NPE (HDFS-9958)) if data 
> structures find inconsistent number of corrupt replicas. This was earlier 
> found related to failed storages. This JIRA tracks a change that works for 
> all possible cases of inconsistencies.
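The failure mode is easy to reproduce in isolation: the result array is sized using one data structure (the corrupt-replicas map) while the fill loop is driven by another (the block map). A hypothetical sketch (not the actual BlockManager code) producing both exceptions seen in the reports above:

```java
import java.util.Arrays;

public class NumMachinesDemo {
    // numMachines is derived from the corrupt-replicas map, but the loop
    // that fills the array trusts the block map's view of which replicas
    // are corrupt. When the two disagree, the arithmetic goes wrong.
    static String[] locate(int replicasInBlockMap, int corruptPerCorruptMap,
                           int corruptSeenWhileIterating) {
        int numMachines = replicasInBlockMap - corruptPerCorruptMap;
        String[] machines = new String[numMachines]; // NegativeArraySizeException if < 0
        int j = 0;
        for (int i = 0; i < replicasInBlockMap; i++) {
            if (i >= corruptSeenWhileIterating) {
                machines[j++] = "dn" + i; // AIOOBE if the array is too small
            }
        }
        return machines;
    }

    public static void main(String[] args) {
        // Consistent views: 3 replicas, 1 corrupt in both maps -> 2 locations.
        System.out.println(Arrays.toString(locate(3, 1, 1)));

        try {
            // "blockMap has 0 but corrupt replicas map has 1": the array is
            // one slot too small for the replicas the iteration finds.
            locate(3, 1, 0);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("ArrayIndexOutOfBoundsException");
        }

        try {
            // Corrupt map claims more corrupt replicas than exist at all.
            locate(2, 3, 3);
        } catch (NegativeArraySizeException e) {
            System.out.println("NegativeArraySizeException");
        }
    }
}
```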






[jira] [Resolved] (HDFS-12630) Rolling restart can create inconsistency between blockMap and corrupt replicas map

2017-10-12 Thread Wei-Chiu Chuang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-12630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-12630.

Resolution: Duplicate

> Rolling restart can create inconsistency between blockMap and corrupt 
> replicas map
> --
>
> Key: HDFS-12630
> URL: https://issues.apache.org/jira/browse/HDFS-12630
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.6.0
>Reporter: Andre Araujo
>
> After a NN rolling restart, several HDFS files started showing block problems. 
> Running FSCK for one of the files or for the directory that contained it 
> would complete with a FAILED message but without any details of the failure.
> The NameNode log showed the following:
> {code}
> 2017-10-10 16:58:32,147 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
> FSCK started by hdfs (auth:KERBEROS_SSL) from /10.92.128.4 for path 
> /user/prod/data/file_20171010092201.csv at Tue Oct 10 16:58:32 PDT 2017
> 2017-10-10 16:58:32,147 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager: Inconsistent 
> number of corrupt replicas for blk_1941920008_1133195379 blockMap has 1 but 
> corrupt replicas map has 2
> 2017-10-10 16:58:32,147 WARN org.apache.hadoop.hdfs.server.namenode.NameNode: 
> Fsck on path '/user/prod/data/file_20171010092201.csv' FAILED
> java.lang.ArrayIndexOutOfBoundsException
> {code}
> After triggering a full block report for all the DNs the problem went away.






[jira] [Created] (HDFS-12641) Backport HDFS-11755 into branch-2.7 to fix a regression in HDFS-11445

2017-10-11 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-12641:
--

 Summary: Backport HDFS-11755 into branch-2.7 to fix a regression 
in HDFS-11445
 Key: HDFS-12641
 URL: https://issues.apache.org/jira/browse/HDFS-12641
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 2.7.4
Reporter: Wei-Chiu Chuang
Assignee: Wei-Chiu Chuang


Our internal testing caught a regression in HDFS-11445 when we cherry-picked 
the commit into CDH. Basically, it produces bogus missing-file warnings. 
Further analysis revealed that the regression is actually fixed by HDFS-11755.

Because of the order in which the commits were merged into branch-2.8 through 
trunk (HDFS-11755 was committed before HDFS-11445), the regression never 
actually surfaced for Hadoop 2.8/3.0.0-(alpha/beta) users. Since branch-2.7 has 
HDFS-11445 but not HDFS-11755, I suspect the regression is more visible for 
Hadoop 2.7.4.

I am filing this jira more to raise awareness than to simply backport 
HDFS-11755 into branch-2.7.






[jira] [Created] (HDFS-12619) Do not catch and throw unchecked exceptions if IBRs fail to process

2017-10-09 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-12619:
--

 Summary: Do not catch and throw unchecked exceptions if IBRs fail 
to process
 Key: HDFS-12619
 URL: https://issues.apache.org/jira/browse/HDFS-12619
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 3.0.0-alpha1, 2.7.3, 2.8.0
Reporter: Wei-Chiu Chuang
Assignee: Wei-Chiu Chuang
Priority: Minor


HDFS-9198 added the following code
{code:title=BlockManager#processIncrementalBlockReport}
public void processIncrementalBlockReport(final DatanodeID nodeID,
  final StorageReceivedDeletedBlocks srdb) throws IOException {
assert namesystem.hasWriteLock();
final DatanodeDescriptor node = datanodeManager.getDatanode(nodeID);
if (node == null || !node.isRegistered()) {
  blockLog.warn("BLOCK* processIncrementalBlockReport"
  + " is received from dead or unregistered node {}", nodeID);
  throw new IOException(
  "Got incremental block report from unregistered or dead node");
}
try {
  processIncrementalBlockReport(node, srdb);
} catch (Exception ex) {
  node.setForceRegistration(true);
  throw ex;
}
  }
{code}
In Apache Hadoop 2.7.x ~ 3.0, this code snippet is accepted by the Java compiler. 
However, when I attempted to backport it to a CDH 5.3 release (based on Apache 
Hadoop 2.5.0), the compiler complains that the exception is unhandled, because 
the method declares that it throws IOException rather than Exception. (The 
rethrow most likely relies on Java 7's precise-rethrow analysis, which is 
unavailable at the older source level.)

While the code compiles for Apache Hadoop 2.7.x ~ 3.0, I feel it is not good 
practice to catch an unchecked exception and then rethrow it. How about 
rewriting it with a finally block and a conditional variable?
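
A minimal, self-contained sketch of the suggested rewrite (this is not the actual BlockManager code; the class name and the boolean stand-ins for the node/report arguments are invented for illustration). A success flag plus a finally block runs the failure path for both checked and unchecked exceptions, without catching a broad {{Exception}} and rethrowing it:

```java
import java.io.IOException;

public class IbrSketch {
    public static boolean forceRegistration = false;

    // Stand-in for the inner processIncrementalBlockReport(node, srdb) call.
    private static void doProcess(boolean fail) throws IOException {
        if (fail) {
            throw new RuntimeException("simulated unchecked failure");
        }
    }

    public static void processIncrementalBlockReport(boolean fail)
            throws IOException {
        boolean successful = false;
        try {
            doProcess(fail);
            successful = true;
        } finally {
            if (!successful) {
                // Equivalent of node.setForceRegistration(true), reached on
                // any failure, checked or unchecked.
                forceRegistration = true;
            }
        }
    }

    public static void main(String[] args) {
        try {
            processIncrementalBlockReport(true);
        } catch (RuntimeException expected) {
            // the unchecked exception still propagates to the caller
        } catch (IOException impossible) {
        }
        System.out.println("forceRegistration=" + forceRegistration); // true
    }
}
```

Because nothing is caught and rethrown, the method compiles with `throws IOException` even at older source levels.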






[jira] [Created] (HDFS-12485) expunge may not remove trash from non-home directory encryption zone

2017-09-18 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-12485:
--

 Summary: expunge may not remove trash from non-home directory 
encryption zone
 Key: HDFS-12485
 URL: https://issues.apache.org/jira/browse/HDFS-12485
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0-alpha1, 2.8.0
Reporter: Wei-Chiu Chuang
Assignee: Wei-Chiu Chuang


If I log into Linux as root and then log in as the superuser h...@example.com
{noformat}
[root@nightly511-1 ~]# hdfs dfs -rm /scale/b
17/09/18 15:21:32 INFO fs.TrashPolicyDefault: Moved: 'hdfs://ns1/scale/b' to 
trash at: hdfs://ns1/scale/.Trash/hdfs/Current/scale/b
[root@nightly511-1 ~]# hdfs dfs -expunge
17/09/18 15:21:59 INFO fs.TrashPolicyDefault: 
TrashPolicyDefault#deleteCheckpoint for trashRoot: hdfs://ns1/user/hdfs/.Trash
17/09/18 15:21:59 INFO fs.TrashPolicyDefault: 
TrashPolicyDefault#deleteCheckpoint for trashRoot: hdfs://ns1/user/hdfs/.Trash
17/09/18 15:21:59 INFO fs.TrashPolicyDefault: Deleted trash checkpoint: 
/user/hdfs/.Trash/170918143916
17/09/18 15:21:59 INFO fs.TrashPolicyDefault: 
TrashPolicyDefault#createCheckpoint for trashRoot: hdfs://ns1/user/hdfs/.Trash
[root@nightly511-1 ~]# hdfs dfs -ls hdfs://ns1/scale/.Trash/hdfs/Current/scale/b
-rw-r--r--   3 hdfs systest  0 2017-09-18 15:21 
hdfs://ns1/scale/.Trash/hdfs/Current/scale/b
{noformat}

expunge does not remove the trash under /scale, because it does not know I am 
the 'hdfs' user.

{code:title=DistributedFileSystem#getTrashRoots}
Path ezTrashRoot = new Path(it.next().getPath(),
FileSystem.TRASH_PREFIX);
if (!exists(ezTrashRoot)) {
  continue;
}
if (allUsers) {
  for (FileStatus candidate : listStatus(ezTrashRoot)) {
if (exists(candidate.getPath())) {
  ret.add(candidate);
}
  }
} else {
  Path userTrash = new Path(ezTrashRoot, System.getProperty(
  "user.name")); --> bug
  try {
ret.add(getFileStatus(userTrash));
  } catch (FileNotFoundException ignored) {
  }
}
{code}

It should use the UGI for the user name, rather than the system login user name.
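
A self-contained sketch of the fix direction: derive the per-user trash path from the effective Hadoop user rather than from the JVM's OS-level "user.name" property. In real code the name would come from {{UserGroupInformation.getCurrentUser().getShortUserName()}}; here both name sources are modeled as plain strings (the paths and user names below are hypothetical):

```java
public class TrashRootSketch {
    public static String userTrashPath(String ezTrashRoot, String userName) {
        return ezTrashRoot + "/" + userName;
    }

    public static void main(String[] args) {
        String ezTrashRoot = "hdfs://ns1/scale/.Trash";
        String osLogin = "root"; // what System.getProperty("user.name") returns here
        String ugiUser = "hdfs"; // short name of the authenticated principal
        // Buggy lookup misses the trash directory that actually exists:
        System.out.println(userTrashPath(ezTrashRoot, osLogin));
        // Fixed lookup matches the listing shown in the report:
        System.out.println(userTrashPath(ezTrashRoot, ugiUser));
    }
}
```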






[jira] [Created] (HDFS-12484) hdfs dfs -expunge requires superuser permission after 2.8

2017-09-18 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-12484:
--

 Summary: hdfs dfs -expunge requires superuser permission after 2.8
 Key: HDFS-12484
 URL: https://issues.apache.org/jira/browse/HDFS-12484
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: fs
Affects Versions: 3.0.0-alpha1, 2.8.0
Reporter: Wei-Chiu Chuang
Assignee: Wei-Chiu Chuang


Hadoop 2.8 added a feature to support trash inside encryption zones.

However, it breaks the existing -expunge semantics, because a user must now have 
superuser permission in order to -expunge. The reason is that -expunge gets all 
encryption zone paths using DFSClient#listEncryptionZones, which requires 
superuser permission.

I am not sure what the best way to address this is, so I am filing this jira to 
invite comments.






If primary replica is unresponsive, hsync() hangs

2017-09-11 Thread Wei-Chiu Chuang
Hello my dear HDFS dev colleagues,

It appears that when a DFS client writes and calls hsync(), if the primary
replica (that is, the first DataNode in the write pipeline) is unresponsive
to the hsync() request, the hsync() call waits at
DataStreamer#waitForAckedSeqno().

In one scenario, we saw this behavior when the primary DataNode has a flaky
disk drive controller, and DataNode was thus unable to write back ack to
client because it was unable to write to the disk successfully. The client
is a Flume agent and it finally bailed out after 180 seconds.

My question is: why doesn't hsync() replace bad DataNodes in the pipeline
just like the typical write pipeline failure recovery?

I would like to understand if this is intended before I file a jira and
post a patch.

Thanks,
Wei-Chiu
-- 
A very happy Hadoop contributor


[jira] [Created] (HDFS-12372) Document the impact of HDFS-11069 (Tighten the authorization of datanode RPC)

2017-08-29 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-12372:
--

 Summary: Document the impact of HDFS-11069 (Tighten the 
authorization of datanode RPC)
 Key: HDFS-12372
 URL: https://issues.apache.org/jira/browse/HDFS-12372
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: Wei-Chiu Chuang
Assignee: Wei-Chiu Chuang


The idea of HDFS-11069 is good, but it seems to cause confusion for 
administrators when they issue commands like hdfs diskbalancer or hdfs 
dfsadmin, because this change of behavior is not documented properly.

I suggest we document a recommended way to kinit (e.g. kinit as 
hdfs/ho...@host1.example.com rather than h...@example.com), as well as 
document a notice for running privileged DataNode commands in a Kerberized 
cluster.






[jira] [Created] (HDFS-12293) DataNode should log file name on disk error

2017-08-11 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-12293:
--

 Summary: DataNode should log file name on disk error
 Key: HDFS-12293
 URL: https://issues.apache.org/jira/browse/HDFS-12293
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode
Reporter: Wei-Chiu Chuang


Found the following error message in precommit build 
https://builds.apache.org/job/hadoop-qbt-trunk-java8-linux-x86/488/testReport/junit/org.apache.hadoop.hdfs.server.datanode/TestDataNodeVolumeFailureReporting/testSuccessiveVolumeFailures/

{noformat}
2017-08-10 09:36:53,619 [DataXceiver for client 
DFSClient_NONMAPREDUCE_670847838_18 at /127.0.0.1:55851 [Receiving block 
BP-219227751-172.17.0.2-1502357801473:blk_1073741829_1005]] WARN  
datanode.DataNode (BlockReceiver.java:(287)) - IOException in 
BlockReceiver constructor. Cause is 
java.io.IOException: Not a directory
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.createNewFile(File.java:1012)
at 
org.apache.hadoop.hdfs.server.datanode.FileIoProvider.createFile(FileIoProvider.java:302)
at 
org.apache.hadoop.hdfs.server.datanode.DatanodeUtil.createFileWithExistsCheck(DatanodeUtil.java:69)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.BlockPoolSlice.createRbwFile(BlockPoolSlice.java:306)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbwFile(FsVolumeImpl.java:933)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeImpl.createRbw(FsVolumeImpl.java:1202)
at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createRbw(FsDatasetImpl.java:1356)
at 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.(BlockReceiver.java:215)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.getBlockReceiver(DataXceiver.java:1291)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:758)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:173)
at 
org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:107)
at 
org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:290)
{noformat}

It is not known what file was being created.
What's interesting is that {{DatanodeUtil#createFileWithExistsCheck}} does 
carry the file name in its log message, but the exception handlers at 
{{DataTransfer#run()}} and {{BlockReceiver#BlockReceiver}} ignore it:

{code:title=BlockReceiver#BlockReceiver}
  // check if there is a disk error
  IOException cause = DatanodeUtil.getCauseIfDiskError(ioe);
  DataNode.LOG.warn("IOException in BlockReceiver constructor"
  + (cause == null ? "" : ". Cause is "), cause);
  if (cause != null) {
ioe = cause;
// Volume error check moved to FileIoProvider
  }
{code}
The logger should print the file name in addition to the cause.






[jira] [Created] (HDFS-12279) TestPipelinesFailover#testPipelineRecoveryStress fails due to race condition

2017-08-08 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-12279:
--

 Summary: TestPipelinesFailover#testPipelineRecoveryStress fails 
due to race condition
 Key: HDFS-12279
 URL: https://issues.apache.org/jira/browse/HDFS-12279
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode, test
Reporter: Wei-Chiu Chuang


Saw a test failure in a precommit test
https://builds.apache.org/job/PreCommit-HDFS-Build/20600/testReport/org.apache.hadoop.hdfs.server.namenode.ha/TestPipelinesFailover/testPipelineRecoveryStress/

{noformat}
Error Message

Deferred
Stacktrace

java.lang.RuntimeException: Deferred
at 
org.apache.hadoop.test.MultithreadedTestUtil$TestContext.checkException(MultithreadedTestUtil.java:130)
at 
org.apache.hadoop.test.MultithreadedTestUtil$TestContext.stop(MultithreadedTestUtil.java:166)
at 
org.apache.hadoop.hdfs.server.namenode.ha.HAStressTestHarness.shutdown(HAStressTestHarness.java:154)
at 
org.apache.hadoop.hdfs.server.namenode.ha.TestPipelinesFailover.testPipelineRecoveryStress(TestPipelinesFailover.java:493)
Caused by: java.lang.AssertionError: null
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.addBlocksToBeInvalidated(DatanodeDescriptor.java:641)
at 
org.apache.hadoop.hdfs.server.blockmanagement.InvalidateBlocks.invalidateWork(InvalidateBlocks.java:299)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.invalidateWorkForOneNode(BlockManager.java:4236)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeInvalidateWork(BlockManager.java:1736)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManagerTestUtil.computeInvalidationWork(BlockManagerTestUtil.java:169)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManagerTestUtil.computeAllPendingWork(BlockManagerTestUtil.java:185)
at 
org.apache.hadoop.hdfs.server.namenode.ha.HAStressTestHarness$1.doAnAction(HAStressTestHarness.java:102)
at 
org.apache.hadoop.test.MultithreadedTestUtil$RepeatingTestThread.doWork(MultithreadedTestUtil.java:222)
at 
org.apache.hadoop.test.MultithreadedTestUtil$TestingThread.run(MultithreadedTestUtil.java:189)
{noformat}


Studying the code, the assert can only fail due to a race condition that only 
happens in the test.

Specifically, the test uses BlockManagerTestUtil to call 
{{BlockManager#computeInvalidateWork}}, which gets 
{{invalidateBlocks.getDatanodes()}}. It then uses the list to perform block 
invalidation via {{InvalidateBlocks#invalidateWork}}, which calls 
{{DatanodeDescriptor#addBlocksToBeInvalidated}}, where an assertion ensures 
the invalidation list is not empty. However, if the BlockManager performs 
block invalidation before 
{{DatanodeDescriptor#addBlocksToBeInvalidated}} runs, the invalidation list can 
be empty, because there is no proper lock to ensure atomicity.

This is not a problem for a real cluster, because there is only one BlockManager 
per NameNode process, so the potential race condition is not exposed.






Re: [VOTE] Release Apache Hadoop 2.7.4 (RC0)

2017-08-04 Thread Wei-Chiu Chuang
Hi,
I'm sorry for coming to this vote late.
Daryn mentioned in HDFS-12136 that HDFS-11160 has a performance
regression in the DataNode due to the way it takes the dataset lock.

HDFS-11160 is in Hadoop 2.7.4. Would it be critical enough to warrant
holding the release?

I myself can't reproduce the performance regression (assuming it only
occurs under extreme workload). Would Daryn or other Yahoo folks comment?

On Thu, Aug 3, 2017 at 10:31 PM, Akira Ajisaka  wrote:

> +1 (binding)
>
> - Verified the checksum and the signature of the source tarball
> - Built from source with CentOS 7.2 and OpenJDK 1.8.0_141
> - Built Hive 2.1.0/2.3.0 and Tez 0.8.5/0.9.0 with Hadoop 2.7.4 artifacts
> - Built single node cluster and ran some Hive on Tez queries successfully
>
> Regards,
> Akira
>
>
> On 2017/08/04 0:25, Kuhu Shukla wrote:
>
>> +1 (non-binding)
>>
>> 1. Verified signatures and digests.
>> 2. Built source.
>> 3. Installed on a pseudo-distributed cluster.
>> 4. Ran sample MR jobs and Tez example jobs like orderedwordcount
>> successfully.
>>
>> Thank you Konstantin and others for this release.
>>
>> Regards,
>> Kuhu
>>
>>
>>
>> On Thursday, August 3, 2017, 7:19:07 AM CDT, Sunil G 
>> wrote:
>>
>>
>> Thanks Konstantin
>>
>> +1 (binding)
>>
>> 1. Build tar ball from source package
>> 2. Ran basic MR jobs and verified UI.
>> 3. Enabled node labels and ran sleep job. Works fine.
>> 4. Verified CLI commands related to node labels and its working fine.
>> 5. RM WorkPreserving restart cases are also verified, and looks fine
>>
>> Thanks
>> Sunil
>>
>>
>>
On Sun, Jul 30, 2017 at 4:59 AM Konstantin Shvachko
>> wrote:
>>
>> Hi everybody,
>>>
>>> Here is the next release of Apache Hadoop 2.7 line. The previous stable
>>> release 2.7.3 was available since 25 August, 2016.
>>> Release 2.7.4 includes 264 issues fixed after release 2.7.3, which are
>>> critical bug fixes and major optimizations. See more details in Release
>>> Note:
>>> http://home.apache.org/~shv/hadoop-2.7.4-RC0/releasenotes.html
>>>
>>> The RC0 is available at: http://home.apache.org/~shv/hadoop-2.7.4-RC0/
>>>
>>> Please give it a try and vote on this thread. The vote will run for 5
>>> days
>>> ending 08/04/2017.
>>>
>>> Please note that my up to date public key are available from:
>>> https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
>>> Please don't forget to refresh the page if you've been there recently.
>>> There are other place on Apache sites, which may contain my outdated key.
>>>
>>> Thanks,
>>> --Konstantin
>>>
>>>
>
>


-- 
A very happy Clouderan


Re: How to restore data from HDFS rm -skipTrash

2017-08-04 Thread Wei-Chiu Chuang
If the directory has snapshot enabled, the file can be retrieved from the
past snapshots.

Otherwise, the file inodes are removed from the namenode metadata, and the 
blocks are scheduled for deletion.
You might want to play with the edit log a bit: remove the delete entries from
the edit logs. But that is hacky and does not guarantee the blocks are still there.


On Thu, Aug 3, 2017 at 8:38 PM, panfei  wrote:

> -- Forwarded message --
> From: panfei 
> Date: 2017-08-04 11:23 GMT+08:00
> Subject: How to restore data from HDFS rm -skipTrash
> To: CDH Users 
>
>
> someone mistakenly did a rm -skipTrash operation on the HDFS, but we stopped
> the namenode and datanodes immediately. (CDH 5.4.5)
>
> I want to know is there any way to stop the deletion process ?
>
> and how ?
>
> thanks very much in advance.
>



-- 
A very happy Hadoop contributor


[jira] [Created] (HDFS-12249) dfsadmin -metaSave to output maintenance mode blocks

2017-08-02 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-12249:
--

 Summary: dfsadmin -metaSave to output maintenance mode blocks
 Key: HDFS-12249
 URL: https://issues.apache.org/jira/browse/HDFS-12249
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Wei-Chiu Chuang
Assignee: Wellington Chevreuil
Priority: Minor


Found while reviewing for HDFS-12182.

{quote}
After the patch, the output of metaSave is:
Live Datanodes: 0
Dead Datanodes: 0
Metasave: Blocks waiting for reconstruction: 0
Metasave: Blocks currently missing: 1
file16387: blk_0_1 MISSING (replicas: l: 0 d: 0 c: 2 e: 0)  
1.1.1.1:9866(corrupt) (block deletions maybe out of date) :  
2.2.2.2:9866(corrupt) (block deletions maybe out of date) : 
Mis-replicated blocks that have been postponed:
Metasave: Blocks being reconstructed: 0
Metasave: Blocks 0 waiting deletion from 0 datanodes.
Corrupt Blocks:
Block=0 Node=1.1.1.1:9866   StorageID=s1StorageState=NORMAL 
TotalReplicas=2 Reason=GENSTAMP_MISMATCH
Block=0 Node=2.2.2.2:9866   StorageID=s2StorageState=NORMAL 
TotalReplicas=2 Reason=GENSTAMP_MISMATCH
Metasave: Number of datanodes: 0
{quote}

{quote}
Looking at the output
The output is not user friendly — The meaning of "(replicas: l: 0 d: 0 c: 2 e: 
0)" is not obvious without looking at the code.
Also, it should print maintenance mode replicas.
{quote}






[jira] [Created] (HDFS-12245) Update INodeId javadoc

2017-08-01 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-12245:
--

 Summary: Update INodeId javadoc
 Key: HDFS-12245
 URL: https://issues.apache.org/jira/browse/HDFS-12245
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Reporter: Wei-Chiu Chuang


The INodeId javadoc states that ids 1 to 1000 are reserved and the root inode 
id starts from 1001. That is no longer true after HDFS-4434.

Also, it's a little weird in INodeId
{code}
  public static final long LAST_RESERVED_ID = 2 << 14 - 1;
  public static final long ROOT_INODE_ID = LAST_RESERVED_ID + 1;
{code}
It seems the intent was for LAST_RESERVED_ID to be (2 << 14) - 1 = 32767. But 
because subtraction binds tighter than the shift operator in Java, the 
expression parses as 2 << (14 - 1) = 16384. Maybe it doesn't matter, not sure.
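
A short, self-contained demonstration of the precedence pitfall (the class name is invented; only the two constant expressions come from INodeId):

```java
public class ShiftPrecedence {
    public static void main(String[] args) {
        // In Java, binary '-' binds tighter than '<<',
        // so `2 << 14 - 1` parses as `2 << (14 - 1)`.
        long actual   = 2 << 14 - 1;    // 2 << 13
        long intended = (2 << 14) - 1;  // explicit parentheses
        System.out.println(actual);     // 16384
        System.out.println(intended);   // 32767
    }
}
```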






[jira] [Created] (HDFS-12243) Trash emptier should use Time.monotonicNow()

2017-08-01 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-12243:
--

 Summary: Trash emptier should use Time.monotonicNow()
 Key: HDFS-12243
 URL: https://issues.apache.org/jira/browse/HDFS-12243
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: fs
Reporter: Wei-Chiu Chuang
Assignee: Wei-Chiu Chuang
Priority: Minor









[jira] [Created] (HDFS-12241) HttpFS to support overloaded FileSystem#rename API

2017-08-01 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-12241:
--

 Summary: HttpFS to support overloaded FileSystem#rename API
 Key: HDFS-12241
 URL: https://issues.apache.org/jira/browse/HDFS-12241
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: httpfs
Reporter: Wei-Chiu Chuang
Assignee: Wei-Chiu Chuang


HttpFS is essentially meant to be at parity with WebHDFS, but it does not 
implement 
{{FileSystem#rename(final Path src, final Path dst, final Rename... options)}}, 
which means it does not support trash.






[jira] [Created] (HDFS-12240) Document WebHDFS rename API parameter renameoptions

2017-08-01 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-12240:
--

 Summary: Document WebHDFS rename API parameter renameoptions
 Key: HDFS-12240
 URL: https://issues.apache.org/jira/browse/HDFS-12240
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Wei-Chiu Chuang


The {{FileSystem#rename}} API has an overloaded version that carries an extra 
parameter "renameoptions". The extra parameter can be used to support trash or 
to support overwriting.

The WebHDFS REST API does not document this parameter, so I am filing this jira 
to get it documented.






[jira] [Resolved] (HDFS-10799) NameNode should use loginUser(hdfs) to serve iNotify requests

2017-07-27 Thread Wei-Chiu Chuang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-10799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-10799.

Resolution: Won't Fix

Closing this jira because the proposed solution does not seem appropriate. As I 
explained earlier, the correct fix for this problem should be on the client 
side, which is supposed to renew its Kerberos credential before it expires.

> NameNode should use loginUser(hdfs) to serve iNotify requests
> -
>
> Key: HDFS-10799
> URL: https://issues.apache.org/jira/browse/HDFS-10799
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.6.0
> Environment: Kerberized, HA cluster, iNotify client, CDH5.7.0
>Reporter: Wei-Chiu Chuang
>    Assignee: Wei-Chiu Chuang
> Attachments: HDFS-10799.001.patch
>
>
> When a NameNode serves iNotify requests from a client, it verifies the client 
> has superuser permission and then uses the client's Kerberos principal to 
> read edits from journal nodes.
> However, if the client does not renew its tgt tickets, the connection from 
> the NameNode to the journal nodes may fail, in which case the NameNode thinks 
> the edits are corrupt and prints a scary error message:
> "During automatic edit log failover, we noticed that all of the remaining 
> edit log streams are shorter than the current one!  The best remaining edit 
> log ends at transaction 11577603, but we thought we could read up to 
> transaction 11577606.  If you continue, metadata will be lost forever!"
> However, the edits are actually good. NameNode _should not freak out when an 
> iNotify client's tgt ticket expires_.
> I think that an easy solution to this bug, is that after NameNode verifies 
> client has superuser permission, call {{SecurityUtil.doAsLoginUser}} and then 
> read edits. This will make sure the operation does not fail due to an expired 
> client ticket.
> Excerpt of related logs:
> {noformat}
> 2016-08-18 19:05:13,979 WARN org.apache.hadoop.security.UserGroupInformation: 
> PriviledgedActionException as:h...@example.com (auth:KERBEROS) 
> cause:java.io.IOException: We encountered an error reading 
> http://jn1.example.com:8480/getJournal?jid=nameservice1=11577487=yyy,
>  
> http://jn1.example.com:8480/getJournal?jid=nameservice1=11577487=yyy.
>   During automatic edit log failover, we noticed that all of the remaining 
> edit log streams are shorter than the current one!  The best remaining edit 
> log ends at transaction 11577603, but we thought we could read up to 
> transaction 11577606.  If you continue, metadata will be lost forever!
> 2016-08-18 19:05:13,979 INFO org.apache.hadoop.ipc.Server: IPC Server handler 
> 112 on 8020, call 
> org.apache.hadoop.hdfs.protocol.ClientProtocol.getEditsFromTxid from [client 
> IP:port] Call#73 Retry#0
> java.io.IOException: We encountered an error reading 
> http://jn1.example.com:8480/getJournal?jid=nameservice1=11577487=yyy,
>  
> http://jn1.example.com:8480/getJournal?jid=nameservice1=11577487=yyy.
>   During automatic edit log failover, we noticed that all of the remaining 
> edit log streams are shorter than the current one!  The best remaining edit 
> log ends at transaction 11577603, but we thought we could read up to 
> transaction 11577606.  If you continue, metadata will be lost forever!
> at 
> org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:213)
> at 
> org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.readOp(NameNodeRpcServer.java:1674)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getEditsFromTxid(NameNodeRpcServer.java:1736)
> at 
> org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.getEditsFromTxid(AuthorizationProviderProxyClientProtocol.java:1010)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getEditsFromTxid(ClientNamenodeProtocolServerSideTranslatorPB.java:1475)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
> at org.apache.hadoop.ipc.Server$Handler$1.r

[jira] [Created] (HDFS-12186) Add INodeAttributeProvider startup progress into HDFS Web UI

2017-07-21 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-12186:
--

 Summary: Add INodeAttributeProvider startup progress into HDFS Web 
UI 
 Key: HDFS-12186
 URL: https://issues.apache.org/jira/browse/HDFS-12186
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: ui
Reporter: Wei-Chiu Chuang


For a cluster of substantial size, an INodeAttributeProvider may take a long 
time to initialize. We saw a large cluster where part of the file system ACLs 
are managed by Apache Sentry, and the NameNode took a few minutes for the 
Sentry HDFS NameNode Plugin to initialize. I suppose the same issue can arise 
for Apache Ranger and other INodeAttributeProvider implementations.

It would be nice to add an extra row to the NameNode Web UI startup progress, in 
addition to "Loading fsimage", "Loading edits", "Saving checkpoint" and "Safe 
mode", to give better visibility into what the NameNode is doing.

In addition, there might also be a need to add a similar row to the Web UI for 
loading NameNode plugins, so I am filing this jira to invite more discussion.






[jira] [Created] (HDFS-12176) dfsadmin shows DFS Used%: NaN% if the cluster has zero block.

2017-07-20 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-12176:
--

 Summary: dfsadmin shows DFS Used%: NaN% if the cluster has zero 
block.
 Key: HDFS-12176
 URL: https://issues.apache.org/jira/browse/HDFS-12176
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Wei-Chiu Chuang
Priority: Trivial


This is rather a non-issue, but I thought I should file it anyway.

I have a test cluster with just NN fsimage, no DN, no blocks, and dfsadmin 
shows:

{noformat}
$ hdfs dfsadmin -report
Configured Capacity: 0 (0 B)
Present Capacity: 0 (0 B)
DFS Remaining: 0 (0 B)
DFS Used: 0 (0 B)
DFS Used%: NaN%
{noformat}
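
The NaN comes from computing percent used as used/capacity*100 in floating point: with zero blocks both terms are zero, and 0.0f/0.0f is NaN. A hypothetical guard (this is a sketch, not the actual Hadoop code) would special-case a zero-capacity cluster:

```java
public class UsedPercent {
    // Hypothetical guard: report 0% on an empty cluster instead of NaN.
    public static float getPercentUsed(long used, long capacity) {
        return capacity == 0 ? 0.0f : used * 100.0f / capacity;
    }

    public static void main(String[] args) {
        System.out.println(0L * 100.0f / 0L);     // NaN, as dfsadmin reported
        System.out.println(getPercentUsed(0, 0)); // 0.0 with the guard
    }
}
```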








[jira] [Created] (HDFS-12165) getSnapshotDiffReport throws NegativeArraySizeException for very large snapshot diff summary

2017-07-19 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-12165:
--

 Summary: getSnapshotDiffReport throws NegativeArraySizeException 
for very large snapshot diff summary
 Key: HDFS-12165
 URL: https://issues.apache.org/jira/browse/HDFS-12165
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Wei-Chiu Chuang


For a really large snapshot diff, getSnapshotDiffReport throws 
NegativeArraySizeException
{noformat}
2017-07-19 11:14:16,415 WARN org.apache.hadoop.ipc.Server: Error serializing 
call response for call 
org.apache.hadoop.hdfs.protocol.ClientProtocol.getSnapshotDiffReport
 from 10.17.211.10:58223 Call#0 Retry#0
java.lang.NegativeArraySizeException
at 
com.google.protobuf.CodedOutputStream.newInstance(CodedOutputStream.java:105)
at 
com.google.protobuf.AbstractMessageLite.writeDelimitedTo(AbstractMessageLite.java:87)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$RpcResponseWrapper.write(ProtobufRpcEngine.java:468)
at org.apache.hadoop.ipc.Server.setupResponse(Server.java:2410)
at org.apache.hadoop.ipc.Server.access$500(Server.java:134)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2182)
{noformat}

This particular snapshot diff contains more than 25 million file 
system objects, which means the serialized response can exceed 2 GB, 
overflowing protobuf's length calculation.
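The mechanism behind the exception can be sketched without protobuf at all: message sizes are computed as a Java `int`, so a serialized size past `Integer.MAX_VALUE` (~2 GiB) wraps negative, and the subsequent buffer allocation throws. This is an illustration of the arithmetic, not the actual protobuf code path:

```java
public class ProtoSizeOverflow {
    public static void main(String[] args) {
        // A size just past Integer.MAX_VALUE wraps to a negative int.
        int size = Integer.MAX_VALUE;
        size += 1;                    // wraps to Integer.MIN_VALUE
        System.out.println(size);     // -2147483648

        try {
            // What the buffer allocation effectively attempts with the
            // wrapped size, producing the exception seen in the log.
            byte[] buf = new byte[size];
            System.out.println(buf.length);  // never reached
        } catch (NegativeArraySizeException e) {
            System.out.println("NegativeArraySizeException");
        }
    }
}
```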






[jira] [Created] (HDFS-12112) TestBlockManager#testBlockManagerMachinesArray sometimes fails with NPE

2017-07-10 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-12112:
--

 Summary: TestBlockManager#testBlockManagerMachinesArray sometimes 
fails with NPE
 Key: HDFS-12112
 URL: https://issues.apache.org/jira/browse/HDFS-12112
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.0.0-beta1
 Environment: CDH5.12.0
Reporter: Wei-Chiu Chuang
Assignee: Wei-Chiu Chuang
Priority: Minor


Found the following error:
{quote}
java.lang.NullPointerException: null
at 
org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testBlockManagerMachinesArray(TestBlockManager.java:1202)
{quote}
The NPE suggests corruptStorageDataNode in the following code snippet could be 
null.
{code}
for(int i=0; i<corruptStorageDataNode.getStorageInfos().length; i++) {
{code}

Looking at the code, the test does not wait for file replication to happen, 
which is why corruptStorageDataNode (the DN of the second replica) is null.
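A typical fix for this kind of race is to poll until replication has produced the second replica before dereferencing it, rather than asserting immediately. Hadoop tests usually do this via `GenericTestUtils.waitFor`; the standalone sketch below shows the same polling idea with plain Java (all names here are illustrative):

```java
import java.util.function.Supplier;

public class WaitFor {
    // Block until the condition holds or the timeout elapses, instead of
    // racing ahead and hitting an NPE on a not-yet-populated reference.
    static boolean waitFor(Supplier<Boolean> condition, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (condition.get()) {
                return true;
            }
            Thread.sleep(10);  // poll interval
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulates replication completing asynchronously: the "second
        // replica's DataNode" becomes non-null on another thread.
        String[] replicaDataNode = {null};
        new Thread(() -> replicaDataNode[0] = "dn2").start();
        System.out.println(waitFor(() -> replicaDataNode[0] != null, 5000));
    }
}
```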






[jira] [Created] (HDFS-12062) removeErasureCodingPolicy needs super user permission

2017-06-28 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-12062:
--

 Summary: removeErasureCodingPolicy needs super user permission
 Key: HDFS-12062
 URL: https://issues.apache.org/jira/browse/HDFS-12062
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: erasure-coding
Reporter: Wei-Chiu Chuang
Assignee: Wei-Chiu Chuang
Priority: Critical


Currently {{NameNodeRPCServer#removeErasureCodingPolicy}} does not require 
super user permission. This is not appropriate as 
{{NameNodeRPCServer#addErasureCodingPolicies}} requires super user permission.
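The expected fix is to reject non-superusers before doing any work, mirroring the guard that {{addErasureCodingPolicies}} already has. A minimal standalone sketch of that guard (the parameter and exception here are simplified stand-ins for the NameNode's actual permission-checking machinery):

```java
public class SuperuserCheck {
    // Illustrative guard: refuse the operation up front for non-superusers,
    // the same way the corresponding "add" RPC does.
    static void removeErasureCodingPolicy(String policy, boolean isSuperUser) {
        if (!isSuperUser) {
            throw new SecurityException(
                "Access denied: only the superuser may remove EC policy "
                + policy);
        }
        // ... proceed to unregister the policy ...
    }

    public static void main(String[] args) {
        try {
            removeErasureCodingPolicy("RS-6-3-1024k", false);
        } catch (SecurityException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        removeErasureCodingPolicy("RS-6-3-1024k", true);
        System.out.println("superuser allowed");
    }
}
```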






[jira] [Created] (HDFS-12061) Add TraceScope for multiple DFSClient EC operations

2017-06-28 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-12061:
--

 Summary: Add TraceScope for multiple DFSClient EC operations
 Key: HDFS-12061
 URL: https://issues.apache.org/jira/browse/HDFS-12061
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: hdfs
Affects Versions: 3.0.0-alpha4
Reporter: Wei-Chiu Chuang
Priority: Minor


A number of DFSClient EC operations, including addErasureCodingPolicies, 
removeErasureCodingPolicy, enableErasureCodingPolicy, and 
disableErasureCodingPolicy, do not have a TraceScope similar to this:
{code}
try (TraceScope ignored = tracer.newScope("getErasureCodingCodecs")) {
}
{code}






[jira] [Created] (HDFS-12036) Add audit log for getErasureCodingPolicy, getErasureCodingPolicies, getErasureCodingCodecs

2017-06-26 Thread Wei-Chiu Chuang (JIRA)
Wei-Chiu Chuang created HDFS-12036:
--

 Summary: Add audit log for getErasureCodingPolicy, 
getErasureCodingPolicies, getErasureCodingCodecs
 Key: HDFS-12036
 URL: https://issues.apache.org/jira/browse/HDFS-12036
 Project: Hadoop HDFS
  Issue Type: Improvement
  Components: namenode
Affects Versions: 3.0.0-alpha4
Reporter: Wei-Chiu Chuang


These three FSNamesystem operations do not record audit logs. I am not sure 
how useful these audit logs would be, but thought I should file this jira so 
that they don't get dropped if they turn out to be needed.
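The pattern being requested is the one FSNamesystem applies to its other operations: record the operation name and outcome after every call, success or failure. A simplified standalone sketch (the `logAuditEvent` signature and the placeholder result below are illustrative, not the actual FSNamesystem code):

```java
public class AuditSketch {
    // Simplified stand-in for FSNamesystem's audit logger: record whether
    // the operation was allowed, its name, and the path it touched.
    static void logAuditEvent(boolean succeeded, String cmd, String src) {
        System.out.println(
            "allowed=" + succeeded + " cmd=" + cmd + " src=" + src);
    }

    static String getErasureCodingPolicy(String src) {
        boolean success = false;
        try {
            String policy = "RS-6-3-1024k";  // placeholder result
            success = true;
            return policy;
        } finally {
            // finally-block logging captures both success and failure paths.
            logAuditEvent(success, "getErasureCodingPolicy", src);
        }
    }

    public static void main(String[] args) {
        System.out.println(getErasureCodingPolicy("/data"));
    }
}
```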






[jira] [Resolved] (HDFS-11661) GetContentSummary uses excessive amounts of memory

2017-05-24 Thread Wei-Chiu Chuang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-11661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wei-Chiu Chuang resolved HDFS-11661.

   Resolution: Fixed
 Hadoop Flags: Reviewed
Fix Version/s: 2.8.1
   3.0.0-alpha3
 Release Note: Reverted HDFS-10797 to fix a scalability regression brought 
by the commit.

Based on multiple +1s, I reverted the commit from branch-2.8, branch-2 and trunk.

Thanks to [~nroberts] for reporting the issue, and comments from [~kihwal], 
[~mackrorysd], [~xiaochen] [~djp] [~andrew.wang] [~shahrs87] [~yzhangal] and 
[~daryn].

[~daryn] thanks for your effort trying to fix the bug. Please file a new jira 
for your patch. Thanks!

> GetContentSummary uses excessive amounts of memory
> --
>
> Key: HDFS-11661
> URL: https://issues.apache.org/jira/browse/HDFS-11661
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 2.8.0, 3.0.0-alpha2
>Reporter: Nathan Roberts
>Assignee: Wei-Chiu Chuang
>Priority: Blocker
> Fix For: 3.0.0-alpha3, 2.8.1
>
> Attachments: HDFS-11661.001.patch, HDFs-11661.002.patch, Heap 
> growth.png
>
>
> ContentSummaryComputationContext::nodeIncluded() is being used to keep track 
> of all INodes visited during the current content summary calculation. This 
> can be all of the INodes in the filesystem, making for a VERY large hash 
> table. This simply won't work on large filesystems. 
> We noticed this after upgrading a namenode with ~100Million filesystem 
> objects was spending significantly more time in GC. Fortunately this system 
> had some memory breathing room, other clusters we have will not run with this 
> additional demand on memory.
> This was added as part of HDFS-10797 as a way of keeping track of INodes that 
> have already been accounted for - to avoid double counting.





