[jira] [Resolved] (YARN-9183) TestAMRMTokens fails

2019-01-09 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-9183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved YARN-9183.
--
Resolution: Done

HDFS-14084 was reverted so this should now be fixed.

> TestAMRMTokens fails
> 
>
> Key: YARN-9183
> URL: https://issues.apache.org/jira/browse/YARN-9183
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Akira Ajisaka
>Assignee: Abhishek Modi
>Priority: Blocker
>
> TestAMRMTokens.testMasterKeyRollOver and TestAMRMTokens.testTokenExpiry are 
> failing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



Re: [VOTE] Release Apache Hadoop 3.2.0 - RC0

2018-11-28 Thread Jason Lowe
Thanks for driving this release, Sunil!

+1 (binding)

- Verified signatures and digests
- Successfully performed a native build
- Deployed a single-node cluster
- Ran some sample jobs

Jason

On Fri, Nov 23, 2018 at 6:07 AM Sunil G  wrote:

> Hi folks,
>
>
>
> Thanks to all contributors who helped in this release [1]. I have created the
> first release candidate (RC0) for Apache Hadoop 3.2.0.
>
>
> Artifacts for this RC are available here:
>
> http://home.apache.org/~sunilg/hadoop-3.2.0-RC0/
>
>
>
> RC tag in git is release-3.2.0-RC0.
>
>
>
> The maven artifacts are available via repository.apache.org at
>
> https://repository.apache.org/content/repositories/orgapachehadoop-1174/
>
>
> This vote will run 7 days (5 weekdays), ending on Nov 30 at 11:59 pm PST.
>
>
>
> 3.2.0 contains 1079 [2] fixed JIRA issues since 3.1.0. The feature additions
> below are the highlights of this release.
>
> 1. Node Attributes Support in YARN
>
> 2. Hadoop Submarine project for running Deep Learning workloads on YARN
>
> 3. Support service upgrade via YARN Service API and CLI
>
> 4. HDFS Storage Policy Satisfier
>
> 5. Support Windows Azure Storage - Blob file system in Hadoop
>
> 6. Phase 3 improvements for S3Guard and Phase 5 improvements for S3A
>
> 7. Improvements in Router-based HDFS federation
>
>
>
> Thanks to Wangda, Vinod, Marton for helping me in preparing the release.
>
> I have done some testing with my pseudo cluster. My +1 to start.
>
>
>
> Regards,
>
> Sunil
>
>
>
> [1]
>
>
> https://lists.apache.org/thread.html/68c1745dcb65602aecce6f7e6b7f0af3d974b1bf0048e7823e58b06f@%3Cyarn-dev.hadoop.apache.org%3E
>
> [2] project in (YARN, HADOOP, MAPREDUCE, HDFS) AND fixVersion in (3.2.0)
> AND fixVersion not in (3.1.0, 3.0.0, 3.0.0-beta1) AND status = Resolved
> ORDER BY fixVersion ASC
>


Re: [VOTE] Release Apache Hadoop 2.9.2 (RC0)

2018-11-19 Thread Jason Lowe
Thanks for driving this release, Akira!

+1 (binding)

- Verified signatures and digests
- Successfully performed native build from source
- Deployed a single-node cluster and ran some sample jobs

Jason

On Tue, Nov 13, 2018 at 7:02 PM Akira Ajisaka  wrote:

> Hi folks,
>
> I have put together a release candidate (RC0) for Hadoop 2.9.2. It
> includes 204 bug fixes and improvements since 2.9.1. [1]
>
> The RC is available at http://home.apache.org/~aajisaka/hadoop-2.9.2-RC0/
> Git signed tag is release-2.9.2-RC0 and the checksum is
> 826afbeae31ca687bc2f8471dc841b66ed2c6704
> The maven artifacts are staged at
> https://repository.apache.org/content/repositories/orgapachehadoop-1166/
>
> You can find my public key at:
> https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
>
> Please try the release and vote. The vote will run for 5 days.
>
> [1] https://s.apache.org/2.9.2-fixed-jiras
>
> Thanks,
> Akira
>
> -
> To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
>
>


[jira] [Created] (YARN-9014) OCI/squashfs container runtime

2018-11-12 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-9014:


 Summary: OCI/squashfs container runtime
 Key: YARN-9014
 URL: https://issues.apache.org/jira/browse/YARN-9014
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Jason Lowe
Assignee: Jason Lowe


This JIRA tracks a YARN container runtime that supports running containers in 
images built by Docker but the runtime does not use Docker directly, and Docker 
does not have to be installed on the nodes.  The runtime leverages the [OCI 
runtime standard|https://github.com/opencontainers/runtime-spec] to launch 
containers, so an OCI-compliant runtime like {{runc}} is required.  {{runc}} 
has the benefit of not requiring a daemon like {{dockerd}} to be running in 
order to launch/control containers.

The layers comprising the Docker image are uploaded to HDFS as 
[squashfs|http://tldp.org/HOWTO/SquashFS-HOWTO/whatis.html] images, enabling 
the runtime to efficiently download and execute directly on the compressed 
layers.  This saves image unpack time and space on the local disk.  The image 
layers, like other entries in the YARN distributed cache, can be spread across 
the YARN local disks, increasing the available space for storing container 
images on each node.

A design document will be posted shortly.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8937) TestLeaderElectorService hangs

2018-10-23 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-8937:


 Summary: TestLeaderElectorService hangs
 Key: YARN-8937
 URL: https://issues.apache.org/jira/browse/YARN-8937
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.3.0
Reporter: Jason Lowe


TestLeaderElectorService hangs waiting for the TestingZooKeeperServer to start 
and eventually gets killed by the surefire timeout.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8928) TestRMAdminService is failing

2018-10-22 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-8928:


 Summary: TestRMAdminService is failing
 Key: YARN-8928
 URL: https://issues.apache.org/jira/browse/YARN-8928
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.3.0
Reporter: Jason Lowe


After HADOOP-15836 TestRMAdminService has started failing consistently.  Sample 
stacktraces to follow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8856) TestTimelineReaderWebServicesHBaseStorage tests failing with NoClassDefFoundError

2018-10-08 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-8856:


 Summary: TestTimelineReaderWebServicesHBaseStorage tests failing 
with NoClassDefFoundError
 Key: YARN-8856
 URL: https://issues.apache.org/jira/browse/YARN-8856
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jason Lowe


TestTimelineReaderWebServicesHBaseStorage has been failing in nightly builds 
with NoClassDefFoundError in the tests.  Sample error and stacktrace to follow.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-6091) the AppMaster register failed when use Docker on LinuxContainer

2018-10-03 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved YARN-6091.
--
   Resolution: Implemented
Fix Version/s: 3.1.1
   3.2.0

Closing this as fixed by YARN-7654.

> the AppMaster register failed when use Docker on LinuxContainer 
> 
>
> Key: YARN-6091
> URL: https://issues.apache.org/jira/browse/YARN-6091
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager, yarn
>Affects Versions: 2.8.1
> Environment: CentOS
>Reporter: zhengchenyu
>Assignee: Eric Badger
>Priority: Critical
>  Labels: Docker
> Fix For: 3.2.0, 3.1.1
>
> Attachments: YARN-6091.001.patch, YARN-6091.002.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> On some servers, when I use Docker on LinuxContainer, the ApplicationMaster 
> fails to register with the ResourceManager, but this does not happen on other 
> servers. 
> I found that pclose (in container-executor.c) returns different values on 
> different servers, even though the process launched by popen is running 
> normally. Some servers return 0, and others return 13. 
> Because YARN regards the application as failed when pclose returns nonzero, it 
> removes the AMRMToken, and the AppMaster registration then fails because the 
> ResourceManager has removed this application's token. 
> In container-executor.c, the judgement condition is whether the return code 
> is zero. But the pclose man page says that only a return of -1 indicates an 
> error. So I changed the judgement condition, which solves this problem. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8784) DockerLinuxContainerRuntime prevents access to distributed cache entries on a full disk

2018-09-17 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-8784:


 Summary: DockerLinuxContainerRuntime prevents access to 
distributed cache entries on a full disk
 Key: YARN-8784
 URL: https://issues.apache.org/jira/browse/YARN-8784
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.1.1, 3.2.0
Reporter: Jason Lowe


DockerLinuxContainerRuntime bind mounts the filecache and usercache directories 
into the container to allow tasks to access entries in the distributed cache.  
However, it only bind-mounts directories on disks that are considered good, and 
disks that are full or bad are not in that list.  If a container tries to run 
with a distributed cache entry that has been previously localized to a disk 
that is now considered full/bad, the dist cache directory will _not_ be 
bind-mounted into the container's filesystem namespace.  At that point any 
symlinks in the container's current working directory that point to those disks 
will reference invalid paths.
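
A hypothetical illustration of the mismatch (the paths and names below are made 
up, and this is not the nodemanager's actual mount-building code): only the good 
local dirs get bind-mounted, so an entry that was localized to a now-full disk is 
unreachable inside the container even though its symlink in the container's 
working directory still points there.
{code}
import java.util.List;

public class BindMountGapDemo {
  public static void main(String[] args) {
    // Suppose /grid/1 has filled up, so only /grid/0 is still "good" and mounted.
    List<String> mountedLocalDirs = List.of("/grid/0/yarn/local");
    String cachedJar = "/grid/1/yarn/local/usercache/alice/filecache/12/job.jar";

    boolean reachableInContainer =
        mountedLocalDirs.stream().anyMatch(cachedJar::startsWith);
    // Prints false: the symlink to job.jar in the container CWD dangles.
    System.out.println("reachable inside container: " + reachableInContainer);
  }
}
{code}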



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



Next Hadoop Contributors Meetup on September 25th

2018-09-13 Thread Jason Lowe
I am happy to announce that Oath will be hosting the next Hadoop
Contributors meetup on Tuesday, September 25th at Yahoo Building G, 589
Java Drive, Sunnyvale CA from 8:00AM to 6:00PM.

The agenda will look roughly as follows:

08:00AM - 08:30AM Arrival and Check-in
08:30AM - 12:00PM A series of brief talks with some of the topics including:
  - HDFS scalability and security
  - Use cases and future directions for Docker on YARN
  - Submarine (Deep Learning on YARN)
  - Hadoop in the cloud
  - Oath's use of machine learning, Vespa, and Storm
11:45AM - 12:30PM Lunch Break
12:30PM - 02:00PM Brief talks series resume
02:00PM - 04:30PM Parallel breakout sessions to discuss topics suggested by
attendees.  Some proposed topics include:
  - Improved security credentials management for long-running YARN
applications
  - Improved management of parallel development lines
  - Proposals for the next bug bash
  - Tez shuffle handler and DAG aware scheduler overview
04:30PM - 06:00PM Closing Reception

RSVP at https://www.meetup.com/Hadoop-Contributors/events/254012512/ is
REQUIRED to attend and spots are limited.  Security will be checking the
attendee list as you enter the building.

We will host a Google Hangouts/Meet so people who are interested but unable
to attend in person can participate remotely.  Details will be posted to
the meetup event.

Hope to see you there!

Jason


Re: [VOTE] Release Apache Hadoop 2.8.5 (RC0)

2018-09-10 Thread Jason Lowe
Thanks for driving the release, Junping!

+1 (binding)

- Verified signatures and digests
- Successfully performed a native build from source
- Successfully deployed a single-node cluster with the timeline server
- Ran some sample jobs and examined the web UI and job logs

Jason

On Mon, Sep 10, 2018 at 7:00 AM, 俊平堵  wrote:

> Hi all,
>
>  I've created the first release candidate (RC0) for Apache
> Hadoop 2.8.5. This is our next point release to follow up 2.8.4. It
> includes 33 important fixes and improvements.
>
>
> The RC artifacts are available at:
> http://home.apache.org/~junping_du/hadoop-2.8.5-RC0
>
>
> The RC tag in git is: release-2.8.5-RC0
>
>
>
> The maven artifacts are available via repository.apache.org at:
>
> https://repository.apache.org/content/repositories/orgapachehadoop-1140
>
>
> Please try the release and vote; the vote will run for the usual 5
> working
> days, ending on 9/15/2018 PST time.
>
>
> Thanks,
>
>
> Junping
>


[jira] [Created] (YARN-8730) TestRMWebServiceAppsNodelabel#testAppsRunning fails

2018-08-29 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-8730:


 Summary: TestRMWebServiceAppsNodelabel#testAppsRunning fails
 Key: YARN-8730
 URL: https://issues.apache.org/jira/browse/YARN-8730
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Jason Lowe


TestRMWebServiceAppsNodelabel is failing in branch-2.8:
{noformat}
Running 
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 8.473 sec <<< 
FAILURE! - in 
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel
testAppsRunning(org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel)
  Time elapsed: 6.708 sec  <<< FAILURE!
org.junit.ComparisonFailure: partition amused 
expected:<{"[]memory":1024,"vCores...> but 
was:<{"[res":{"memory":1024,"memorySize":1024,"virtualCores":1},"]memory":1024,"vCores...>
at org.junit.Assert.assertEquals(Assert.java:115)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel.verifyResource(TestRMWebServiceAppsNodelabel.java:222)
at 
org.apache.hadoop.yarn.server.resourcemanager.webapp.TestRMWebServiceAppsNodelabel.testAppsRunning(TestRMWebServiceAppsNodelabel.java:205)
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8703) Localized resource may leak on disk if container is killed while localizing

2018-08-23 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-8703:


 Summary: Localized resource may leak on disk if container is 
killed while localizing
 Key: YARN-8703
 URL: https://issues.apache.org/jira/browse/YARN-8703
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Jason Lowe


If a container is killed while localizing then it releases all of its 
resources.  If the resource count goes to zero and it is in the DOWNLOADING 
state then the resource bookkeeping is removed in the resource tracker.  
Shortly afterwards the localizer could heartbeat in and report the successful 
localization of the resource that was just removed.  When the 
LocalResourcesTrackerImpl receives the LOCALIZED event but does not find the 
corresponding LocalResource for the event then it simply logs a "localized 
without a location" warning.  At that point I think the localized resource has 
been leaked on the disk since the NM has removed bookkeeping for the resource 
without removing it on disk.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8672) TestContainerManager#testLocalingResourceWhileContainerRunning occasionally times out

2018-08-16 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-8672:


 Summary: 
TestContainerManager#testLocalingResourceWhileContainerRunning occasionally 
times out
 Key: YARN-8672
 URL: https://issues.apache.org/jira/browse/YARN-8672
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.2.0
Reporter: Jason Lowe


Precommit builds have been failing in 
TestContainerManager#testLocalingResourceWhileContainerRunning.  I have been 
able to reproduce the problem without any patch applied if I run the test 
enough times.  It looks like something is removing container tokens from the 
nmPrivate area just as a new localizer starts.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8385) Clean local directories when a container is killed

2018-07-09 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved YARN-8385.
--
Resolution: Invalid

Closing this as invalid since YARN is deleting the container directory and 
leaving the application directory as designed.  This appears to be a problem 
with the application rather than a problem with YARN.

> Clean local directories when a container is killed
> --
>
> Key: YARN-8385
> URL: https://issues.apache.org/jira/browse/YARN-8385
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.0
>Reporter: Marco Gaido
>Priority: Major
>
> In long running applications, it may happen that many containers are created 
> and killed. A use case is Spark Thrift Server when dynamic allocation is 
> enabled. A lot of containers are killed and the application keeps running 
> indefinitely.
> Currently, YARN seems to remove the local directories only when the whole 
> application terminates. In the scenario described above, this can cause 
> serious resource leakages. Please, check 
> https://issues.apache.org/jira/browse/SPARK-22575.
> I think YARN should clean up all the local directories of a container when it 
> is killed and not when the whole application terminates.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8462) Resource Manager shutdown with FATAL Exception

2018-07-06 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved YARN-8462.
--
Resolution: Duplicate

This is being handled by YARN-8193 with a new branch-2 patch posted there.

> Resource Manager shutdown with FATAL Exception
> --
>
> Key: YARN-8462
> URL: https://issues.apache.org/jira/browse/YARN-8462
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.9.0
>Reporter: Amithsha
>Priority: Critical
>
> Intermittently, the Resource Manager is going down with the following exceptions 
>  
> 2018-06-25 15:24:30,572 FATAL event.EventDispatcher 
> (EventDispatcher.java:run(75)) - Error in handling event type NODE_UPDATE to 
> the Event Dispatcher
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.getLocalityWaitFactor(RegularContainerAllocator.java:268)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.canAssign(RegularContainerAllocator.java:315)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:388)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:469)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:250)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:819)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:857)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:55)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:868)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1121)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1338)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1333)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1422)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1197)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:1059)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1464)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:150)
>         at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>         at java.lang.Thread.run(Thread.java:745)
> 2018-06-25 15:24:30,573 INFO  event.EventDispatcher 
> (EventDispatcher.java:run(79)) - Exiting, bbye..
> 2018-06-25 15:24:30,579 ERROR delegation.AbstractDele

[jira] [Created] (YARN-8473) Containers being launched as app tears down can leave containers in NEW state

2018-06-28 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-8473:


 Summary: Containers being launched as app tears down can leave 
containers in NEW state
 Key: YARN-8473
 URL: https://issues.apache.org/jira/browse/YARN-8473
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.8.4
Reporter: Jason Lowe
Assignee: Jason Lowe


I saw a case where containers were stuck on a nodemanager in the NEW state 
because they tried to launch just as an application was tearing down.  The 
container sent an INIT_CONTAINER event to the ApplicationImpl which then 
executed an invalid transition since that event is not handled/expected when 
the application is in the process of tearing down.




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8462) Resource Manager shutdown with FATAL Exception

2018-06-26 Thread Jason Lowe (JIRA)


 [ 
https://issues.apache.org/jira/browse/YARN-8462?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved YARN-8462.
--
Resolution: Duplicate

> Resource Manager shutdown with FATAL Exception
> --
>
> Key: YARN-8462
> URL: https://issues.apache.org/jira/browse/YARN-8462
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: capacity scheduler, resourcemanager
>Affects Versions: 2.9.0
>Reporter: Amithsha
>Priority: Critical
>
> Intermittently, the Resource Manager is going down with the following exceptions 
>  
> 2018-06-25 15:24:30,572 FATAL event.EventDispatcher 
> (EventDispatcher.java:run(75)) - Error in handling event type NODE_UPDATE to 
> the Event Dispatcher
> java.lang.NullPointerException
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.getLocalityWaitFactor(RegularContainerAllocator.java:268)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.canAssign(RegularContainerAllocator.java:315)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignOffSwitchContainers(RegularContainerAllocator.java:388)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainersOnNode(RegularContainerAllocator.java:469)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.tryAllocateOnNode(RegularContainerAllocator.java:250)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.allocate(RegularContainerAllocator.java:819)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator.assignContainers(RegularContainerAllocator.java:857)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.ContainerAllocator.assignContainers(ContainerAllocator.java:55)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.assignContainers(FiCaSchedulerApp.java:868)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:1121)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:734)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:558)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1338)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1333)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1422)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1197)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.nodeUpdate(CapacityScheduler.java:1059)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1464)
>         at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:150)
>         at 
> org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:66)
>         at java.lang.Thread.run(Thread.java:745)
> 2018-06-25 15:24:30,573 INFO  event.EventDispatcher 
> (EventDispatcher.java:run(79)) - Exiting, bbye..
> 2018-06-25 15:24:30,579 ERROR delegation.AbstractDelegationTokenSecretManager 
> (AbstractDelegationTokenSecretManager.java:run(6

[jira] [Created] (YARN-8375) TestCGroupElasticMemoryController fails surefire build

2018-05-29 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-8375:


 Summary: TestCGroupElasticMemoryController fails surefire build
 Key: YARN-8375
 URL: https://issues.apache.org/jira/browse/YARN-8375
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.2.0
Reporter: Jason Lowe


hadoop-yarn-server-nodemanager precommit builds have been failing unit tests 
recently because TestCGroupElasticMemoryController is either exiting or timing 
out.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8374) Upgrade objenesis dependency

2018-05-29 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-8374:


 Summary: Upgrade objenesis dependency
 Key: YARN-8374
 URL: https://issues.apache.org/jira/browse/YARN-8374
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: timelineservice
Reporter: Jason Lowe


After HADOOP-14918 is committed we should be able to remove the explicit 
objenesis dependency and objenesis exclusion from the fst dependency to pick up 
the version fst wants naturally.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-8358) ResourceManager restart fail to recover due to TimelineServiceV1Publisher NPE

2018-05-24 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-8358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved YARN-8358.
--
Resolution: Duplicate

> ResourceManager restart fail to recover due to TimelineServiceV1Publisher NPE
> -
>
> Key: YARN-8358
> URL: https://issues.apache.org/jira/browse/YARN-8358
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: resourcemanager
>Affects Versions: 2.9.1
> Environment: Ubuntu 16.04
> java version "1.8.0_91"
>Reporter: Che Yufei
>Priority: Major
>
> I'm upgrading from Hadoop 2.7.3 to 2.9.1. ResourceManager restart works fine 
> for 2.7.3, but fails on 2.9.1.
> I'm using LevelDB as the RM state store, the problem seems related to 
> TimelineServiceV1Publisher. If I set 
> yarn.resourcemanager.system-metrics-publisher.enabled to false, then recovery 
> works fine. But if the option is set to true, RM fails to start with the 
> following log:
>  
> {{2018-05-24 23:11:54,597 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Recovery 
> started}}
> {{2018-05-24 23:11:54,673 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Loaded 
> RM state version info 1.1}}
> {{2018-05-24 23:11:54,688 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore: 
> Recovered 12 RM delegation token master keys}}
> {{2018-05-24 23:11:54,688 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore: 
> Recovered 0 RM delegation tokens}}
> {{2018-05-24 23:11:54,990 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore: 
> Recovered 2099 applications and 2100 application attempts}}
> {{2018-05-24 23:11:54,998 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.recovery.LeveldbRMStateStore: 
> Recovered 0 reservations}}
> {{2018-05-24 23:11:54,998 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.security.RMDelegationTokenSecretManager:
>  recovering RMDelegationTokenSecretManager.}}
> {{2018-05-24 23:11:55,003 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Recovering 2099 
> applications}}
> {{2018-05-24 23:11:55,107 INFO 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager: Successfully 
> recovered 0 out of 2099 applications}}
> {{2018-05-24 23:11:55,108 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Failed to 
> load/recover state}}
> {{java.lang.NullPointerException}}
> {{ at 
> org.apache.hadoop.yarn.server.resourcemanager.metrics.TimelineServiceV1Publisher.appCreated(TimelineServiceV1Publisher.java:90)}}
> {{ at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.sendATSCreateEvent(RMAppImpl.java:1954)}}
> {{ at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.recover(RMAppImpl.java:931)}}
> {{ at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1061)}}
> {{ at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl$RMAppRecoveredTransition.transition(RMAppImpl.java:1054)}}
> {{ at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)}}
> {{ at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)}}
> {{ at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)}}
> {{ at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)}}
> {{ at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl.handle(RMAppImpl.java:878)}}
> {{ at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:339)}}
> {{ at 
> org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:533)}}
> {{ at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1394)}}
> {{ at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:758)}}
> {{ at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)}}
> {{ at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startActiveServices(ResourceManager.java:1147)}}
> {{ at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$1.run(ResourceManager.java:1187)}}
> {{ at 
> org.apache.hadoop.yarn.server.resourcemanage

[jira] [Created] (YARN-8284) get_docker_command refactoring

2018-05-11 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-8284:


 Summary: get_docker_command refactoring
 Key: YARN-8284
 URL: https://issues.apache.org/jira/browse/YARN-8284
 Project: Hadoop YARN
  Issue Type: Improvement
Affects Versions: 3.2.0, 3.1.1
Reporter: Jason Lowe


YARN-8274 occurred because get_docker_command's helper functions each have to 
remember to put the docker binary as the first argument.  This is error prone 
and causes code duplication for each of the helper functions.  It would be 
safer and simpler if get_docker_command initialized the docker binary argument 
in one place and each of the helper functions only added the arguments specific 
to their particular docker sub-command.
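
The code in question is C in container-executor, but purely to illustrate the 
shape of the refactoring being proposed (a hypothetical Java sketch, not the 
actual implementation): build the docker-binary prefix once, and have each 
sub-command helper append only its own arguments.
{code}
import java.util.ArrayList;
import java.util.List;

public class DockerCommandBuilder {
  private final List<String> args = new ArrayList<>();

  // The docker binary path is added exactly once, here, so no helper can
  // forget it or add it twice.
  public DockerCommandBuilder(String dockerBinary) {
    args.add(dockerBinary);
  }

  // Helpers only contribute their sub-command-specific arguments.
  public List<String> inspectCommand(String containerId) {
    List<String> cmd = new ArrayList<>(args);
    cmd.add("inspect");
    cmd.add(containerId);
    return cmd;
  }
}
{code}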



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8263) DockerClient still touches hadoop.tmp.dir

2018-05-08 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-8263:


 Summary: DockerClient still touches hadoop.tmp.dir
 Key: YARN-8263
 URL: https://issues.apache.org/jira/browse/YARN-8263
 Project: Hadoop YARN
  Issue Type: Sub-task
Affects Versions: 3.1.1
Reporter: Jason Lowe






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



Re: [VOTE] Release Apache Hadoop 2.7.6 (RC0)

2018-04-16 Thread Jason Lowe
Thanks for driving the release, Konstatin!

+1 (binding)

- Verified signatures and digests
- Completed a native build from source
- Deployed a single-node cluster
- Ran some sample jobs

Jason

On Mon, Apr 9, 2018 at 6:14 PM, Konstantin Shvachko
 wrote:
> Hi everybody,
>
> This is the next dot release of Apache Hadoop 2.7 line. The previous one 2.7.5
> was released on December 14, 2017.
> Release 2.7.6 includes critical bug fixes and optimizations. See more
> details in Release Note:
> http://home.apache.org/~shv/hadoop-2.7.6-RC0/releasenotes.html
>
> The RC0 is available at: http://home.apache.org/~shv/hadoop-2.7.6-RC0/
>
> Please give it a try and vote on this thread. The vote will run for 5 days
> ending 04/16/2018.
>
> My up to date public key is available from:
> https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
>
> Thanks,
> --Konstantin

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8147) TestClientRMService#testGetApplications sporadically fails

2018-04-11 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-8147:


 Summary: TestClientRMService#testGetApplications sporadically fails
 Key: YARN-8147
 URL: https://issues.apache.org/jira/browse/YARN-8147
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Jason Lowe
Assignee: Jason Lowe


testGetApplications can fail sporadically when testing start time filters on 
the request, e.g.:
{noformat}
java.lang.AssertionError: Incorrect number of matching start range expected:<0> 
but was:<1>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at 
org.apache.hadoop.yarn.server.resourcemanager.TestClientRMService.testGetApplications(TestClientRMService.java:798)
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-8120) JVM can crash with SIGSEGV when exiting due to custom leveldb logger

2018-04-05 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-8120:


 Summary: JVM can crash with SIGSEGV when exiting due to custom 
leveldb logger
 Key: YARN-8120
 URL: https://issues.apache.org/jira/browse/YARN-8120
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager, resourcemanager, timelineserver
Reporter: Jason Lowe
Assignee: Jason Lowe


The JVM can crash upon exit with a SIGSEGV when leveldb is configured with a 
custom user logger as is done with LeveldbLogger.  See 
https://github.com/fusesource/leveldbjni/issues/36 for details.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



Re: Apache Hadoop 3.0.1 Release plan

2018-01-09 Thread Jason Lowe
Is it necessary to cut the branch so far ahead of the release?  branch-3.0
is already a maintenance line for 3.0.x releases.  Is there a known
feature/improvement planned to go into branch-3.0 that is not desirable for
the 3.0.1 release?

I have found in the past that branching so early leads to many useful fixes
being unnecessarily postponed to future releases because committers forget
to pick to the new, relatively long-lived patch branch.  This becomes
especially true if blockers end up dragging out the ultimate release date,
which has historically been quite common.  My preference would be to cut
this branch as close to the RC as possible.

Jason


On Tue, Jan 9, 2018 at 1:17 PM, Lei Xu  wrote:

> Hi, All
>
> We have released Apache Hadoop 3.0.0 in December [1]. To further
> improve the quality of release, we plan to cut branch-3.0.1 branch
> tomorrow for the preparation of Apache Hadoop 3.0.1 release. The focus
> of 3.0.1 will be fixing blockers (3), critical bugs (1) and bug fixes
> [2].  No new features and improvement should be included.
>
> We plan to cut branch-3.0.1 tomorrow (Jan 10th) and vote for RC on Feb
> 1st, targeting for Feb 9th release.
>
> Please feel free to share your insights.
>
> [1] https://www.mail-archive.com/general@hadoop.apache.org/msg07757.html
> [2] https://issues.apache.org/jira/issues/?filter=12342842
>
> Best,
> --
> Lei (Eddy) Xu
> Software Engineer, Cloudera
>
> -
> To unsubscribe, e-mail: mapreduce-dev-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: mapreduce-dev-h...@hadoop.apache.org
>
>


[jira] [Created] (YARN-7721) TestContinuousScheduling fails sporadically with NPE

2018-01-09 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-7721:


 Summary: TestContinuousScheduling fails sporadically with NPE
 Key: YARN-7721
 URL: https://issues.apache.org/jira/browse/YARN-7721
 Project: Hadoop YARN
  Issue Type: Bug
  Components: fairscheduler
Affects Versions: 3.1.0
Reporter: Jason Lowe


TestContinuousScheduling#testFairSchedulerContinuousSchedulingInitTime is 
failing sporadically with an NPE in precommit builds, and I can usually 
reproduce it locally after a few tries:
{noformat}
[ERROR] 
testFairSchedulerContinuousSchedulingInitTime(org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling)
  Time elapsed: 0.085 s  <<< ERROR!
java.lang.NullPointerException
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.TestContinuousScheduling.testFairSchedulerContinuousSchedulingInitTime(TestContinuousScheduling.java:383)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
[...]
{noformat}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7703) Apps killed from the NEW state are not recorded in the state store

2018-01-04 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-7703:


 Summary: Apps killed from the NEW state are not recorded in the 
state store
 Key: YARN-7703
 URL: https://issues.apache.org/jira/browse/YARN-7703
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Jason Lowe


While reviewing YARN-7663 I noticed that apps killed from the NEW state skip 
storing anything to the RM state store.  That means upon restart and recovery 
these apps will not be recovered, so they will simply disappear.  That could be 
surprising for users.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7700) TestContainerSchedulerQueuing sporadically fails in precommit builds

2018-01-04 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-7700:


 Summary: TestContainerSchedulerQueuing sporadically fails in 
precommit builds
 Key: YARN-7700
 URL: https://issues.apache.org/jira/browse/YARN-7700
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.10.0
Reporter: Jason Lowe


TestContainerSchedulerQueuing#testKillOnlyRequiredOpportunisticContainers has 
been failing sporadically in precommit builds.  For example, from a branch-2 
precommit build:
{noformat}
java.lang.AssertionError: ContainerState is not correct (timedout)
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.assertTrue(Assert.java:41)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.BaseContainerManagerTest.waitForNMContainerState(BaseContainerManagerTest.java:390)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.BaseContainerManagerTest.waitForNMContainerState(BaseContainerManagerTest.java:362)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.TestContainerSchedulerQueuing.testKillOnlyRequiredOpportunisticContainers(TestContainerSchedulerQueuing.java:996)
{noformat}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-7659) NodeManager metrics return wrong value after update resource

2017-12-15 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved YARN-7659.
--
Resolution: Duplicate

> NodeManager metrics return wrong value after update resource
> 
>
> Key: YARN-7659
> URL: https://issues.apache.org/jira/browse/YARN-7659
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: Yang Wang
>
> {code:title=NodeManagerMetrics.java}
>   public void addResource(Resource res) {
> availableMB = availableMB + res.getMemorySize();
> availableGB.incr((int)Math.floor(availableMB/1024d));
> availableVCores.incr(res.getVirtualCores());
>   }
> {code}
> When the node resource is updated through the RM-NM heartbeat, the NM metrics 
> report the wrong value. 
> The root cause is that the new resource has already been added to availableMB, 
> so availableGB should not be incremented based on that running total again.
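
For illustration only, a self-contained sketch of the point the reporter makes 
above (this is not the NodeManagerMetrics class or a committed patch, and 
rounding of partial gigabytes is ignored for simplicity): the gigabyte counter 
should be bumped by the delta being added, not by the new running total.
{code}
// Sketch only: repeated resource updates double-count if availableGB is
// incremented by floor(availableMB/1024) after availableMB has already grown.
public class AvailableResourceTracker {
  private long availableMB;
  private long availableGB;
  private long availableVCores;

  public void addResource(long memorySizeMB, long vCores) {
    availableMB += memorySizeMB;
    availableGB += (long) Math.floor(memorySizeMB / 1024d); // delta, not total
    availableVCores += vCores;
  }
}
{code}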



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



Re: [VOTE] Release Apache Hadoop 2.7.5 (RC1)

2017-12-12 Thread Jason Lowe
Thanks for driving the release, Konstantin!

+1 (binding)

- Verified signatures and digests
- Successfully performed a native build from source
- Deployed a single-node cluster
- Ran some sample jobs and checked the logs

Jason


On Thu, Dec 7, 2017 at 9:22 PM, Konstantin Shvachko 
wrote:

> Hi everybody,
>
> I updated CHANGES.txt and fixed documentation links.
> Also committed  MAPREDUCE-6165, which fixes a consistently failing test.
>
> This is RC1 for the next dot release of Apache Hadoop 2.7 line. The
> previous one 2.7.4 was released August 4, 2017.
> Release 2.7.5 includes critical bug fixes and optimizations. See more
> details in Release Note:
> http://home.apache.org/~shv/hadoop-2.7.5-RC1/releasenotes.html
>
> The RC0 is available at: http://home.apache.org/~shv/hadoop-2.7.5-RC1/
>
> Please give it a try and vote on this thread. The vote will run for 5 days
> ending 12/13/2017.
>
> My up to date public key is available from:
> https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
>
> Thanks,
> --Konstantin
>


Re: [VOTE] Release Apache Hadoop 2.8.3 (RC0)

2017-12-12 Thread Jason Lowe
Thanks for driving this release, Junping!

+1 (binding)

- Verified signatures and digests
- Successfully performed native build from source
- Deployed a single-node cluster
- Ran some test jobs and examined the logs

Jason

On Tue, Dec 5, 2017 at 3:58 AM, Junping Du  wrote:

> Hi all,
>  I've created the first release candidate (RC0) for Apache Hadoop
> 2.8.3. This is our next maint release to follow up 2.8.2. It includes 79
> important fixes and improvements.
>
>   The RC artifacts are available at: http://home.apache.org/~
> junping_du/hadoop-2.8.3-RC0
>
>   The RC tag in git is: release-2.8.3-RC0
>
>   The maven artifacts are available via repository.apache.org at:
> https://repository.apache.org/content/repositories/orgapachehadoop-1072
>
>   Please try the release and vote; the vote will run for the usual 5
> working days, ending on 12/12/2017 PST time.
>
> Thanks,
>
> Junping
>


[jira] [Created] (YARN-7595) Container launching code suppresses close exceptions after writes

2017-12-01 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-7595:


 Summary: Container launching code suppresses close exceptions 
after writes
 Key: YARN-7595
 URL: https://issues.apache.org/jira/browse/YARN-7595
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Jason Lowe


There are a number of places in code related to container launching where the 
following pattern is used:
{code}
  try {
...write to stream outStream...
  } finally {
IOUtils.cleanupWithLogger(LOG, outStream);
  }
{code}

Unfortunately this suppresses any IOException that occurs during the close() 
method on outStream.  If the stream is buffered or could otherwise fail to 
finish writing the file when trying to close then this can lead to 
partial/corrupted data without throwing an I/O error.
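
A minimal sketch of a pattern that avoids the problem, assuming a hypothetical 
launch-script write (the class, method, and variable names below are 
placeholders, not identifiers from the YARN code): try-with-resources closes the 
stream and lets a failure in close() propagate rather than being swallowed.
{code}
import java.io.DataOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class LaunchScriptWriter {
  // Placeholder example: write a container launch script and surface any
  // IOException from write() or close() to the caller instead of logging it.
  public static void writeScript(String path, String contents) throws IOException {
    try (DataOutputStream out = new DataOutputStream(new FileOutputStream(path))) {
      out.writeBytes(contents);
    } // close() runs here; if it fails, the IOException propagates
  }
}
{code}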



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7589) TestPBImplRecords fails with NullPointerException

2017-11-30 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-7589:


 Summary: TestPBImplRecords fails with NullPointerException
 Key: YARN-7589
 URL: https://issues.apache.org/jira/browse/YARN-7589
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.1.0
Reporter: Jason Lowe


TestPBImplRecords is failing consistently in trunk:
{noformat}
[ERROR] Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 0.413 s 
<<< FAILURE! - in org.apache.hadoop.yarn.api.TestPBImplRecords
[ERROR] org.apache.hadoop.yarn.api.TestPBImplRecords  Time elapsed: 0.413 s  
<<< ERROR!
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.yarn.api.BasePBImplRecordsTest.generateByNewInstance(BasePBImplRecordsTest.java:151)
at 
org.apache.hadoop.yarn.api.TestPBImplRecords.setup(TestPBImplRecords.java:371)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:369)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:275)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:239)
at 
org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:160)
at 
org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:373)
at 
org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:334)
at 
org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:119)
at 
org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:407)
Caused by: java.lang.NullPointerException
at 
org.apache.hadoop.yarn.util.resource.ResourceUtils.createResourceTypesArray(ResourceUtils.java:644)
at 
org.apache.hadoop.yarn.api.records.Resource.newInstance(Resource.java:105)
... 23 more
{noformat}





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7583) Reduce overhead of container reacquisition

2017-11-29 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-7583:


 Summary: Reduce overhead of container reacquisition
 Key: YARN-7583
 URL: https://issues.apache.org/jira/browse/YARN-7583
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: nodemanager
Reporter: Jason Lowe


When reacquiring containers after a nodemanager restart, the Linux container 
executor invokes the container-executor binary to essentially kill -0 the process 
to check whether it is alive.  On Linux it would be much cheaper to stat the 
process's /proc/<pid> directory, which the nodemanager can do directly, rather 
than pay for the fork-and-exec through the container executor and risk signal 
permission issues.
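
Purely as an illustration of the cheaper check being described (a sketch, not the 
proposed patch; the class and method names are made up): on Linux a process with 
a given pid is alive exactly when its /proc/<pid> directory exists.
{code}
import java.nio.file.Files;
import java.nio.file.Paths;

public final class ProcLivenessCheck {
  // Sketch: stat /proc/<pid> directly instead of fork-and-exec'ing the
  // container executor just to send signal 0.  (A recycled pid could still
  // give a false positive, just as kill -0 would.)
  public static boolean isAlive(int pid) {
    return Files.isDirectory(Paths.get("/proc", Integer.toString(pid)));
  }
}
{code}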




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7576) Findbug warning for Resource exposing internal representation

2017-11-28 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-7576:


 Summary: Findbug warning for Resource exposing internal 
representation
 Key: YARN-7576
 URL: https://issues.apache.org/jira/browse/YARN-7576
 Project: Hadoop YARN
  Issue Type: Bug
  Components: api
Affects Versions: 3.0.0
Reporter: Jason Lowe


Precommit builds are complaining about a findbugs warning:
{noformat}
EI  org.apache.hadoop.yarn.api.records.Resource.getResources() may expose 
internal representation by returning Resource.resources

Bug type EI_EXPOSE_REP (click for details)
In class org.apache.hadoop.yarn.api.records.Resource
In method org.apache.hadoop.yarn.api.records.Resource.getResources()
Field org.apache.hadoop.yarn.api.records.Resource.resources
At Resource.java:[line 213]

Returning a reference to a mutable object value stored in one of the object's 
fields exposes the internal representation of the object.  If instances are 
accessed by untrusted code, and unchecked changes to the mutable object would 
compromise security or other important properties, you will need to do 
something different. Returning a new copy of the object is better approach in 
many situations.
{noformat}
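
A generic illustration of what the warning means (not the actual Resource code, 
whose eventual fix may differ): returning an internal mutable field lets callers 
change the object's state, and the remedy the warning text describes is to return 
a copy.
{code}
import java.util.Arrays;

public class ExposeRepExample {
  private final long[] resources = {1024L, 2L};

  // Pattern findbugs flags as EI_EXPOSE_REP: callers can mutate 'resources'.
  public long[] getResourcesUnsafe() {
    return resources;
  }

  // Remedy described in the warning text: return a defensive copy.
  public long[] getResourcesCopy() {
    return Arrays.copyOf(resources, resources.length);
  }
}
{code}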




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



Re: [VOTE] Release Apache Hadoop 2.9.0 (RC3)

2017-11-17 Thread Jason Lowe
Thanks for putting this release together!

+1 (binding)

- Verified signatures and digests
- Successfully built from source including native
- Deployed to single-node cluster and ran some test jobs

Jason


On Mon, Nov 13, 2017 at 6:10 PM, Arun Suresh  wrote:

> Hi Folks,
>
> Apache Hadoop 2.9.0 is the first release of Hadoop 2.9 line and will be the
> starting release for the Apache Hadoop 2.9.x line - it includes 30 New Features
> with 500+ subtasks, 407 Improvements, and 790 Bug fixes newly fixed since
> 2.8.2.
>
> More information about the 2.9.0 release plan can be found here:
> https://cwiki.apache.org/confluence/display/HADOOP/Roadmap#Roadmap-Version2.9
>
> New RC is available at: https://home.apache.org/~asuresh/hadoop-2.9.0-RC3/
>
> The RC tag in git is: release-2.9.0-RC3, and the latest commit id is:
> 756ebc8394e473ac25feac05fa493f6d612e6c50.
>
> The maven artifacts are available via repository.apache.org at:
> https://repository.apache.org/content/repositories/orgapachehadoop-1068/
>
> We are carrying over the votes from the previous RC given that the delta is
> the license fix.
>
> Given the above - we are also going to stick with the original deadline for
> the vote : ending on Friday 17th November 2017 2pm PT time.
>
> Thanks,
> -Arun/Subru
>


[jira] [Created] (YARN-7502) Nodemanager restart docs should describe nodemanager supervised property

2017-11-15 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-7502:


 Summary: Nodemanager restart docs should describe nodemanager 
supervised property
 Key: YARN-7502
 URL: https://issues.apache.org/jira/browse/YARN-7502
 Project: Hadoop YARN
  Issue Type: Bug
  Components: documentation
Affects Versions: 2.7.4, 2.9.0, 2.8.2, 3.0.0
Reporter: Jason Lowe


The yarn.nodemanager.recovery.supervised property is not mentioned in the 
nodemanager restart documentation.  The docs should describe what this property 
does and when it is useful to set it to a value different than the 
work-preserving restart property.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7455) add_mounts can overrun temporary buffer

2017-11-07 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-7455:


 Summary: add_mounts can overrun temporary buffer
 Key: YARN-7455
 URL: https://issues.apache.org/jira/browse/YARN-7455
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.9.0, 3.0.0
Reporter: Jason Lowe


While reviewing YARN-7197 I noticed that add_mounts in docker_util.c has a 
potential buffer overflow since tmp_buffer is only 1024 bytes which may not be 
sufficient to hold the specified mount path.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7454) RMAppAttemptMetrics#getAggregate can NPE due to double lookup

2017-11-07 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-7454:


 Summary: RMAppAttemptMetrics#getAggregate can NPE due to double 
lookup
 Key: YARN-7454
 URL: https://issues.apache.org/jira/browse/YARN-7454
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Jason Lowe
Priority: Minor






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-7433) java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support.

2017-11-06 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved YARN-7433.
--
Resolution: Invalid

Closing this since this is a user issue with the build and/or deployment of 
Hadoop and not a bug in Hadoop itself.


> java.lang.RuntimeException: native snappy library not available: this version 
> of libhadoop was built without snappy support.
> 
>
> Key: YARN-7433
> URL: https://issues.apache.org/jira/browse/YARN-7433
> Project: Hadoop YARN
>  Issue Type: Bug
>Reporter: gehaijiang
>Priority: Trivial
>
> After upgrading from CentOS 6.5 to CentOS 7: Hadoop 2.7.1 was compiled on 
> CentOS 6.5 with snappy support, and that same build now runs on CentOS 7, 
> but YARN tasks fail with the error log below.
> (Do the Hadoop native libraries need to be recompiled for CentOS 7?)
> Error: java.lang.RuntimeException: native snappy library not available: this 
> version of libhadoop was built without snappy support.
>   at 
> org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:65)
>   at 
> org.apache.hadoop.io.compress.SnappyCodec.getCompressorType(SnappyCodec.java:134)
>   at 
> org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:150)
>   at 
> org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:165)
>   at org.apache.hadoop.mapred.IFile$Writer.<init>(IFile.java:114)
>   at org.apache.hadoop.mapred.IFile$Writer.<init>(IFile.java:97)
>   at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1856)
>   at 
> org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1511)
>   at 
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:723)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:793)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:415)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Error: java.io.IOException: Spill failed



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



Re: [VOTE] Release Apache Hadoop 2.8.2 (RC1)

2017-10-23 Thread Jason Lowe
+1 (binding)

- Verified signatures and digests
- Performed a native build from source
- Deployed to a single-node cluster
- Ran some sample jobs

The CHANGES.md and RELEASENOTES.md both refer to release 2.8.0 instead of
2.8.2, and I do not see the list of JIRAs in CHANGES.md that have been
committed since 2.8.1.  Since we're voting on the source bits rather than
the change log I kept my vote as a +1 as I do see the 2.8.2 changes in the
source code.

Jason


On Thu, Oct 19, 2017 at 7:42 PM, Junping Du  wrote:

> Hi folks,
>  I've created our new release candidate (RC1) for Apache Hadoop 2.8.2.
>
>  Apache Hadoop 2.8.2 is the first stable release of Hadoop 2.8 line
> and will be the latest stable/production release for Apache Hadoop - it
> includes 315 new fixed issues since 2.8.1 and 69 fixes are marked as
> blocker/critical issues.
>
>   More information about the 2.8.2 release plan can be found here:
> https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+2.8+Release
>
>   New RC is available at: http://home.apache.org/~junping_du/hadoop-2.8.2-RC1
>
>   The RC tag in git is: release-2.8.2-RC1, and the latest commit id
> is: 66c47f2a01ad9637879e95f80c41f798373828fb
>
>   The maven artifacts are available via repository.apache.org at:
> https://repository.apache.org/content/repositories/orgapachehadoop-1064
>
>   Please try the release and vote; the vote will run for the usual 5
> days, ending on 10/24/2017 6pm PST time.
>
> Thanks,
>
> Junping
>
>


[jira] [Created] (YARN-7333) container-executor fails to remove entries from a directory that is not writable or executable

2017-10-16 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-7333:


 Summary: container-executor fails to remove entries from a 
directory that is not writable or executable
 Key: YARN-7333
 URL: https://issues.apache.org/jira/browse/YARN-7333
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 3.0.0-alpha1, 2.9.0, 2.8.2
Reporter: Jason Lowe
Assignee: Jason Lowe
Priority: Critical


Similar to the situation from YARN-4594, container-executor will fail to 
cleanup directories that do not have write and execute permissions for the 
directory.  YARN-4594 fixed the scenario where the directory is not readable, 
but it missed the case where we can open the directory but either not traverse 
it (i.e.: no execute permission) or cannot remove entries from within it (i.e.: 
no write permissions).




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-7319) java.net.UnknownHostException when trying contact node by hostname

2017-10-12 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved YARN-7319.
--
Resolution: Invalid

JIRA is for tracking features and defects in Apache Hadoop and not for general 
user support.  This is not a problem in Apache Hadoop but rather a problem with 
the network setup of the hosts.  See 
https://wiki.apache.org/hadoop/InvalidJiraIssues and 
https://wiki.apache.org/hadoop/UnknownHost for more details.  I highly 
recommend asking this on the [Hadoop user mailing 
list|http://hadoop.apache.org/mailing_lists.html#User] where hopefully someone 
with experience setting up a Kubernetes cluster to run Apache Hadoop can assist 
you.


> java.net.UnknownHostException when trying contact node by hostname
> --
>
> Key: YARN-7319
> URL: https://issues.apache.org/jira/browse/YARN-7319
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Reporter: Evgeny Makarov
>
> I'm trying to setup Hadoop on Kubernetes cluster with following setup:
> Hadoop master is k8s pod
> Each hadoop slave is additional k8s pod
> All communication is IP-based. In HDFS I have 
> dfs.namenode.datanode.registration.ip-hostname-check set to false 
> and everything works fine; however, the same option is missing for YARN. 
> Here is part of the hadoop-master log from submitting a simple word-count job:
> 2017-10-12 09:00:25,005 ERROR 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt:
>  Error trying to assign container token and NM token to an allocated 
> container container_1507798393049_0001_01_01
> java.lang.IllegalArgumentException: java.net.UnknownHostException: 
> hadoop-slave-743067341-hqrbk
> at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:377)
> at 
> org.apache.hadoop.yarn.server.utils.BuilderUtils.newContainerToken(BuilderUtils.java:258)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.security.RMContainerTokenSecretManager.createContainerToken(RMContainerTokenSecretManager.java:220)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.pullNewlyAllocatedContainersAndNMTokens(SchedulerApplicationAttempt.java:454)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.getAllocation(FiCaSchedulerApp.java:269)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocate(CapacityScheduler.java:988)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:971)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl$AMContainerAllocatedTransition.transition(RMAppAttemptImpl.java:964)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:385)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:789)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:105)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:795)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:776)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:183)
> at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:109)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.UnknownHostException: hadoop-slave-743067341-hqrbk
> ... 19 more
> As can be seen, host hadoop-slave-743067341-hqrbk is unreachable. Adding a 
> record to /etc/hosts on the master would solve the problem, but that is not an 
> option in a Kubernetes environment. There should be a way to resolve nodes 
> by IP address.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7299) TestDistributedScheduler is failing

2017-10-09 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-7299:


 Summary: TestDistributedScheduler is failing
 Key: YARN-7299
 URL: https://issues.apache.org/jira/browse/YARN-7299
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jason Lowe


TestDistributedScheduler has been failing consistently in trunk:
{noformat}
Running 
org.apache.hadoop.yarn.server.nodemanager.scheduler.TestDistributedScheduler
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.75 sec <<< 
FAILURE! - in 
org.apache.hadoop.yarn.server.nodemanager.scheduler.TestDistributedScheduler
testDistributedScheduler(org.apache.hadoop.yarn.server.nodemanager.scheduler.TestDistributedScheduler)
  Time elapsed: 0.67 sec  <<< FAILURE!
java.lang.AssertionError: expected:<4> but was:<2>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.hadoop.yarn.server.nodemanager.scheduler.TestDistributedScheduler.testDistributedScheduler(TestDistributedScheduler.java:118)
{noformat}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-7288) ContainerLocalizer with multiple JVM Options

2017-10-04 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved YARN-7288.
--
Resolution: Invalid

> ContainerLocalizer with multiple JVM Options
> 
>
> Key: YARN-7288
> URL: https://issues.apache.org/jira/browse/YARN-7288
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: nodemanager
>Affects Versions: 2.7.3
>Reporter: Prabhu Joseph
>Assignee: Prabhu Joseph
>
> Currently ContainerLocalizer can be configured with only a single JVM option 
> through yarn.nodemanager.container-localizer.java.opts. There are cases where 
> we need more than one, such as adding -Dlog4j.debug / -verbose to debug issues.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7285) ContainerExecutor always launches with priorities due to yarn-default property

2017-10-03 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-7285:


 Summary: ContainerExecutor always launches with priorities due to 
yarn-default property
 Key: YARN-7285
 URL: https://issues.apache.org/jira/browse/YARN-7285
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.8.0, 2.9.0
Reporter: Jason Lowe
Assignee: Jason Lowe
Priority: Minor


ContainerExecutor will launch containers with a specified priority if a 
priority adjustment is specified, otherwise with the OS default priority if it 
is unspecified.  YARN-3069 added 
yarn.nodemanager.container-executor.os.sched.priority.adjustment to 
yarn-default.xml, so it is always specified even if the user did not explicitly 
set it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7248) NM returns new SCHEDULED container status to older clients

2017-09-25 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-7248:


 Summary: NM returns new SCHEDULED container status to older clients
 Key: YARN-7248
 URL: https://issues.apache.org/jira/browse/YARN-7248
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.0.0-alpha2, 2.9.0
Reporter: Jason Lowe
Priority: Blocker


YARN-4597 added a new SCHEDULED container state, and that state is returned to 
clients when the container is localizing, etc.  However the client may be 
running on an older software version that does not have the new SCHEDULED state, 
which could lead the client to crash on the unexpected container state value or 
to make incorrect assumptions, such as assuming any state other than NEW and 
RUNNING must be COMPLETED, which was true in the older version.
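
A self-contained sketch of the compatibility hazard and one possible server-side mitigation is below; the enum and methods are illustrative stand-ins, not the actual ContainerState or NM code.
{code}
// Illustrative only: a stand-in enum, not org.apache.hadoop.yarn.api.records.ContainerState.
enum State { NEW, RUNNING, SCHEDULED, COMPLETE }

class StateCompat {
  // An older client written against NEW/RUNNING/COMPLETE may misclassify the
  // new SCHEDULED value as a finished container:
  static boolean looksFinishedToOldClient(State s) {
    return s != State.NEW && s != State.RUNNING;
  }

  // One possible mitigation (hypothetical): downgrade the new value before
  // reporting status to clients that predate it.
  static State toLegacyState(State s) {
    return (s == State.SCHEDULED) ? State.NEW : s;
  }
}
{code}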



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7226) Whitelisted variables do not support delayed variable expansion

2017-09-20 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-7226:


 Summary: Whitelisted variables do not support delayed variable 
expansion
 Key: YARN-7226
 URL: https://issues.apache.org/jira/browse/YARN-7226
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 3.0.0-alpha4, 2.8.1, 2.9.0
Reporter: Jason Lowe
Assignee: Jason Lowe


The nodemanager supports a configurable list of environment variables, via 
yarn.nodemanager.env-whitelist, that will be propagated to the container's 
environment unless those variables were specified in the container launch 
context.  Unfortunately the handling of these whitelisted variables prevents 
using delayed variable expansion.  For example, if a user shipped their own 
version of hadoop with their job via the distributed cache and specified:
{noformat}
HADOOP_COMMON_HOME={{PWD}}/my-private-hadoop/
{noformat}
 as part of their job, the variable will be set as the *literal* string:
{noformat}
$PWD/my-private-hadoop/
{noformat}
rather than having $PWD expand to the container's current directory as it does 
for any other, non-whitelisted variable being set to the same value.
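
To make the expected behavior concrete, here is a conceptual sketch of delayed expansion, i.e. writing the export line so the container's shell resolves $PWD at launch time rather than the NM resolving it up front. This is illustrative only and not the actual ContainerLaunch code.
{code}
// Conceptual sketch of delayed expansion when writing the launch script.
// Not the actual nodemanager ContainerLaunch implementation.
import java.io.PrintWriter;

class LaunchScriptSketch {
  static void writeEnv(PrintWriter script, String name, String value) {
    // Quote but do not pre-expand; a value like "$PWD/my-private-hadoop/" is
    // then resolved by the container's shell in its own working directory.
    script.println("export " + name + "=\"" + value + "\"");
  }
}
{code}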



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7164) TestAMRMClientOnRMRestart fails sporadically with bind address in use

2017-09-06 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-7164:


 Summary: TestAMRMClientOnRMRestart fails sporadically with bind 
address in use
 Key: YARN-7164
 URL: https://issues.apache.org/jira/browse/YARN-7164
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Affects Versions: 2.8.0
Reporter: Jason Lowe
Assignee: Jason Lowe


Saw a bind address in use exception in 
TestAMRMClientOnRMRestart#testAMRMClientOnAMRMTokenRollOverOnRMRestart on 
Hadoop 2.8.  The error looks similar to YARN-4251, but that fix was already 
present.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



Re: [VOTE] Merge feature branch YARN-5355 (Timeline Service v2) to trunk

2017-08-29 Thread Jason Lowe
+1 (binding)

I participated in the review for the reader authorization and verified that
ATSv2 has no significant impact when disabled.  Looking forward to seeing
the next increment in functionality in a release.  A big thank you to
everyone involved in this effort!

Jason


On Tue, Aug 22, 2017 at 1:32 AM, Vrushali Channapattan <
vrushalic2...@gmail.com> wrote:

> Hi folks,
>
> Per earlier discussion [1], I'd like to start a formal vote to merge
> feature branch YARN-5355 [2] (Timeline Service v.2) to trunk. The vote will
> run for 7 days, and will end August 29 11:00 PM PDT.
>
> We have previously completed one merge onto trunk [3] and Timeline Service
> v2 has been part of Hadoop release 3.0.0-alpha1.
>
> Since then, we have been working on extending the capabilities of Timeline
> Service v2 in a feature branch [2] for a while, and we are reasonably
> confident that the state of the feature meets the criteria to be merged
> onto trunk and we'd love folks to get their hands on it in a test capacity
> and provide valuable feedback so that we can make it production-ready.
>
> In a nutshell, Timeline Service v.2 delivers significant scalability and
> usability improvements based on a new architecture. What we would like to
> merge to trunk is termed "alpha 2" (milestone 2). The feature has a
> complete end-to-end read/write flow with security and read level
> authorization via whitelists. You should be able to start setting it up and
> testing it.
>
> At a high level, the following are the key features that have been
> implemented since alpha1:
> - Security via Kerberos Authentication and delegation tokens
> - Read side simple authorization via whitelist
> - Client configurable entity sort ordering
> - Richer REST APIs for apps, app attempts, containers, fetching metrics by
> timerange, pagination, sub-app entities
> - Support for storing sub-application entities (entities that exist outside
> the scope of an application)
> - Configurable TTLs (time-to-live) for tables, configurable table prefixes,
> configurable hbase cluster
> - Flow level aggregations done as dynamic (table level) coprocessors
> - Uses latest stable HBase release 1.2.6
>
> There are a total of 82 subtasks that were completed as part of this
> effort.
>
> We paid close attention to ensure that Timeline Service v.2 does not impact
> existing functionality when disabled (which is the default).
>
> Special thanks to a team of folks who worked hard and contributed towards
> this effort with patches, reviews and guidance: Rohith Sharma K S, Varun
> Saxena, Haibo Chen, Sangjin Lee, Li Lu, Vinod Kumar Vavilapalli, Joep
> Rottinghuis, Jason Lowe, Jian He, Robert Kanter, Micheal Stack.
>
> Regards,
> Vrushali
>
> [1] http://www.mail-archive.com/yarn-dev@hadoop.apache.org/msg27383.html
> [2] https://issues.apache.org/jira/browse/YARN-5355
> [3] https://issues.apache.org/jira/browse/YARN-2928
> [4] https://github.com/apache/hadoop/commits/YARN-5355
>


[jira] [Resolved] (YARN-7110) NodeManager always crash for spark shuffle service out of memory

2017-08-28 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-7110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved YARN-7110.
--
Resolution: Duplicate

> NodeManager always crash for spark shuffle service out of memory
> 
>
> Key: YARN-7110
> URL: https://issues.apache.org/jira/browse/YARN-7110
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: nodemanager
>Reporter: YunFan Zhou
>Priority: Critical
> Attachments: screenshot-1.png
>
>
> The NM often crashes due to the Spark shuffle service.  I saw many error log 
> messages before the NM crashed:
> {noformat}
> 2017-08-28 16:14:20,521 ERROR 
> org.apache.spark.network.server.TransportRequestHandler: Error sending result 
> ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=79124460, 
> chunkIndex=0}, 
> buffer=FileSegmentManagedBuffer{file=/data11/hadoopdata/nodemanager/local/usercache/map_loc/appcache/application_1502793246072_2171283/blockmgr-11e2d625-8db1-477c-9365-4f6d0a7d1c48/10/shuffle_0_6_0.data,
>  offset=27063401500, length=64785602}} to /10.93.91.17:18958; closing 
> connection
> java.io.IOException: Broken pipe
> at sun.nio.ch.FileChannelImpl.transferTo0(Native Method)
> at 
> sun.nio.ch.FileChannelImpl.transferToDirectlyInternal(FileChannelImpl.java:428)
> at 
> sun.nio.ch.FileChannelImpl.transferToDirectly(FileChannelImpl.java:493)
> at sun.nio.ch.FileChannelImpl.transferTo(FileChannelImpl.java:608)
> at 
> org.apache.spark.network.buffer.LazyFileRegion.transferTo(LazyFileRegion.java:96)
> at 
> org.apache.spark.network.protocol.MessageWithHeader.transferTo(MessageWithHeader.java:92)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doWriteFileRegion(NioSocketChannel.java:254)
> at 
> io.netty.channel.nio.AbstractNioByteChannel.doWrite(AbstractNioByteChannel.java:237)
> at 
> io.netty.channel.socket.nio.NioSocketChannel.doWrite(NioSocketChannel.java:281)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:761)
> at 
> io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.forceFlush(AbstractNioChannel.java:317)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:519)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> at 
> io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> at java.lang.Thread.run(Thread.java:745)
> 2017-08-28 16:14:20,523 ERROR 
> org.apache.spark.network.server.TransportRequestHandler: Error sending result 
> RpcResponse{requestId=7652091066050104512, 
> body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=13 cap=13]}} to 
> /10.93.91.17:18958; closing connection
> {noformat}
> Eventually there are too many *Finalizer* objects in the NM process, which 
> causes the OOM.
> !screenshot-1.png!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



Re: [DISCUSS] Branches and versions for Hadoop 3

2017-08-28 Thread Jason Lowe
Allen Wittenauer wrote:


> > On Aug 25, 2017, at 1:23 PM, Jason Lowe <jl...@oath.com> wrote:
> >
> > Allen Wittenauer wrote:
> >
> > > Doesn't this place an undue burden on the contributor with the first
> incompatible patch to prove worthiness?  What happens if it is decided that
> it's not good enough?
> >
> > It is a burden for that first, "this can't go anywhere else but 4.x"
> change, but arguably that should not be a change done lightly anyway.  (Or
> any other backwards-incompatible change for that matter.)  If it's worth
> committing then I think it's perfectly reasonable to send out the dev
> announce that there's reason for trunk to diverge from 3.x, cut branch-3,
> and move on.  This is no different than Andrew's recent announcement that
> there's now a need for separating trunk and the 3.0 line based on what's
> about to go in.
>
> So, by this definition as soon as a patch comes in to remove
> deprecated bits there will be no issue with a branch-3 getting created,
> correct?
>

I think this gets back to the "if it's worth committing" part.  I feel the
community should collectively decide when it's worth taking the hit to
maintain the separate code line.  IMHO removing deprecated bits alone is
not reason enough to diverge the code base and the additional maintenance
that comes along with the extra code line.  A new feature is traditionally
the reason to diverge because that's something users would actually care
enough about to take the compatibility hit when moving to the version that
has it.  That also helps drive a timely release of the new code line
because users want the feature that went into it.


> >  Otherwise if past trunk behavior is any indication, it ends up mostly
> enabling people to commit to just trunk, forgetting that the thing they are
> committing is perfectly valid for branch-3.
>
> I'm not sure there was any "forgetting" involved.  We likely
> wouldn't be talking about 3.x at all if it wasn't for the code diverging
> enough.
>

I don't think it was the myriad of small patches that went only into trunk
over the last 6 years that drove this.  Instead I think it was simply that
an "important enough" feature went in, like erasure coding, that gathered
momentum behind this release.  Trunk sat ignored for basically 5+ years,
and plenty of patches went into just trunk that should have gone into at
least branch-2 as well.  I don't think we as a community did the
contributors any favors by putting their changes into a code line that
didn't see a release for a very long time.  Yes 3.x could have released
sooner to help solve that issue, but given the complete lack of excitement
around 3.x until just recently is there any reason this won't happen again
with 4.x?  Seems to me 4.x will need to have something "interesting enough"
to drive people to release it relative to 3.x, which to me indicates we
shouldn't commit things only to there until we have an interest to do so.

> > Given the number of committers that openly ignore discussions like
> this, who is going to verify that incompatible changes don't get in?
> >
> > The same entities who are verifying other bugs don't get in, i.e.: the
> committers and the Hadoop QA bot running the tests.
> >  Yes, I know that means it's inevitable that compatibility breakages
> will happen, and we can and should improve the automation around
> compatibility testing when possible.
>
> The automation only goes so far.  At least while investigating
> Yetus bugs, I've seen more than enough blatant and purposeful ignored
> errors and warnings that I'm not convinced it will be effective. ("That
> javadoc compile failure didn't come from my patch!"  Um, yes, yes it did.)
> PR for features has greatly trumped code correctness for a few years now.
>

I totally agree here.  We can and should do better about this outside of
automation.  I brought up automation since I see it as a useful part of the
total solution along with better developer education, oversight, etc.  I'm
thinking specifically about tools that can report on public API signature
changes, but that's just one aspect of compatibility.  Semantic behavior is
not something a static analysis tool can automatically detect, and the only
way to automate some of that is something like end-to-end compatibility
testing.  Bigtop may cover some of this with testing of older versions of
downstream projects like HBase, Hive, Oozie, etc., and we could setup some
tests that standup two different Hadoop clusters and run tests that verify
interop between them.  But the tests will never be exhaustive and we will
still need educated committers and oversight to fill in the gaps.

>  But I don't think there's a magic bullet for preventing all
> compatibility bugs from being introduced, just like there isn't one for
> preventing general bugs.

[jira] [Created] (YARN-7112) TestAMRMProxy is failing with invalid request

2017-08-28 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-7112:


 Summary: TestAMRMProxy is failing with invalid request
 Key: YARN-7112
 URL: https://issues.apache.org/jira/browse/YARN-7112
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.9.0, 3.0.0-beta1, 2.8.2
Reporter: Jason Lowe
Assignee: Jason Lowe


The testAMRMProxyE2E and testAMRMProxyTokenRenewal tests in TestAMRMProxy are 
failing:
{noformat}
org.apache.hadoop.yarn.exceptions.InvalidApplicationMasterRequestException: 
Invalid responseId in AllocateRequest from application attempt: 
appattempt_1503933047334_0001_01, expect responseId to be 0, but get 1
{noformat}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



Re: [DISCUSS] Branches and versions for Hadoop 3

2017-08-25 Thread Jason Lowe
Allen Wittenauer wrote:


> Doesn't this place an undue burden on the contributor with the first
> incompatible patch to prove worthiness?  What happens if it is decided that
> it's not good enough?


It is a burden for that first, "this can't go anywhere else but 4.x"
change, but arguably that should not be a change done lightly anyway.  (Or
any other backwards-incompatible change for that matter.)  If it's worth
committing then I think it's perfectly reasonable to send out the dev
announce that there's reason for trunk to diverge from 3.x, cut branch-3,
and move on.  This is no different than Andrew's recent announcement that
there's now a need for separating trunk and the 3.0 line based on what's
about to go in.

I do not think it makes sense to pay for the maintenance overhead of two
nearly-identical lines with no backwards-incompatible changes between them
until we have the need.  Otherwise if past trunk behavior is any
indication, it ends up mostly enabling people to commit to just trunk,
forgetting that the thing they are committing is perfectly valid for
branch-3.  If we can agree that trunk and branch-3 should be equivalent
until an incompatible change goes into trunk, why pay for the commit
overhead and potential for accidentally missed commits until it is really
necessary?

How many will it take before the dam will break?  Or is there a timeline
> going to be given before trunk gets set to 4.x?


I think the threshold count for the dam should be 1.  As soon as we have a
JIRA that needs to be committed to move the project forward and we cannot
ship it in a 3.x release then we create branch-3 and move trunk to 4.x.
As for a timeline going to 4.x, again I don't see it so much as a "baking
period" as a "when we need it" criteria.  If we need it in a week then we
should cut it in a week.  Or a year then a year.  It all depends upon when
that 4.x-only change is ready to go in.

Given the number of committers that openly ignore discussions like this,
> who is going to verify that incompatible changes don't get in?
>

The same entities who are verifying other bugs don't get in, i.e.: the
committers and the Hadoop QA bot running the tests.  Yes, I know that means
it's inevitable that compatibility breakages will happen, and we can and
should improve the automation around compatibility testing when possible.
But I don't think there's a magic bullet for preventing all compatibility
bugs from being introduced, just like there isn't one for preventing
general bugs.  Does having a trunk branch separate but essentially similar
to branch-3 make this any better?

Longer term:  what is the PMC doing to make sure we start doing major
> releases in a timely fashion again?  In other words, is this really an
> issue if we shoot for another major in (throws dart) 2 years?
>

If we're trying to do semantic versioning then we shouldn't have a regular
cadence for major releases unless we have a regular cadence of changes that
break compatibility.  I'd hope that's not something we would strive
towards.  I do agree that we should try to be better about shipping
releases, major or minor, in a more timely manner, but I don't agree that
we should cut 4.0 simply based on a duration since the last major release.
The release contents and community's desire for those contents should
dictate the release numbering and schedule, respectively.

Jason


On Fri, Aug 25, 2017 at 2:16 PM, Allen Wittenauer 
wrote:

>
> > On Aug 25, 2017, at 10:36 AM, Andrew Wang 
> wrote:
>
> > Until we need to make incompatible changes, there's no need for
> > a Hadoop 4.0 version.
>
> Some questions:
>
> Doesn't this place an undue burden on the contributor with the
> first incompatible patch to prove worthiness?  What happens if it is
> decided that it's not good enough?
>
> How many will it take before the dam will break?  Or is there a
> timeline going to be given before trunk gets set to 4.x?
>
> Given the number of committers that openly ignore discussions like
> this, who is going to verify that incompatible changes don't get in?
>
> Longer term:  what is the PMC doing to make sure we start doing
> major releases in a timely fashion again?  In other words, is this really
> an issue if we shoot for another major in (throws dart) 2 years?
> -
> To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
> For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org
>
>


Re: Branch merges and 3.0.0-beta1 scope

2017-08-25 Thread Jason Lowe
Andrew Wang wrote:


> This means I'll cut branch-3 and
> branch-3.0, and move trunk to 4.0.0 before these VOTEs end. This will open
> up development for Hadoop 3.1.0 and 4.0.0.


I can see a need for branch-3.0, but please do not create branch-3.  Doing
so will relegate trunk back to the "patch purgatory" branch, a place where
patches won't see a release for years.  Unless something is imminently
going in that will break backwards compatibility and warrant a new 4.x
release, I don't see the need to distinguish trunk from the 3.x line.
Leaving trunk as the 3.x line means less branches to commit patches through
and more testing of every patch since trunk would remain an active area for
testing and releasing.  If we separate trunk and branch-3 then it's almost
certain only-trunk patches will start to accumulate and never get any
"real" testing until someone eventually decides it's time to go to Hadoop
4.x.  Looking back at trunk-as-3.x for an example, patches committed there
in the early days after branch-2 was cut didn't see a release for almost 6
years.

My apologies if I've missed a feature that is just going to miss the 3.0
release and will break compatibility when it goes in.  If so then we need
to cut branch-3, but if not then here's my plea to hold off until we do
need it.

Jason


On Thu, Aug 24, 2017 at 3:33 PM, Andrew Wang 
wrote:

> Glad to see the discussion continued in my absence :)
>
> From a release management perspective, it's *extremely* reasonable to block
> the inclusion of new features a month from the planned release date. A
> typical software development lifecycle includes weeks of feature freeze and
> weeks of code freeze. It is no knock on any developer or any feature to say
> that we should not include something in 3.0.0.
>
> I've been very open and clear about the goals, schedule, and scope of 3.0.0
> over the last year plus. The point of the extended alpha process was to get
> all our features in during alpha, and the alpha merge window has been open
> for a year. I'm unmoved by arguments about how long a feature has been
> worked on. None of these were not part of the original 3.0.0 scope, and our
> users have been waiting even longer for big-ticket 3.0 items like JDK8 and
> HDFS EC that were part of the discussed scope.
>
> I see that two VOTEs have gone out since I was out. I still plan to follow
> the proposal in my original email. This means I'll cut branch-3 and
> branch-3.0, and move trunk to 4.0.0 before these VOTEs end. This will open
> up development for Hadoop 3.1.0 and 4.0.0.
>
> I'm reaching out to the lead contributor of each of these features
> individually to discuss. We need to close on this quickly, and email is too
> low bandwidth at this stage.
>
> Best,
> Andrew
>


[jira] [Created] (YARN-7087) NM failed to perform log aggregation due to absent container

2017-08-23 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-7087:


 Summary: NM failed to perform log aggregation due to absent 
container
 Key: YARN-7087
 URL: https://issues.apache.org/jira/browse/YARN-7087
 Project: Hadoop YARN
  Issue Type: Bug
  Components: log-aggregation
Affects Versions: 2.8.1
Reporter: Jason Lowe
Priority: Critical


Saw a case where the NM failed to aggregate the logs for a container because it 
claimed it was absent:
{noformat}
2017-08-23 18:35:38,283 [AsyncDispatcher event handler] WARN 
logaggregation.LogAggregationService: Log aggregation cannot be started for 
container_e07_1503326514161_502342_01_01, as its an absent container
{noformat}

Containers should not be allowed to disappear if they're not done being fully 
processed by the NM.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7084) TestSchedulingMonitor#testRMStarts fails sporadically

2017-08-23 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-7084:


 Summary: TestSchedulingMonitor#testRMStarts fails sporadically
 Key: YARN-7084
 URL: https://issues.apache.org/jira/browse/YARN-7084
 Project: Hadoop YARN
  Issue Type: Bug
Reporter: Jason Lowe


TestSchedulingMonitor has been failing sporadically in precommit builds.  
Failures look like this:
{noformat}
Running 
org.apache.hadoop.yarn.server.resourcemanager.monitor.TestSchedulingMonitor
Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.802 sec <<< 
FAILURE! - in 
org.apache.hadoop.yarn.server.resourcemanager.monitor.TestSchedulingMonitor
testRMStarts(org.apache.hadoop.yarn.server.resourcemanager.monitor.TestSchedulingMonitor)
  Time elapsed: 1.728 sec  <<< FAILURE!
org.mockito.exceptions.verification.WantedButNotInvoked: 
Wanted but not invoked:
schedulingEditPolicy.editSchedule();
-> at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.TestSchedulingMonitor.testRMStarts(TestSchedulingMonitor.java:58)

However, there were other interactions with this mock:
-> at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.<init>(SchedulingMonitor.java:50)
-> at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.serviceInit(SchedulingMonitor.java:61)
-> at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.serviceInit(SchedulingMonitor.java:62)

at 
org.apache.hadoop.yarn.server.resourcemanager.monitor.TestSchedulingMonitor.testRMStarts(TestSchedulingMonitor.java:58)
{noformat}




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7019) Ability for applications to notify YARN about container reuse

2017-08-15 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-7019:


 Summary: Ability for applications to notify YARN about container 
reuse
 Key: YARN-7019
 URL: https://issues.apache.org/jira/browse/YARN-7019
 Project: Hadoop YARN
  Issue Type: New Feature
Reporter: Jason Lowe


During preemption calculations YARN can try to reduce the amount of work lost 
by considering how long a container has been running.  However when an 
application framework like Tez reuses a container across multiple tasks it 
changes the work lost calculation since the container has essentially 
checkpointed between task assignments.  It would be nice if applications could 
inform YARN when a container has been reused/checkpointed and therefore is a 
better candidate for preemption wrt. lost work than other, younger containers.
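
A purely hypothetical API shape for this idea might look like the following; nothing of this sort exists in the current AMRMClient, and the interface and method names are invented for illustration.
{code}
// Hypothetical interface only -- invented for illustration, not existing YARN API.
import org.apache.hadoop.yarn.api.records.ContainerId;

public interface ContainerReuseNotifier {
  /**
   * Tell the scheduler that the container finished a unit of work and is being
   * reused, so preemption cost accounting can treat the time before this call
   * as already-saved work.
   */
  void notifyContainerReused(ContainerId containerId, long checkpointTimeMs);
}
{code}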



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-7018) Interface for adding extra behavior to node heartbeats

2017-08-15 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-7018:


 Summary: Interface for adding extra behavior to node heartbeats
 Key: YARN-7018
 URL: https://issues.apache.org/jira/browse/YARN-7018
 Project: Hadoop YARN
  Issue Type: New Feature
  Components: resourcemanager
Reporter: Jason Lowe
Assignee: Jason Lowe


This JIRA tracks an interface for plugging in new behavior to node heartbeat 
processing.  Adding a formal interface for additional node heartbeat processing 
would allow admins to configure new functionality that is scheduler-independent 
without needing to replace the entire scheduler.  For example, both YARN-5202 
and YARN-5215 had approaches where node heartbeat processing was extended to 
implement new functionality that was essentially scheduler-independent and 
could be implemented as a plugin with this interface.
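
One possible shape for such a plugin point, purely as a sketch; the interface and its parameters are invented here, and the real interface would be whatever this JIRA defines.
{code}
// Hypothetical plugin hook, invented for illustration.
public interface NodeHeartbeatHook {
  /**
   * Called once per node heartbeat, independent of the configured scheduler,
   * so admins can plug in extra bookkeeping without replacing the scheduler.
   */
  void onNodeHeartbeat(String nodeId, long heartbeatTimestampMs);
}
{code}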



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



Re: [VOTE] Release Apache Hadoop 2.7.4 (RC0)

2017-08-02 Thread Jason Lowe
Thanks for driving the 2.7.4 release!
+1 (binding)
- Verified signatures and digests
- Successfully built from source including native
- Deployed to a single-node cluster and ran sample MapReduce jobs

Jason

On Saturday, July 29, 2017 6:29 PM, Konstantin Shvachko 
 wrote:
 

 Hi everybody,

Here is the next release of Apache Hadoop 2.7 line. The previous stable
release 2.7.3 was available since 25 August, 2016.
Release 2.7.4 includes 264 issues fixed after release 2.7.3, which are
critical bug fixes and major optimizations. See more details in Release
Note:
http://home.apache.org/~shv/hadoop-2.7.4-RC0/releasenotes.html

The RC0 is available at: http://home.apache.org/~shv/hadoop-2.7.4-RC0/

Please give it a try and vote on this thread. The vote will run for 5 days
ending 08/04/2017.

Please note that my up to date public key are available from:
https://dist.apache.org/repos/dist/release/hadoop/common/KEYS
Please don't forget to refresh the page if you've been there recently.
There are other place on Apache sites, which may contain my outdated key.

Thanks,
--Konstantin


   

[jira] [Created] (YARN-6917) Queue path is recomputed from scratch on every allocation

2017-08-01 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6917:


 Summary: Queue path is recomputed from scratch on every allocation
 Key: YARN-6917
 URL: https://issues.apache.org/jira/browse/YARN-6917
 Project: Hadoop YARN
  Issue Type: Improvement
  Components: capacityscheduler
Affects Versions: 2.8.1
Reporter: Jason Lowe
Priority: Minor


As part of the discussion in YARN-6901 I noticed that we are recomputing a 
queue's path for every allocation.  Currently getting the queue's path involves 
calling getQueuePath on the parent then building onto that string with the 
basename of the queue.  In turn the parent's getQueuePath method does the same, 
so we end up spending time recomputing a string that will never change until a 
reconfiguration.

Ideally the queue path should be computed once during queue initialization 
rather than on-demand.
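
A minimal sketch of the proposed change is below; it is illustrative only and not the actual CapacityScheduler queue code.
{code}
// Illustrative sketch: compute the queue path once at (re)initialization
// instead of rebuilding the string via the parent on every allocation.
class QueueSketch {
  private final String cachedPath;  // computed once at initialization

  QueueSketch(QueueSketch parent, String name) {
    // Walk to the parent once, here, rather than on every getQueuePath() call.
    this.cachedPath = (parent == null) ? name : parent.getQueuePath() + "." + name;
  }

  String getQueuePath() {
    return cachedPath;  // no recursive string building per allocation
  }
}
{code}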



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



Re: [Vote] merge feature branch YARN-2915 (Federation) to trunk

2017-07-31 Thread Jason Lowe
+1
Jason
 

On Tuesday, July 25, 2017 10:24 PM, Subru Krishnan  wrote:
 

 Hi all,

Per earlier discussion [9], I'd like to start a formal vote to merge
feature YARN Federation (YARN-2915) [1] to trunk. The vote will run for 7
days, and will end Aug 1 7PM PDT.

We have been developing the feature in a branch (YARN-2915 [2]) for a
while, and we are reasonably confident that the state of the feature meets
the criteria to be merged onto trunk.

*Key Ideas*:

YARN’s centralized design allows strict enforcement of scheduling
invariants and effective resource sharing, but becomes a scalability
bottleneck (in number of jobs and nodes) well before reaching the scale of
our clusters (e.g., 20k-50k nodes).


To address these limitations, we developed a scale-out, federation-based
solution (YARN-2915). Our architecture scales near-linearly to datacenter
sized clusters, by partitioning nodes across multiple sub-clusters (each
running a YARN cluster of few thousands nodes). Applications can span
multiple sub-clusters *transparently (i.e. no code change or recompilation
of existing apps)*, thanks to a layer of indirection that negotiates with
multiple sub-clusters' Resource Managers on behalf of the application.


This design is structurally scalable, as it bounds the number of nodes each
RM is responsible for. Appropriate policies ensure that the majority of
applications reside within a single sub-cluster, thus further controlling
the load on each RM. This provides near linear scale-out by simply adding
more sub-clusters. The same mechanism enables pooling of resources from
clusters owned and operated by different teams.

Status:

  - The version we would like to merge to trunk is termed "MVP" (minimal
  viable product). The feature will have a complete end-to-end application
  execution flow with the ability to span a single application across
  multiple YARN (sub) clusters.
  - There were 50+ sub-tasks that were that were completed as part of this
  effort. Every patch has been reviewed and +1ed by a committer. Thanks to
  Jian, Wangda, Karthik, Vinod, Varun & Arun for the thorough reviews!
  - Federation is designed to be built around YARN and consequently has
  minimal code changes to core YARN. The relevant JIRAs that modify existing
  YARN code base are YARN-3671 [7] & YARN-3673 [8]. We also paid close
  attention to ensure that if federation is disabled there is zero impact to
  existing functionality (disabled by default).
  - We found a few bugs as we went along which we fixed directly upstream
  in trunk and/or branch-2.
>   - We have been continuously rebasing the feature branch [2] so the merge
>   should be a straightforward cherry-pick.
  - The current version has been rather thoroughly tested and is currently
  deployed in a *10,000+ node federated YARN cluster that's running
  upwards of 50k jobs daily with a reliability of 99.9%*.
  - We have few ideas for follow-up extensions/improvements which are
  tracked in the umbrella JIRA YARN-5597[3].


Documentation:

  - Quick start guide (maven site) - YARN-6484[4].
  - Overall design doc[5] and the slide-deck [6] we used for our talk at
  Hadoop Summit 2016 is available in the umbrella jira - YARN-2915.


Credits:

This is a group effort that could have not been possible without the ideas
and hard work of many other folks and we would like to specifically call
out Giovanni, Botong & Ellen for their invaluable contributions. Also big
thanks to the many folks in community  (Sriram, Kishore, Sarvesh, Jian,
Wangda, Karthik, Vinod, Varun, Inigo, Vrushali, Sangjin, Joep, Rohith and
many more) that helped us shape our ideas and code with very insightful
feedback and comments.

Cheers,
Subru & Carlo

[1] YARN-2915: https://issues.apache.org/jira/browse/YARN-2915
[2] https://github.com/apache/hadoop/tree/YARN-2915
[3] YARN-5597: https://issues.apache.org/jira/browse/YARN-5597
[4] YARN-6484: https://issues.apache.org/jira/browse/YARN-6484
[5] https://issues.apache.org/jira/secure/attachment/12733292/Yarn_federation_design_v1.pdf
[6] https://issues.apache.org/jira/secure/attachment/12819229/YARN-Federation-Hadoop-Summit_final.pptx
[7] YARN-3671: https://issues.apache.org/jira/browse/YARN-3671
[8] YARN-3673: https://issues.apache.org/jira/browse/YARN-3673
[9]
http://mail-archives.apache.org/mod_mbox/hadoop-yarn-dev/201706.mbox/%3CCAOScs9bSsZ7mzH15Y%2BSPDU8YuNUAq7QicjXpDoX_tKh3MS4HsA%40mail.gmail.com%3E

   

Re: Apache Hadoop 2.8.2 Release Plan

2017-07-21 Thread Jason Lowe
+1 to base the 2.8.2 release off of the more recent activity on branch-2.8.  
Because branch-2.8.2 was cut so long ago it is missing a lot of fixes that are 
in branch-2.8.  There also are a lot of JIRAs that claim they are fixed in 
2.8.2 but are not in branch-2.8.2.  Having the 2.8.2 release be based on recent 
activity in branch-2.8 would solve both of these issues, and we'd only need to 
move the handful of JIRAs that have marked themselves correctly as fixed in 
2.8.3 to be fixed in 2.8.2.

Jason
 

On Friday, July 21, 2017 10:01 AM, Kihwal Lee 
 wrote:
 

 Thanks for driving the next 2.8 release, Junping. While I was committing a 
blocker for 2.7.4, I noticed some of the jiras are back-ported to 2.7, but 
missing in branch-2.8.2.  Perhaps it is safer and easier to simply rebranch 
2.8.2.
Thanks,
Kihwal

On Thursday, July 20, 2017, 3:32:16 PM CDT, Junping Du  
wrote:

Hi all,
    Per Vinod's previous email, we just announced that Apache Hadoop 2.8.1 was 
released today; it is a special security release. Now we should work towards the 
2.8.2 release, which aims for production deployment. The focus obviously is to 
fix blocker/critical issues [2], bug fixes and *no* features / improvements. We 
currently have 13 blocker/critical issues, and 10 of them are Patch Available.

  I plan to cut an RC in a month - targeting a release before the end of August, to 
give enough time for outstanding blocker / critical issues. I will start moving 
out any tickets that are not blockers and/or won't fit the timeline. For 
progress on the release effort, please refer to our release wiki [2].

  Please share thoughts if you have any. Thanks!

Thanks,

Junping

[1] 2.8.2 release Blockers/Criticals: https://s.apache.org/JM5x
[2] 2.8 Release wiki: 
https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+2.8+Release


From: Vinod Kumar Vavilapalli 
Sent: Thursday, July 20, 2017 1:05 PM
To: gene...@hadoop.apache.org
Subject: [ANNOUNCE] Apache Hadoop 2.8.1 is released

Hi all,

The Apache Hadoop PMC has released version 2.8.1. You can get it from this 
page: http://hadoop.apache.org/releases.html#Download
This is a security release in the 2.8.0 release line. It consists of 2.8.0 plus 
security fixes. Users on 2.8.0 are encouraged to upgrade to 2.8.1.

Please note that 2.8.x release line continues to be not yet ready for 
production use. Critical issues are being ironed out via testing and downstream 
adoption. Production users should wait for a subsequent release in the 2.8.x 
line.

Thanks
+Vinod


-
To unsubscribe, e-mail: mapreduce-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: mapreduce-dev-h...@hadoop.apache.org

   

[jira] [Created] (YARN-6846) Nodemanager can fail to fully delete application local directories when applications are killed

2017-07-19 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6846:


 Summary: Nodemanager can fail to fully delete application local 
directories when applications are killed
 Key: YARN-6846
 URL: https://issues.apache.org/jira/browse/YARN-6846
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.8.1
Reporter: Jason Lowe
Priority: Critical


When an application is killed all of the running containers are killed and the 
app waits for the containers to complete before cleaning up.  As each container 
completes the container directory is deleted via the DeletionService.  After 
all containers have completed the app completes and the app directory is 
deleted.  If the app completes quickly enough then the deletion of the 
container and app directories can race against each other.  If the container 
deletion executor deletes a file just before the application deletion executor 
then it can cause the application deletion executor to fail, leaving the 
remaining entries in the application directory lingering.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6814) TestNMClient fails if container completes before it is killed

2017-07-12 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6814:


 Summary: TestNMClient fails if container completes before it is 
killed
 Key: YARN-6814
 URL: https://issues.apache.org/jira/browse/YARN-6814
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Affects Versions: 2.8.1
Reporter: Jason Lowe


TestNMClient#testContainerManagement launches degenerate containers and 
verifies that the diagnostics are appropriate when a container is killed.  However if the 
container launch process races ahead and the degenerate container completes 
before it is killed then the diagnostics are for a successful container rather 
than a killed container and the unit test fails.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6813) TestAMRMProxy#testE2ETokenRenewal fails sporadically due to race conditions

2017-07-12 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6813:


 Summary: TestAMRMProxy#testE2ETokenRenewal fails sporadically due 
to race conditions
 Key: YARN-6813
 URL: https://issues.apache.org/jira/browse/YARN-6813
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Affects Versions: 2.8.1
Reporter: Jason Lowe


The testE2ETokenRenewal test lowers the AM and nodemanager expiry intervals 
to only 1.5 seconds.  This leaves very little headroom over the default 
heartbeat intervals of 1 second. If the AM hits a hiccup and runs a bit slower 
than expected the unit test can fail because the RM expires the AM.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6805) NPE in LinuxContainerExecutor due to null PrivilegedOperationException exit code

2017-07-11 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6805:


 Summary: NPE in LinuxContainerExecutor due to null 
PrivilegedOperationException exit code
 Key: YARN-6805
 URL: https://issues.apache.org/jira/browse/YARN-6805
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Affects Versions: 2.8.1
Reporter: Jason Lowe


The LinuxContainerExecutor contains a number of code snippets like this:
{code}
} catch (PrivilegedOperationException e) {
  int exitCode = e.getExitCode();
{code}
PrivilegedOperationException#getExitCode can return null if the operation was 
interrupted, so when the JVM does auto-unboxing on that last line it can NPE if 
there was no exit code.
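A minimal null-safe sketch (illustrative only, not the committed fix; the -1 fallback is an assumption):
{code}
} catch (PrivilegedOperationException e) {
  // getExitCode() returns an Integer that can be null if the operation was
  // interrupted, so check before auto-unboxing instead of assigning to an int.
  Integer exitCodeObj = e.getExitCode();
  int exitCode = (exitCodeObj != null) ? exitCodeObj.intValue() : -1;
{code}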



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6797) TimelineWriter does not fully consume the post response

2017-07-10 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6797:


 Summary: TimelineWriter does not fully consume the post response
 Key: YARN-6797
 URL: https://issues.apache.org/jira/browse/YARN-6797
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineclient
Affects Versions: 2.8.1
Reporter: Jason Lowe
Assignee: Jason Lowe


TimelineWriter does not fully consume the response to the POST request, and 
that ends up preventing the HTTP client from being reused for the next write of 
an entity.
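The general fix pattern, sketched with the Jersey ClientResponse used by the client (the helper and logger names below are assumptions, not the exact patch):
{code}
ClientResponse resp = doPost(entities);   // hypothetical helper that issues the POST
try {
  if (resp.getStatus() != ClientResponse.Status.OK.getStatusCode()) {
    LOG.error("Failed to post timeline entities, HTTP status " + resp.getStatus());
  }
} finally {
  resp.close();   // consume/release the response so the connection can be reused
}
{code}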



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6650) ContainerTokenIdentifier is re-encoded during token verification

2017-05-25 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6650:


 Summary: ContainerTokenIdentifier is re-encoded during token 
verification
 Key: YARN-6650
 URL: https://issues.apache.org/jira/browse/YARN-6650
 Project: Hadoop YARN
  Issue Type: Bug
  Components: security
Affects Versions: 2.8.0
Reporter: Jason Lowe


A ContainerTokenIdentifier is serialized into bytes and signed by the RM secret 
key.  When the NM needs to verify the identifier, it is decoding the bytes into 
a ContainerTokenIdentifier to get the key ID then re-encoding the identifier 
into a byte buffer to hash it with the key.  This is fine as long as the RM and 
NM both agree how a ContainerTokenIdentifier should be serialized into bytes.

However, when the RM and NM versions differ and fields were added to the 
identifier between those versions, the NM may end up re-serializing the fields 
in a different order than the RM did, especially when gaps in the protocol 
field IDs were filled in between the versions. If the fields are reordered 
during re-encoding, the bytes will not match the original stream that was 
signed and token verification will fail.

The original token identifier bytes received via RPC need to be used by the 
verification process, not the bytes generated by re-encoding the identifier.
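Illustrative sketch of that approach (only the general shape, not the actual NM verification code):
{code}
// Parse the identifier for metadata, but verify against the original bytes.
byte[] identifierBytes = token.getIdentifier();   // bytes exactly as received via RPC
ContainerTokenIdentifier id = new ContainerTokenIdentifier();
id.readFields(new DataInputStream(new ByteArrayInputStream(identifierBytes)));
// Look up the master key via id.getMasterKeyId(), then compute the expected
// password over identifierBytes -- not over id.getBytes(), which may serialize
// fields in a different order on a mixed-version cluster.
{code}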



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6647) ZKRMStateStore can crash during shutdown due to InterruptedException

2017-05-25 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6647:


 Summary: ZKRMStateStore can crash during shutdown due to 
InterruptedException
 Key: YARN-6647
 URL: https://issues.apache.org/jira/browse/YARN-6647
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Reporter: Jason Lowe


Noticed some tests were failing due to the JVM shutting down early.  I was able 
to reproduce this occasionally with TestKillApplicationWithRMHA.  Stacktrace to 
follow.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6628) Unexpected jackson-core-2.2.3 dependency introduced

2017-05-19 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6628:


 Summary: Unexpected jackson-core-2.2.3 dependency introduced
 Key: YARN-6628
 URL: https://issues.apache.org/jira/browse/YARN-6628
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.8.1
Reporter: Jason Lowe
Priority: Blocker


The change in YARN-5894 caused jackson-core-2.2.3.jar to be added to 
share/hadoop/yarn/lib/. This added dependency seems to be incompatible with 
jackson-core-asl-1.9.13.jar, which is also shipped as a dependency.  The new 
jackson-core jar ends up breaking jobs that ran fine on 2.8.0.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6618) TestNMLeveldbStateStoreService#testCompactionCycle can fail if compaction occurs more than once

2017-05-17 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6618:


 Summary: TestNMLeveldbStateStoreService#testCompactionCycle can 
fail if compaction occurs more than once
 Key: YARN-6618
 URL: https://issues.apache.org/jira/browse/YARN-6618
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Affects Versions: 2.8.0
Reporter: Jason Lowe
Assignee: Jason Lowe


The testCompactionCycle unit test verifies that the compaction cycle occurs 
after startup, but on rare occasions the compaction cycle occurs more than 
once, which fails the test.  The unit test needs to account for this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6603) NPE in RMAppsBlock

2017-05-15 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6603:


 Summary: NPE in RMAppsBlock
 Key: YARN-6603
 URL: https://issues.apache.org/jira/browse/YARN-6603
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.8.0
Reporter: Jason Lowe


We are seeing an intermittent NPE when the RM is trying to render the /cluster 
URI.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6598) History server getApplicationReport NPE when fetching report for pre-2.8 job

2017-05-12 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6598:


 Summary: History server getApplicationReport NPE when fetching 
report for pre-2.8 job
 Key: YARN-6598
 URL: https://issues.apache.org/jira/browse/YARN-6598
 Project: Hadoop YARN
  Issue Type: Bug
  Components: timelineserver
Affects Versions: 2.8.0
Reporter: Jason Lowe
Assignee: Jason Lowe
Priority: Blocker


ApplicationHistoryManagerOnTimelineStore#convertToApplicationReport can NPE for 
a job that was run prior to the cluster upgrading to 2.8.  It blindly assumes 
preemption metrics are present when CPU metrics are present, and when they are 
not it triggers the NPE.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-6558) YARN ContainerLocalizer logs are missing

2017-05-04 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved YARN-6558.
--
Resolution: Duplicate

> YARN ContainerLocalizer logs are missing
> 
>
> Key: YARN-6558
> URL: https://issues.apache.org/jira/browse/YARN-6558
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: yarn
>Affects Versions: 2.7.1
>Reporter: Prabhu Joseph
>
> The YARN LCE ContainerLocalizer runs as a separate process and its logs / error 
> messages are not captured. We need to redirect them to stdout or a separate 
> log file, which would help debug localization issues.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-6524) Avoid storing unnecessary information in the Memory for the finished apps

2017-04-25 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved YARN-6524.
--
Resolution: Duplicate

> Avoid storing unnecessary information in the Memory for the finished apps
> -
>
> Key: YARN-6524
> URL: https://issues.apache.org/jira/browse/YARN-6524
> Project: Hadoop YARN
>  Issue Type: Bug
>  Components: RM
>Affects Versions: 2.7.3
>Reporter: Naganarasimha G R
>
> Avoid storing unnecessary information in memory for finished apps.
> In a cluster with a large number of finished apps, extra memory is required 
> to store information that is no longer used, i.e. the AM's container launch 
> context (localization resources, tokens, etc.).
> In one such scenario we had around 9k finished apps, each with 257 
> LocalResources amounting to roughly 108 KB per app; for those 9k apps this 
> consumed nearly 0.8 GB of memory. On low-end machines this can create a 
> resource crunch in the RM.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



Re: Apache Hadoop qbt Report: trunk+JDK8 on Linux/x86

2017-04-18 Thread Jason Lowe
Thanks for the pointers, Sean!  According to the infrastructure team, 
apparently it was a typo in the protection scheme that allowed the trunk force 
push to go through.  
 
https://issues.apache.org/jira/browse/INFRA-13902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971643#comment-15971643
   
Jason
 On Monday, April 17, 2017 3:05 PM, Sean Busbey <bus...@cloudera.com> wrote:
 

 disallowing force pushes to trunk was done back in:

* August 2014: INFRA-8195
* February 2016: INFRA-11136

On Mon, Apr 17, 2017 at 11:18 AM, Jason Lowe
<jl...@yahoo-inc.com.invalid> wrote:
> I found at least one commit that was dropped, MAPREDUCE-6673.  I was able to 
> cherry-pick the original commit hash since it was recorded in the commit 
> email.
> This begs the question of why we're allowing force pushes to trunk.  I 
> thought we asked to have that disabled the last time trunk was accidentally 
> clobbered?
> Jason
>
>
>    On Monday, April 17, 2017 10:18 AM, Arun Suresh <asur...@apache.org> wrote:
>
>
>  Hi
>
> I had the Apr-14 eve version of trunk on my local machine. I've pushed that.
> Don't know if anything was committed over the weekend though.
>
> Cheers
> -Arun
>
> On Mon, Apr 17, 2017 at 7:17 AM, Anu Engineer <aengin...@hortonworks.com>
> wrote:
>
>> Hi Allen,
>>
>> https://issues.apache.org/jira/browse/INFRA-13902
>>
>> That happened with ozone branch too. It was an inadvertent force push.
>> Infra has advised us to force push the latest branch if you have it.
>>
>> Thanks
>> Anu
>>
>>
>> On 4/17/17, 7:10 AM, "Allen Wittenauer" <a...@effectivemachines.com> wrote:
>>
>> >Looks like someone reset HEAD back to Mar 31.
>> >
>> >Sent from my iPad
>> >
>> >> On Apr 16, 2017, at 12:08 AM, Apache Jenkins Server <
>> jenk...@builds.apache.org> wrote:
>> >>
>> >> For more details, see https://builds.apache.org/job/
>> hadoop-qbt-trunk-java8-linux-x86/378/
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> -1 overall
>> >>
>> >>
>> >> The following subsystems voted -1:
>> >>    docker
>> >>
>> >>
>> >> Powered by Apache Yetus 0.5.0-SNAPSHOT  http://yetus.apache.org
>> >>
>> >>
>> >>
>> >> -
>> >> To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
>> >> For additional commands, e-mail: common-dev-h...@hadoop.apache.org
>> >
>> >
>> >-
>> >To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
>> >For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
>> >
>> >
>>
>>
>
>
>



-- 
busbey


   

Re: Apache Hadoop qbt Report: trunk+JDK8 on Linux/x86

2017-04-17 Thread Jason Lowe
I found at least one commit that was dropped, MAPREDUCE-6673.  I was able to 
cherry-pick the original commit hash since it was recorded in the commit email.
This begs the question of why we're allowing force pushes to trunk.  I thought 
we asked to have that disabled the last time trunk was accidentally clobbered?
Jason
 

On Monday, April 17, 2017 10:18 AM, Arun Suresh  wrote:
 

 Hi

I had the Apr-14 eve version of trunk on my local machine. I've pushed that.
Don't know if anything was committed over the weekend though.

Cheers
-Arun

On Mon, Apr 17, 2017 at 7:17 AM, Anu Engineer 
wrote:

> Hi Allen,
>
> https://issues.apache.org/jira/browse/INFRA-13902
>
> That happened with ozone branch too. It was an inadvertent force push.
> Infra has advised us to force push the latest branch if you have it.
>
> Thanks
> Anu
>
>
> On 4/17/17, 7:10 AM, "Allen Wittenauer"  wrote:
>
> >Looks like someone reset HEAD back to Mar 31.
> >
> >Sent from my iPad
> >
> >> On Apr 16, 2017, at 12:08 AM, Apache Jenkins Server <
> jenk...@builds.apache.org> wrote:
> >>
> >> For more details, see https://builds.apache.org/job/
> hadoop-qbt-trunk-java8-linux-x86/378/
> >>
> >>
> >>
> >>
> >>
> >> -1 overall
> >>
> >>
> >> The following subsystems voted -1:
> >>    docker
> >>
> >>
> >> Powered by Apache Yetus 0.5.0-SNAPSHOT  http://yetus.apache.org
> >>
> >>
> >>
> >> -
> >> To unsubscribe, e-mail: common-dev-unsubscr...@hadoop.apache.org
> >> For additional commands, e-mail: common-dev-h...@hadoop.apache.org
> >
> >
> >-
> >To unsubscribe, e-mail: hdfs-dev-unsubscr...@hadoop.apache.org
> >For additional commands, e-mail: hdfs-dev-h...@hadoop.apache.org
> >
> >
>
>


   

[jira] [Created] (YARN-6461) TestRMAdminCLI has very low test timeouts

2017-04-10 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6461:


 Summary: TestRMAdminCLI has very low test timeouts
 Key: YARN-6461
 URL: https://issues.apache.org/jira/browse/YARN-6461
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Affects Versions: 2.8.0
Reporter: Jason Lowe


TestRMAdminCLI has only 500 millisecond timeouts on many of the unit tests.  If 
the test machine or VM is loaded/slow then the tests can report a false 
positive.

I'm not sure these tests need explicit timeouts.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6450) TestContainerManagerWithLCE requires override for each new test added to ContainerManagerTest

2017-04-05 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6450:


 Summary: TestContainerManagerWithLCE requires override for each 
new test added to ContainerManagerTest
 Key: YARN-6450
 URL: https://issues.apache.org/jira/browse/YARN-6450
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Jason Lowe
Assignee: Jason Lowe


Every test in TestContainerManagerWithLCE looks like this:
{code}
  @Override
  public void testSomething() throws Exception {
// Don't run the test if the binary is not available.
if (!shouldRunTest()) {
  LOG.info("LCE binary path is not passed. Not running the test");
  return;
}
LOG.info("Running something");
super.testSomething();
  }
{code}

If  a new test is added to ContainerManagerTest then by default 
ContainerManagerTestWithLCE will fail when the LCE has not been configured.  
This is an unnecessary maintenance burden.
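One way to avoid the per-test boilerplate would be a single JUnit assumption in the LCE subclass setup; a sketch (not necessarily the fix that will land):
{code}
// Skips every inherited test in one place when the LCE binary is not configured.
@Before
public void skipIfLceUnavailable() {
  Assume.assumeTrue("LCE binary path is not passed; skipping test", shouldRunTest());
}
{code}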




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6437) TestSignalContainer#testSignalRequestDeliveryToNM fails intermittently

2017-04-04 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6437:


 Summary: TestSignalContainer#testSignalRequestDeliveryToNM fails 
intermittently
 Key: YARN-6437
 URL: https://issues.apache.org/jira/browse/YARN-6437
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Affects Versions: 2.8.0
Reporter: Jason Lowe
Assignee: Jason Lowe


testSignalRequestDeliveryToNM can fail if the containers are returned across 
multiple scheduling heartbeats.  The loop waiting for all the containers should 
be accumulating them, but instead it overwrites the same list of containers 
with whatever the allocate call returns.
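Sketch of the accumulation the loop should do instead (the MockAM/MockNM calls shown are illustrative):
{code}
List<Container> allocated = new ArrayList<>();
while (allocated.size() < numContainers) {
  AllocateResponse resp = am.allocate(new ArrayList<ResourceRequest>(),
      new ArrayList<ContainerId>());
  allocated.addAll(resp.getAllocatedContainers());   // accumulate, don't overwrite
  nm1.nodeHeartbeat(true);
  Thread.sleep(200);
}
{code}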



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6436) TestSchedulingPolicy#testParseSchedulingPolicy timeout is too low

2017-04-04 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6436:


 Summary: TestSchedulingPolicy#testParseSchedulingPolicy timeout is 
too low
 Key: YARN-6436
 URL: https://issues.apache.org/jira/browse/YARN-6436
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Jason Lowe


The timeout for testParseSchedulingPolicy is only one second.  An I/O hiccup on 
a VM can make this test fail for the wrong reasons.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



Re: [VOTE] Release Apache Hadoop 2.8.0 (RC3)

2017-03-17 Thread Jason Lowe
+1 (binding)
- Verified signatures and digests
- Performed a native build from the release tag
- Deployed to a single-node cluster
- Ran some sample jobs
Jason
 

On Friday, March 17, 2017 4:18 AM, Junping Du  wrote:
 

 Hi all,
    With the fix for HDFS-11431 in, I've created a new release candidate (RC3) 
for Apache Hadoop 2.8.0.

    This is the next minor release to follow up 2.7.0 which has been released 
for more than 1 year. It comprises 2,900+ fixes, improvements, and new 
features. Most of these commits are released for the first time in branch-2.

      More information about the 2.8.0 release plan can be found here: 
https://cwiki.apache.org/confluence/display/HADOOP/Hadoop+2.8+Release

      New RC is available at: 
http://home.apache.org/~junping_du/hadoop-2.8.0-RC3

      The RC tag in git is: release-2.8.0-RC3, and the latest commit id is: 
91f2b7a13d1e97be65db92ddabc627cc29ac0009

      The maven artifacts are available via repository.apache.org at: 
https://repository.apache.org/content/repositories/orgapachehadoop-1057

      Please try the release and vote; the vote will run for the usual 5 days, 
ending on 03/22/2017 PDT time.

Thanks,

Junping

   

Re: Two AMs in one YARN container?

2017-03-16 Thread Jason Lowe
The doAs method in UserGroupInformation is what you want when dealing with 
multiple UGIs.  It determines which UGI instance the code within the doAs scope 
gets when that code looks up the current user.
Each AM is designed to run in a separate JVM, so each has some main()-like 
entry point that does everything to set up the AM.  Theoretically all you need 
to do is create two separate UGIs and then use each one to perform a doAs 
wrapping the invocation of the corresponding AM's entry point.  After that, 
everything that AM does will see the UGI of the doAs invocation as the current 
user.  Since the AMs run in separate doAs scopes, they will get 
separate UGIs for the current user and thus separate credentials.
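A minimal sketch of that idea (the entry-point method names below are hypothetical, not Spark or REEF API):

UserGroupInformation sparkUgi = UserGroupInformation.createRemoteUser("spark-am");
UserGroupInformation reefUgi = UserGroupInformation.createRemoteUser("reef-am");

// Each AM body sees only the credentials of its own doAs scope.
sparkUgi.doAs((PrivilegedExceptionAction<Void>) () -> {
  runSparkDriverAm();    // hypothetical entry point of the managed AM
  return null;
});
reefUgi.doAs((PrivilegedExceptionAction<Void>) () -> {
  runReefUnmanagedAm();  // hypothetical entry point of the unmanaged AM
  return null;
});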
Jason
 

On Thursday, March 16, 2017 4:03 PM, Sergiy Matusevych 
<sergiy.matusev...@gmail.com> wrote:
 

 Hi Jason,

Thanks a lot for your help again! Having two separate UserGroupInformation 
instances is exactly what I had in mind. What I do not understand, though, is 
how to make sure that our second call to .regsiterApplicationMaster() will pick 
the right UserGroupInformation object. I would love to find a way that does not 
involve any changes to the YARN client, but if we have to patch it, of course, 
I agree that we need to have a generic yet minimally invasive solution.
Thank you!
Sergiy.


On Thu, Mar 16, 2017 at 8:03 AM, Jason Lowe <jl...@yahoo-inc.com> wrote:
>
> I believe a cleaner way to solve this problem is to create two, _separate_ 
> UserGroupInformation objects and wrap each AM instance in a UGI doAs so they 
> aren't trying to share the same credentials.  This is one example of a token 
> bleeding over and causing problems. I suspect trying to fix these one-by-one 
> as they pop up is going to be frustrating compared to just ensuring the 
> credentials remain separate as if they really were running in separate JVMs.
>
> Adding Daryn who knows a lot more about the UGI stuff so he can correct any 
> misunderstandings on my part.
>
> Jason
>
>
> On Wednesday, March 15, 2017 1:11 AM, Sergiy Matusevych 
> <sergiy.matusev...@gmail.com> wrote:
>
>
> Hi YARN developers,
>
> I have an interesting problem that I think is related to YARN Java client.
> I am trying to launch *two* application masters in one container. To be
> more specific, I am starting a Spark job on YARN, and launch an Apache REEF
> Unmanaged AM from the Spark Driver.
>
> Technically, YARN Resource Manager should not care which process each AM
> runs in. However, there is a problem with the YARN Java client
> implementation: there is a global UserGroupInformation object that holds
> the user credentials of the current RM session. This data structure is
> shared by all AMs, and when REEF application tries to register the second
> (unmanaged) AM, the client library presents to YARN RM all credentials,
> including the security token of the first (managed) AM. YARN rejects such
> registration request, throwing InvalidApplicationMasterRequestException
> "Application Master is already registered".
>
> I feel like this issue can be resolved by a relatively small update to the
> YARN Java client - e.g. by introducing a new variant of the
> AMRMClientAsync.registerApplicationMaster() that would take the required
> security token (instead of getting it implicitly from the
> UserGroupInformation.getCurrentUser().getCredentials() etc.), or having
> some sort of RM session class that would wrap all data that is currently
> global. I need to think about the elegant API for it.
>
> What do you guys think? I would love to work on this problem and send you a
> pull request for the upcoming 2.9 release.
>
> Cheers,
> Sergiy.
>
>


   

[jira] [Created] (YARN-6354) RM fails to upgrade to 2.8 with leveldb state store

2017-03-16 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6354:


 Summary: RM fails to upgrade to 2.8 with leveldb state store
 Key: YARN-6354
 URL: https://issues.apache.org/jira/browse/YARN-6354
 Project: Hadoop YARN
  Issue Type: Bug
  Components: resourcemanager
Affects Versions: 2.8.0
Reporter: Jason Lowe
Priority: Critical


When trying to upgrade an RM to 2.8 it fails with a 
StringIndexOutOfBoundsException trying to load reservation state.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



Re: Two AMs in one YARN container?

2017-03-16 Thread Jason Lowe
I believe a cleaner way to solve this problem is to create two, _separate_ 
UserGroupInformation objects and wrap each AM instance in a UGI doAs so they 
aren't trying to share the same credentials.  This is one example of a token 
bleeding over and causing problems. I suspect trying to fix these one-by-one as 
they pop up is going to be frustrating compared to just ensuring the 
credentials remain separate as if they really were running in separate JVMs.
Adding Daryn who knows a lot more about the UGI stuff so he can correct any 
misunderstandings on my part.
Jason
 

On Wednesday, March 15, 2017 1:11 AM, Sergiy Matusevych 
 wrote:
 

 Hi YARN developers,

I have an interesting problem that I think is related to YARN Java client.
I am trying to launch *two* application masters in one container. To be
more specific, I am starting a Spark job on YARN, and launch an Apache REEF
Unmanaged AM from the Spark Driver.

Technically, YARN Resource Manager should not care which process each AM
runs in. However, there is a problem with the YARN Java client
implementation: there is a global UserGroupInformation object that holds
the user credentials of the current RM session. This data structure is
shared by all AMs, and when REEF application tries to register the second
(unmanaged) AM, the client library presents to YARN RM all credentials,
including the security token of the first (managed) AM. YARN rejects such
registration request, throwing InvalidApplicationMasterRequestException
"Application Master is already registered".

I feel like this issue can be resolved by a relatively small update to the
YARN Java client - e.g. by introducing a new variant of the
AMRMClientAsync.registerApplicationMaster() that would take the required
security token (instead of getting it implicitly from the
UserGroupInformation.getCurrentUser().getCredentials() etc.), or having
some sort of RM session class that would wrap all data that is currently
global. I need to think about the elegant API for it.

What do you guys think? I would love to work on this problem and send you a
pull request for the upcoming 2.9 release.

Cheers,
Sergiy.


   

[jira] [Created] (YARN-6349) Container kill request from AM can be lost if container is still recovering

2017-03-16 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6349:


 Summary: Container kill request from AM can be lost if container 
is still recovering
 Key: YARN-6349
 URL: https://issues.apache.org/jira/browse/YARN-6349
 Project: Hadoop YARN
  Issue Type: Bug
  Components: nodemanager
Reporter: Jason Lowe


If container recovery takes an excessive amount of time (e.g.: HDFS is slow) 
then the NM could start servicing requests before all containers have 
recovered.  If an AM tries to kill a container while it is still recovering 
then this kill request could be lost.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



Re: LevelDB corruption in YARN Application TimelineServer

2017-03-06 Thread Jason Lowe
Verify that something outside of Hadoop/YARN is not coming along periodically 
and removing "old" files (e.g.: tmpwatch, etc.).  Users have reported similar 
cases in the past that were tracked down to an invalid setup.  State was being 
corrupted by a periodic cleanup tool, like tmpwatch, removing files.
Jason
 

On Thursday, March 2, 2017 5:59 PM, Abhishek Das  
wrote:
 

 Hi,

I am running a hadoop 2.6.0 cluster in ec2 instances with r3.2xlarge as
instance of the master node. YARN Application TimelineServer running in the
master node is throwing an exception because of leveldb corruption. This
issue seems to be happening when the cluster has been up for a long time
(more than 7 days). The stack trace is given below.

ERROR org.apache.hadoop.yarn.server.timeline.TimelineDataManager: Skip the
timeline entity: { id: , type: TEZ_TASK_ID }
java.lang.RuntimeException:
org.fusesource.leveldbjni.internal.NativeDB$DBException: *IO error:
/media/ephemeral0/hadoop-root/yarn/timeline/leveldb-timeline-store.ldb/330951.sst:
No such file or directory*
        at
org.fusesource.leveldbjni.internal.JniDBIterator.seek(JniDBIterator.java:68)
        at
org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.getEntity(LeveldbTimelineStore.java:444)
        at
org.apache.hadoop.yarn.server.timeline.TimelineDataManager.postEntities(TimelineDataManager.java:257)
        at
org.apache.hadoop.yarn.server.timeline.webapp.TimelineWebServices.postEntities(TimelineWebServices.java:259)
        at sun.reflect.GeneratedMethodAccessor35.invoke(Unknown Source)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at
com.sun.jersey.spi.container.JavaMethodInvokerFactory$1.invoke(JavaMethodInvokerFactory.java:60)
        at
com.sun.jersey.server.impl.model.method.dispatch.AbstractResourceMethodDispatchProvider$TypeOutInvoker._dispatch(AbstractResourceMethodDispatchProvider.java:185)
        at
com.sun.jersey.server.impl.model.method.dispatch.ResourceJavaMethodDispatcher.dispatch(ResourceJavaMethodDispatcher.java:75)
        at
com.sun.jersey.server.impl.uri.rules.HttpMethodRule.accept(HttpMethodRule.java:288)
        at
com.sun.jersey.server.impl.uri.rules.ResourceClassRule.accept(ResourceClassRule.java:108)
        at
com.sun.jersey.server.impl.uri.rules.RightHandPathRule.accept(RightHandPathRule.java:147)
        at
com.sun.jersey.server.impl.uri.rules.RootResourceClassesRule.accept(RootResourceClassesRule.java:84)
        at
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1469)
        at
com.sun.jersey.server.impl.application.WebApplicationImpl._handleRequest(WebApplicationImpl.java:1400)
        at
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1349)
        at
com.sun.jersey.server.impl.application.WebApplicationImpl.handleRequest(WebApplicationImpl.java:1339)
        at
com.sun.jersey.spi.container.servlet.WebComponent.service(WebComponent.java:416)
        at
com.sun.jersey.spi.container.servlet.ServletContainer.service(ServletContainer.java:537)
        at
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:886)
        at
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:834)
        at
com.sun.jersey.spi.container.servlet.ServletContainer.doFilter(ServletContainer.java:795)
        at
com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
        at
com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
        at
com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
        at
com.google.inject.servlet.GuiceFilter.doFilter(GuiceFilter.java:113)
        at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at
org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:96)
        at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at
org.apache.hadoop.yarn.server.timeline.webapp.CrossOriginFilter.doFilter(CrossOriginFilter.java:95)
        at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at
org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:572)
        at
org.apache.hadoop.security.token.delegation.web.DelegationTokenAuthenticationFilter.doFilter(DelegationTokenAuthenticationFilter.java:269)
        at
org.apache.hadoop.security.authentication.server.AuthenticationFilter.doFilter(AuthenticationFilter.java:542)
        at
org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
        at

[jira] [Created] (YARN-6217) TestLocalCacheDirectoryManager test timeout is too aggressive

2017-02-22 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6217:


 Summary: TestLocalCacheDirectoryManager test timeout is too 
aggressive
 Key: YARN-6217
 URL: https://issues.apache.org/jira/browse/YARN-6217
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Jason Lowe


TestLocalCacheDirectoryManager#testDirectoryStateChangeFromFullToNonFull has 
only a one second timeout.  If the test machine hits an I/O hiccup it can fail. 
 The test timeout is too aggressive, and I question whether this test even 
needs an explicit timeout specified.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-6191) CapacityScheduler preemption by container priority can be problematic for MapReduce

2017-02-14 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-6191:


 Summary: CapacityScheduler preemption by container priority can be 
problematic for MapReduce
 Key: YARN-6191
 URL: https://issues.apache.org/jira/browse/YARN-6191
 Project: Hadoop YARN
  Issue Type: Bug
  Components: capacityscheduler
Reporter: Jason Lowe


A MapReduce job with thousands of reducers and just a couple of maps left to go 
was running in a preemptable queue.  Periodically other queues would get busy 
and the RM would preempt some resources from the job, but it _always_ picked 
the job's map tasks first because they use the lowest priority containers.  
Even though the reducers had a shorter running time, most were spared but the 
maps were always shot.  Since the map tasks ran for a longer time than the 
preemption period, the job was in a perpetual preemption loop.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



Re: YARN Client and Unmanaged AM running in the same process?

2017-01-24 Thread Jason Lowe
va:786)
at
org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:108)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:850)
at
org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:831)
at
org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:175)
at
org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:108)
at java.lang.Thread.run(Thread.java:745)
2017-01-23 19:39:42,453 INFO org.apache.hadoop.ipc.Server: Socket Reader #1
for port 8032: readAndProcess from client 10.130.68.120 threw exception
[java.io.IOException: An existing connection was forcibly closed by the
remote host]
java.io.IOException: An existing connection was forcibly closed by the
remote host
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.hadoop.ipc.Server.channelRead(Server.java:2635)
at org.apache.hadoop.ipc.Server.access$2800(Server.java:136)
at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1492)
at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:782)
at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:648)
at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:619)
2017-01-23 19:39:42,453 INFO org.apache.hadoop.ipc.Server: Socket Reader #1
for port 8030: readAndProcess from client 10.130.68.120 threw exception
[java.io.IOException: An existing connection was forcibly closed by the
remote host]
java.io.IOException: An existing connection was forcibly closed by the
remote host
at sun.nio.ch.SocketDispatcher.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:43)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:197)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.hadoop.ipc.Server.channelRead(Server.java:2635)
at org.apache.hadoop.ipc.Server.access$2800(Server.java:136)
at org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:1492)
at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:782)
at org.apache.hadoop.ipc.Server$Listener$Reader.doRunLoop(Server.java:648)
at org.apache.hadoop.ipc.Server$Listener$Reader.run(Server.java:619)
2017-01-23 19:50:59,224 INFO
org.apache.hadoop.yarn.util.AbstractLivelinessMonitor:
Expired:appattempt_1465994698013_1252_01 Timed out after 600 secs


On Tue, Jan 24, 2017 at 8:48 AM, Jason Lowe <jl...@yahoo-inc.com> wrote:
>
> Have you looked at the ResourceManager logs to see what it is doing when
it receives the unregister request?  I'm wondering if there's an exception
or error that could explain why it is not working as expected.  The sample
code works against trunk for me.  The unmanaged AM registered,
unregistered, and the final status of the application on the RM was
FINISHED/SUCCEEDED.
>
> Jason
>
>
> On Monday, January 23, 2017 9:51 PM, Sergiy Matusevych <
sergiy.matusev...@gmail.com> wrote:
>
>
> Hi fellow YARN developers,
>
> I am writing a YARN application that runs both Client *and* Unmanaged
> Application Master in the same JVM process. I have a toy example that
> starts a YARN application and the AM in Unmanaged mode, and then just
shuts
> it down:
>
>
https://github.com/apache/reef/blob/master/lang/java/reef-runtime-yarn/src/test/java/org/apache/reef/runtime/yarn/driver/unmanaged/UnmanagedAmTest.java
>
> (I wrapped it as a unit test, but the code is 100% independent of REEF, so
> you can copy & paste it if you want to play with it; I can also build a
> small maven project around it).
>
> The app *almost* works - the problem seems to be that the call on line 117
>
>    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED,
> "Success!", null);
>
> fails to update the status of the application on the Resource Manager
side.
> I would expect it to be FINISHED/SUCCEEDED, but instead it stays in
> RUNNING/UNDEFINED well after the client/AM process terminates, and
> eventually got marked as FAILED/FAILED by the RM.
>
> Am I doing something wrong, or is that a bug in YARN? I found an old JIRA
> issue that seems to be related to the problem:
>
> https://issues.apache.org/jira/browse/YARN-273
> "Add an unmanaged AM client for in-process AMs"
>
> Can someone confirm if my problem is indeed related to that issue, or is
> there something wrong with my code?
>
> Thank you!
> Sergiy.
>
>


   

Re: YARN Client and Unmanaged AM running in the same process?

2017-01-24 Thread Jason Lowe
Have you looked at the ResourceManager logs to see what it is doing when it 
receives the unregister request?  I'm wondering if there's an exception or 
error that could explain why it is not working as expected.  The sample code 
works against trunk for me.  The unmanaged AM registered, unregistered, and the 
final status of the application on the RM was FINISHED/SUCCEEDED.
Jason
 

On Monday, January 23, 2017 9:51 PM, Sergiy Matusevych 
 wrote:
 

 Hi fellow YARN developers,

I am writing a YARN application that runs both Client *and* Unmanaged
Application Master in the same JVM process. I have a toy example that
starts a YARN application and the AM in Unmanaged mode, and then just shuts
it down:

https://github.com/apache/reef/blob/master/lang/java/reef-runtime-yarn/src/test/java/org/apache/reef/runtime/yarn/driver/unmanaged/UnmanagedAmTest.java

(I wrapped it as a unit test, but the code is 100% independent of REEF, so
you can copy & paste it if you want to play with it; I can also build a
small maven project around it).

The app *almost* works - the problem seems to be that the call on line 117

    rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED,
"Success!", null);

fails to update the status of the application on the Resource Manager side.
I would expect it to be FINISHED/SUCCEEDED, but instead it stays in
RUNNING/UNDEFINED well after the client/AM process terminates, and
eventually got marked as FAILED/FAILED by the RM.

Am I doing something wrong, or is that a bug in YARN? I found an old JIRA
issue that seems to be related to the problem:

https://issues.apache.org/jira/browse/YARN-273
"Add an unmanaged AM client for in-process AMs"

Can someone confirm if my problem is indeed related to that issue, or is
there something wrong with my code?

Thank you!
Sergiy.


   

[jira] [Created] (YARN-5859) TestResourceLocalizationService#testParallelDownloadAttemptsForPublicResource sometimes fails

2016-11-08 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-5859:


 Summary: 
TestResourceLocalizationService#testParallelDownloadAttemptsForPublicResource 
sometimes fails
 Key: YARN-5859
 URL: https://issues.apache.org/jira/browse/YARN-5859
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Jason Lowe


Saw the following test failure:
{noformat}
Running 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService
Tests run: 14, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 12.011 sec <<< 
FAILURE! - in 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService
testParallelDownloadAttemptsForPublicResource(org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService)
  Time elapsed: 0.586 sec  <<< FAILURE!
java.lang.AssertionError: null
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.TestResourceLocalizationService.testParallelDownloadAttemptsForPublicResource(TestResourceLocalizationService.java:2108)
{noformat}
The assert occurred at this place in the code:
{code}
  // Waiting for download to start.
  Assert.assertTrue(waitForPublicDownloadToStart(spyService, 1, 200));
{code}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Resolved] (YARN-1468) TestRMRestart.testRMRestartWaitForPreviousAMToFinish get failed.

2016-10-27 Thread Jason Lowe (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jason Lowe resolved YARN-1468.
--
Resolution: Duplicate

Closing this as a duplicate of YARN-5416 since that other JIRA has a proposed 
patch and recent discussion.

> TestRMRestart.testRMRestartWaitForPreviousAMToFinish get failed.
> 
>
> Key: YARN-1468
> URL: https://issues.apache.org/jira/browse/YARN-1468
> Project: Hadoop YARN
>  Issue Type: Test
>  Components: resourcemanager
>Reporter: Junping Du
>Assignee: Junping Du
>Priority: Critical
>
> Log is as following:
> {code}
> Tests run: 13, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 149.968 sec 
> <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart
> testRMRestartWaitForPreviousAMToFinish(org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart)
>   Time elapsed: 44.197 sec  <<< FAILURE!
> junit.framework.AssertionFailedError: AppAttempt state is not correct 
> (timedout) expected: but was:
> at junit.framework.Assert.fail(Assert.java:50)
> at junit.framework.Assert.failNotEquals(Assert.java:287)
> at junit.framework.Assert.assertEquals(Assert.java:67)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.MockAM.waitForState(MockAM.java:82)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.MockRM.sendAMLaunched(MockRM.java:292)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.launchAM(TestRMRestart.java:826)
> at 
> org.apache.hadoop.yarn.server.resourcemanager.TestRMRestart.testRMRestartWaitForPreviousAMToFinish(TestRMRestart.java:464)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



Re: Incompatible change in Hadoop 2.8.0 (YARN-4126)

2016-10-20 Thread Jason Lowe
You'll probably get better traction by commenting on the offending JIRA.  I 
noticed none of the people involved in YARN-4126 are watchers of YARN-5750, so 
I'm guessing they simply missed seeing YARN-5750 when it was filed.
I haven't looked into the details of YARN-4126, but on the surface it seems 
appropriate to revert for 2.8 if it broke a major downstream project.
Jason
 

On Wednesday, October 19, 2016 4:18 PM, Robert Kanter 
 wrote:
 

 Hi,

Our Oozie team recently found that YARN-4126 appears to introduce an
incompatible change into Hadoop 2.8.0.  Namely, it removes delegation
tokens from being used in a non-secure cluster.  Clients, such as Oozie,
that were previously using a delegation token regardless of security, now
fail.  That's okay for Hadoop 3, but this was also committed to Hadoop
2.8.0.  Perhaps we should revert this?  YARN-4126 was also not marked as
incompatible...

Caused by: java.io.IOException: Delegation Token can be issued only with
kerberos authentication
at
org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getDelegationToken(ClientRMService.java:1065)
... 10 more


There are more details on YARN-5750, but I thought I'd try to
bring more attention to this by sending out an email.


thanks
- Robert


   

Re: [VOTE] Release Apache Hadoop 2.6.5 (RC1)

2016-10-10 Thread Jason Lowe
+1 (binding)
- Verified signatures and digests
- Built native from source
- Deployed to a single-node cluster and ran some sample jobs
Jason
 

On Sunday, October 2, 2016 7:13 PM, Sangjin Lee  wrote:
 

 Hi folks,

I have pushed a new release candidate (R1) for the Apache Hadoop 2.6.5
release (the next maintenance release in the 2.6.x release line). RC1
contains fixes to CHANGES.txt, and is otherwise identical to RC0.

Below are the details of this release candidate:

The RC is available for validation at:
http://home.apache.org/~sjlee/hadoop-2.6.5-RC1/.

The RC tag in git is release-2.6.5-RC1 and its git commit is
e8c9fe0b4c252caf2ebf1464220599650f119997.

The maven artifacts are staged via repository.apache.org at:
https://repository.apache.org/content/repositories/orgapachehadoop-1050/.

You can find my public key at
http://svn.apache.org/repos/asf/hadoop/common/dist/KEYS.

Please try the release and vote. The vote will run for the usual 5 days. I
would greatly appreciate your timely vote. Thanks!

Regards,
Sangjin


   

[jira] [Created] (YARN-5655) TestContainerManagerSecurity is failing

2016-09-15 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-5655:


 Summary: TestContainerManagerSecurity is failing
 Key: YARN-5655
 URL: https://issues.apache.org/jira/browse/YARN-5655
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.8.0
Reporter: Jason Lowe


TestContainerManagerSecurity has been failing recently in 2.8:
{noformat}
Running org.apache.hadoop.yarn.server.TestContainerManagerSecurity
Tests run: 2, Failures: 1, Errors: 1, Skipped: 0, Time elapsed: 80.928 sec <<< 
FAILURE! - in org.apache.hadoop.yarn.server.TestContainerManagerSecurity
testContainerManager[0](org.apache.hadoop.yarn.server.TestContainerManagerSecurity)
  Time elapsed: 44.478 sec  <<< ERROR!
java.lang.NullPointerException: null
at 
org.apache.hadoop.yarn.server.TestContainerManagerSecurity.waitForContainerToFinishOnNM(TestContainerManagerSecurity.java:394)
at 
org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:337)
at 
org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:157)

testContainerManager[1](org.apache.hadoop.yarn.server.TestContainerManagerSecurity)
  Time elapsed: 34.964 sec  <<< FAILURE!
java.lang.AssertionError: null
at org.junit.Assert.fail(Assert.java:86)
at org.junit.Assert.assertTrue(Assert.java:41)
at org.junit.Assert.assertTrue(Assert.java:52)
at 
org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testNMTokens(TestContainerManagerSecurity.java:333)
at 
org.apache.hadoop.yarn.server.TestContainerManagerSecurity.testContainerManager(TestContainerManagerSecurity.java:157)
{noformat}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org



[jira] [Created] (YARN-5653) testNonLabeledResourceRequestGetPreferrenceToNonLabeledNode fails intermittently

2016-09-14 Thread Jason Lowe (JIRA)
Jason Lowe created YARN-5653:


 Summary: 
testNonLabeledResourceRequestGetPreferrenceToNonLabeledNode fails intermittently
 Key: YARN-5653
 URL: https://issues.apache.org/jira/browse/YARN-5653
 Project: Hadoop YARN
  Issue Type: Bug
  Components: test
Reporter: Jason Lowe


Saw the following TestNodeLabelContainerAllocation failure in a recent 
precommit:
{noformat}
Running 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation
Tests run: 19, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 113.791 sec 
<<< FAILURE! - in 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation
testNonLabeledResourceRequestGetPreferrenceToNonLabeledNode(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation)
  Time elapsed: 0.266 sec  <<< FAILURE!
java.lang.AssertionError: expected:<0> but was:<1>
at org.junit.Assert.fail(Assert.java:88)
at org.junit.Assert.failNotEquals(Assert.java:743)
at org.junit.Assert.assertEquals(Assert.java:118)
at org.junit.Assert.assertEquals(Assert.java:555)
at org.junit.Assert.assertEquals(Assert.java:542)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation.checkLaunchedContainerNumOnNode(TestNodeLabelContainerAllocation.java:562)
at 
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestNodeLabelContainerAllocation.testNonLabeledResourceRequestGetPreferrenceToNonLabeledNode(TestNodeLabelContainerAllocation.java:842)
{noformat}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-dev-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-dev-h...@hadoop.apache.org


