[jira] [Updated] (YARN-4696) EntityGroupFSTimelineStore to work in the absence of an RM
[ https://issues.apache.org/jira/browse/YARN-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated YARN-4696:
---------------------------------
    Attachment: YARN-4696-014.patch

Patch -014; addresses feedback.

h3. {{FileSystemTimelineWriter.java}}

bq. TIMELINE_SERVICE_ENTITYFILE_FS_SUPPORT_APPEND move to YarnConfiguration?

Done.

bq. Why LogFDsCache#flush was changed into synchronized? I believe we're doing fine-grained locking here (with each of the FDs), and only flush in LogFDsCache is marked as synchronized? What am I missing here?

I'm not sure now; I think I was worried about two flush() calls happening at the same time. I've taken it out.

h3. {{TimelineWriter.java}}

bq. Not sure if "Direct timeline writer" is clear enough to indicate where the data goes to and which pattern the writer is following? By saying "direct" here, do we mean we're using a write-through strategy?

I'd meant not going via the FS, but yes, it is utterly uninformative, especially given we have the URL of the endpoint. It is now {{"Timeline writer posting to " + resURI}}.

h3. {{EntityGroupFSTimelineStore.java}}

bq. In scanActiveLogs, the new variable "scanned" looks like a little bit confusing: when we return the variable scanned, the actual scanning jobs are not guaranteed to be done. So it looks like something "to be scanned" when we return? My only concern is this naming may give people false indication that by the time this method returns, there are a number of logs that are already scanned. This also applies to EntityLogScanner

Now {{logsToScanCount}}.

> EntityGroupFSTimelineStore to work in the absence of an RM
> ----------------------------------------------------------
>
> Key: YARN-4696
> URL: https://issues.apache.org/jira/browse/YARN-4696
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: timelineserver
> Affects Versions: 2.8.0
> Reporter: Steve Loughran
> Assignee: Steve Loughran
> Attachments: YARN-4696-001.patch, YARN-4696-002.patch,
> YARN-4696-003.patch, YARN-4696-005.patch, YARN-4696-006.patch,
> YARN-4696-007.patch, YARN-4696-008.patch, YARN-4696-009.patch,
> YARN-4696-010.patch, YARN-4696-012.patch, YARN-4696-013.patch,
> YARN-4696-014.patch
>
>
> {{EntityGroupFSTimelineStore}} now depends on an RM being up and running; the
> configuration pointing to it. This is a new change, and impacts testing where
> you have historically been able to test without an RM running.
> The sole purpose of the probe is to automatically determine if an app is
> running; it falls back to "unknown" if not. If the RM connection was
> optional, the "unknown" codepath could be called directly, relying on age of
> file as a metric of completion
> Options
> # add a flag to disable RM connect
> # skip automatically if RM not defined/set to 0.0.0.0
> # disable retries on yarn client IPC; if it fails, tag app as unknown.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
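As an illustration only (this is not code from any of the attached patches), the fallback outlined in the issue description, relying on the age of a file when the RM cannot say whether an app is running, might be sketched as below. The class name, the enum values and the {{unknownActiveWindow}} parameter are all assumptions made for the sketch.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper: names and threshold are illustrative, not from the patch. */
final class AgeBasedLiveness {
  enum AppState { ACTIVE, COMPLETED, UNKNOWN }

  /**
   * Resolve the state of an app when no RM answer is available.
   * Assumption: a log untouched for longer than unknownActiveWindow
   * is treated as belonging to a finished app.
   */
  static AppState resolve(AppState reported, Instant lastModified, Instant now,
      Duration unknownActiveWindow) {
    if (reported != AppState.UNKNOWN) {
      // A definite answer (for example from an RM probe) always wins.
      return reported;
    }
    Duration age = Duration.between(lastModified, now);
    return age.compareTo(unknownActiveWindow) > 0
        ? AppState.COMPLETED
        : AppState.ACTIVE;
  }
}
```

With no RM configured, every app would start as {{UNKNOWN}} and be resolved purely from the modification time of its log.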
[jira] [Updated] (YARN-4696) EntityGroupFSTimelineStore to work in the absence of an RM
[ https://issues.apache.org/jira/browse/YARN-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated YARN-4696:
---------------------------------
    Attachment: YARN-4696-013.patch

Patch -013, "special scale edition"
# the various log enum & check executors now exit if the executor terminates between files. This appears to reduce the recurrence of YARN-4772 and leveldb problems.
# the ordering of test setup has changed to apply the FS between {{EntityGroupFSTimelineStore}} init and the start of work; {{EntityGroupFSTimelineStore}} has been modified to support this.
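The "exit between files" behaviour in item 1 can be pictured with a small sketch. This is a guess at the shape of such a loop, not the patch itself; {{InterruptibleScan}}, {{scan}} and the {{parser}} callback are hypothetical names.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.function.Consumer;

/** Hypothetical scan loop that gives up once its executor has been shut down. */
final class InterruptibleScan {
  /**
   * Process files in order and return how many were handled.
   * The shutdown check runs before each file, so a file that has already
   * been started is finished, but no new file is picked up afterwards.
   */
  static int scan(List<String> files, ExecutorService executor,
      Consumer<String> parser) {
    int processed = 0;
    for (String file : files) {
      if (executor.isShutdown()) {
        break;
      }
      parser.accept(file);
      processed++;
    }
    return processed;
  }
}
```

Checking only between files keeps each parse atomic while still letting a service stop promptly when many files are queued.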
[jira] [Updated] (YARN-4696) EntityGroupFSTimelineStore to work in the absence of an RM
[ https://issues.apache.org/jira/browse/YARN-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated YARN-4696:
---------------------------------
    Attachment: YARN-4696-012.patch

Patch 012. This is what I've been using successfully in tests: I've given up trying to have incomplete apps when the FS is LocalFileSystem, and instead use a MiniHDFSCluster for those test cases. The tests all work.

This is ready for review and, ideally, for getting into 2.8.
[jira] [Updated] (YARN-4696) EntityGroupFSTimelineStore to work in the absence of an RM
[ https://issues.apache.org/jira/browse/YARN-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated YARN-4696:
---------------------------------
    Attachment: YARN-4696-010.patch

YARN-4696 patch 010. Fixes checkstyle warnings. The FileSystemTimelineWriter uses FileSystem.newInstance() to create a new FS instance with the chosen retry policies.
[jira] [Updated] (YARN-4696) EntityGroupFSTimelineStore to work in the absence of an RM
[ https://issues.apache.org/jira/browse/YARN-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated YARN-4696:
---------------------------------
    Attachment: YARN-4696-009.patch

This is the 009 patch; the difference from 008 is that it correctly converts IllegalArgumentException to a BadRequestException with the nested stack trace.

With this patch applied alongside the current YARN-4545 patch, I now successfully have
# all tests against completed jobs working with file://
# tests needing to track incomplete jobs working with an HDFS minicluster.

LocalFS isn't going to work as a destination for incomplete jobs, as it doesn't flush(). Nor will things like S3. That'll need documenting.
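A rough sketch of the exception conversion mentioned above, for readers who have not opened the patch. The {{BadRequestException}} here is a local stand-in written for this example (the real web-layer class is not reproduced), and {{RequestGuard}} is a hypothetical name.

```java
import java.util.function.Supplier;

/** Local stand-in for the web layer's bad-request exception (assumption). */
final class BadRequestException extends RuntimeException {
  BadRequestException(String message, Throwable cause) {
    super(message, cause);
  }
}

/** Hypothetical wrapper around a request handler. */
final class RequestGuard {
  static <T> T call(Supplier<T> action) {
    try {
      return action.get();
    } catch (IllegalArgumentException e) {
      // Pass the original exception as the cause so that its stack trace
      // is nested in whatever the server logs or returns.
      throw new BadRequestException(e.toString(), e);
    }
  }
}
```

Keeping the cause attached is what makes the original failure, such as an application ID that would not parse, traceable from the server side.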
[jira] [Updated] (YARN-4696) EntityGroupFSTimelineStore to work in the absence of an RM
[ https://issues.apache.org/jira/browse/YARN-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated YARN-4696:
---------------------------------
    Attachment: YARN-4696-008.patch

Patch -008. This removes a subclass of RawLocalFileSystem that I'd been trying to instantiate directly. That doesn't work... I won't go into the details.

Note also that patch -007
# has the code to remember the cache option before the {{FileSystemTimelineWriter}} gets a file, and restores it after
# has commented out the entire action of disabling the cache.

Why #2? It's to try to get a local FS with checksumming disabled picked up in test cases. I've not got that working.

Why #1? Because some other part of the JVM may want caching, and so it won't want this class disabling it.

I'm assuming that caching was disabled so that this class closing its FS instance would not affect other users of the cached instance. If so, the solution is: don't close the FS when the service is stopped. We can rely on Hadoop itself to stop all filesystems in JVM shutdown.

Of course, if the concern is that it's other bits of code closing the FS, that's harder. In such a case, if I do manage to get my local FS test working, then we may need a test-time option to not disable the cache.
[jira] [Updated] (YARN-4696) EntityGroupFSTimelineStore to work in the absence of an RM
[ https://issues.apache.org/jira/browse/YARN-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated YARN-4696:
---------------------------------
    Attachment: YARN-4696-007.patch

Patch 007: files that are stat-ed as empty are not skipped, but no attempt is made to log a parse problem if the length is 0 and no data has ever been read from the file before (i.e. offset=0).
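Read literally, the rule in patch 007 comes down to one predicate. The sketch below is an interpretation, with hypothetical names, and says nothing about how the real scanner is structured.

```java
/** Hypothetical policy object: decides whether a failed parse is worth logging. */
final class ParseProblemPolicy {
  /**
   * @param statLength length reported when the file was stat-ed
   * @param offset     bytes already read from this file in earlier scans
   * @return false only for a file that is empty and has never yielded data
   */
  static boolean shouldLogParseProblem(long statLength, long offset) {
    boolean emptyAndNeverRead = statLength == 0 && offset == 0;
    return !emptyAndNeverRead;
  }
}
```

A file that previously produced data (offset greater than 0) but now reports a length of 0 would still be logged, since that combination suggests something has actually gone wrong.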
[jira] [Updated] (YARN-4696) EntityGroupFSTimelineStore to work in the absence of an RM
[ https://issues.apache.org/jira/browse/YARN-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated YARN-4696:
---------------------------------
    Attachment: YARN-4696-006.patch

Patch 006; ongoing (and currently unsuccessful) attempt to use file:// as a destination for timeline entities.

* some better logging of read problems, to differentiate an empty file from a missing file.
* add cleanup of TimelineDataManager in try-with-resources.
* explicitly throw an FNFE if the active dir isn't found (rather than a generic IOE).
* the constant {{FileSystemTimelineWriter.TIMELINE_SERVICE_ENTITYFILE_FS_SUPPORT_APPEND}} is public, so that you can turn off append support. I know we want a proper API here (HADOOP-9565), but it's not done yet: a flag is all you have. Making the constant public will make it easier to track down its use in future.
* includes YARN-4716; flush() interface. This propagates all the way down to the FS API (good), but as file:// is a CRC filesystem, flush/hflush doesn't actually work (it buffers until a CRC-block of data is ready). And there's no way to turn off that feature via a config option.

What I'm seeing, then, is that when an app completes, its changes are picked up fine. But incomplete apps aren't: instead the scanner sees a 0-byte file and skips it, which isn't useful at all.

I suspect the issue here is the difference between hdfs and file filesystem behaviours, something I could fix by moving to miniHDFS. My fear is that people may want to use file:// or a similar FS in production, and what we have today doesn't work.
[jira] [Updated] (YARN-4696) EntityGroupFSTimelineStore to work in the absence of an RM
[ https://issues.apache.org/jira/browse/YARN-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated YARN-4696:
---------------------------------
    Attachment: YARN-4696-005.patch

Patch -005 (there was an -004, but I don't think I submitted it)
# log of post, put and file save
# make close() operation robust
# scan for files skips files that exist but are 0 bytes long. That is not an error.
[jira] [Updated] (YARN-4696) EntityGroupFSTimelineStore to work in the absence of an RM
[ https://issues.apache.org/jira/browse/YARN-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated YARN-4696:
---------------------------------
    Attachment: YARN-4696-003.patch

Patch -003

Addresses
# discussion
# checkstyle
# javadocs

Adds a test to verify lifecycle walkthrough and invocation of lookup.
[jira] [Updated] (YARN-4696) EntityGroupFSTimelineStore to work in the absence of an RM
[ https://issues.apache.org/jira/browse/YARN-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated YARN-4696:
---------------------------------
    Attachment: YARN-4696-002.patch

Patch 002
# removes the switch in exchange for making creation/use of RM something that can be subclassed or mocked away
# switched to CompositeService for automatic handling of child service lifecycle; by adding yarnclient & the others they get this lifecycle (and there is no need for special yarnClient!=null checks anywhere in the code).
# also cleaned up the {{cacheItem.getStore().close()}} calls. I managed to get an NPE if the store was null; they are services, so can be handled via {{ServiceOperations}}.

Finally: when the web API catches an illegal argument exception (or any other), the string value is included. This helps track down problems like application ID conversion trouble in your plugin, which would otherwise fail with no meaningful error messages or stack traces on either the client or the server.
[jira] [Updated] (YARN-4696) EntityGroupFSTimelineStore to work in the absence of an RM
[ https://issues.apache.org/jira/browse/YARN-4696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated YARN-4696:
---------------------------------
    Attachment: YARN-4696-001.patch

Patch -001; things I had to do to get my (external, spark integration) test closer to working.

These are a combination of things that are absolutely needed (disabling RM, flushing on close()), generally better (exception handling), and needed to debug what's going on (all the improved logging).

# RM integration can be disabled; the timeline store then only uses modified times as a liveness test. This includes checks for null around uses of yarnClient.
# I took the opportunity to clean up service shutdown in the process.
# YARN-4695 recommendations: all worker threads unwrap exceptions and, if they are interrupted exceptions, skip the stack trace.
# better logging @ debug (including # of scanned apps)
# {{TimelineWriter}} doesn't rewrap IOEs in IOEs, and wraps interrupted exceptions into {{InterruptedIOException}}
# {{FileSystemTimelineWriter.close()}} does a {{flush()}}. This stops any last events getting lost.

There are tests, but not here. Look in https://github.com/steveloughran/spark-timeline-integration
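The {{TimelineWriter}} exception handling listed in patch -001 follows a common pattern, sketched here under assumptions: {{IoExceptions}} and {{toIOException}} are hypothetical names, and only the JDK's own {{java.io.InterruptedIOException}} is used.

```java
import java.io.IOException;
import java.io.InterruptedIOException;

/** Hypothetical helper showing the wrap-once pattern. */
final class IoExceptions {
  static IOException toIOException(Throwable t) {
    if (t instanceof IOException) {
      // Already an IOE: hand it back unchanged rather than adding a layer.
      return (IOException) t;
    }
    if (t instanceof InterruptedException) {
      // InterruptedIOException has no cause-taking constructor,
      // so the cause is attached with initCause().
      InterruptedIOException iioe = new InterruptedIOException(t.toString());
      iioe.initCause(t);
      return iioe;
    }
    // Anything else gets exactly one IOException wrapper.
    return new IOException(t);
  }
}
```

A caller that catches {{InterruptedIOException}} separately can then treat an interruption as a shutdown signal and skip the stack trace, in the spirit of the YARN-4695 recommendations.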