[ 
https://issues.apache.org/jira/browse/SAMZA-563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14710144#comment-14710144
 ] 

Jon Bringhurst commented on SAMZA-563:
--------------------------------------

Hey everyone, I just thought I'd mention that as of today, we've finally 
upgraded all of our prod Yarn clusters running Samza jobs to 2.7.1 (the actual 
NM/RM version, not the version jobs use). We haven't seen any major problems.

Personally, I'm ok with a 2.6.0 upgrade for the framework dependency.

Here are the versions of Yarn we've used in production with Samza over the years:

0.23.0
1.0.0
2.0.0
2.2.0
2.4.0
2.4.1
2.5.1
2.6.0
2.7.1

Since it might be useful, here's an example yarn-site.xml used with 2.7.1:

{noformat}
<?xml version="1.0"?>

<configuration>

  <property>
    <description>The maximum number of application attempts. It's a global
    setting for all application masters. Each application master can specify
    its individual maximum number of application attempts via the API, but the
    individual number cannot be more than the global upper bound. If it is,
    the resourcemanager will override it. The default number is set to 2, to
    allow at least one retry for AM.</description>
    <name>yarn.resourcemanager.am.max-attempts</name>
    <value>50</value>
  </property>

  <property>
    <description>The class to use as the resource scheduler.</description>
    <name>yarn.resourcemanager.scheduler.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
  </property>

  <property>
    <description>The minimum allocation for every container request at the RM,
    in MBs. Memory requests lower than this won't take effect,
    and the specified value will get allocated at minimum.</description>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
  </property>

  <property>
    <description>The maximum allocation for every container request at the RM,
    in MBs. Memory requests higher than this won't take effect,
    and will get capped to this value.</description>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>19327</value>
  </property>

  <property>
    <description>The minimum allocation for every container request at the RM,
    in terms of virtual CPU cores. Requests lower than this won't take effect,
    and the specified value will get allocated the minimum.</description>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>

  <property>
    <description>Enable RM to recover state after starting. If true, then
      yarn.resourcemanager.store.class must be specified. </description>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>

  <property>
    <description>Enable RM work preserving recovery.</description>
    <name>yarn.resourcemanager.work-preserving-recovery.enabled</name>
    <value>false</value>
  </property>

  <property>
    <description>Specify the auths to be used for the ACLs specified in both
    the yarn.resourcemanager.zk-acl and
    yarn.resourcemanager.zk-state-store.root-node.acl
    properties. This takes a comma-separated list of authentication mechanisms,
    each of the form 'scheme:auth' (the same syntax used for the 'addAuth'
    command in the ZK CLI).</description>
    <name>yarn.resourcemanager.zk-auth</name>
    <value></value>
  </property>

  <property>
    <description>URI pointing to the location of the FileSystem path where
    RM state will be stored. This must be supplied when using
    org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore
    as the value for yarn.resourcemanager.store.class</description>
    <name>yarn.resourcemanager.fs.state-store.uri</name>
    <value>/foobar/rmstore</value>
  </property>

  <property>
    <description>
      Enable RM high-availability. When enabled, (1) The RM starts in the
      Standby mode by default, and transitions to the Active mode when prompted
      to. (2) The nodes in the RM ensemble are listed in
      yarn.resourcemanager.ha.rm-ids (3) The id of each RM either comes from
      yarn.resourcemanager.ha.id if yarn.resourcemanager.ha.id is explicitly
      specified or can be figured out by matching
      yarn.resourcemanager.address.id
      with local address (4) The actual physical addresses come from the
      configs of the pattern - rpc-config.id
    </description>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>

  <property>
    <description>
      Name of the cluster. In a HA setting, this is used to ensure the RM
      participates in leader election for this cluster and ensures it does
      not affect other clusters
    </description>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>yarn-foobar-samza</value>
  </property>

  <property>
    <description>
      The hostname of the RM.
    </description>
    <name>yarn.resourcemanager.hostname</name>
    <value>foobar.com</value>
  </property>

  <property>
    <description>
      The id (string) of the current RM. When HA is enabled, this is an
      optional config. The id of current RM can be set by explicitly specifying
      yarn.resourcemanager.ha.id or figured out by matching
      yarn.resourcemanager.address.id with local address See description of
      yarn.resourcemanager.ha.enabled for full details on how this is used.
    </description>
    <name>yarn.resourcemanager.ha.id</name>
    <value></value>
  </property>

  <property>
    <description>
      The list of RM nodes in the cluster when HA is enabled. See description
      of yarn.resourcemanager.ha.enabled for full details on how this is used.
    </description>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>yarn-foobar-samza-rm-1,yarn-foobar-samza-rm-2</value>
  </property>

  <property>
    <description>
      The hostname of the RM with the specified ID.
    </description>
    <name>yarn.resourcemanager.hostname.yarn-foobar-samza-rm-1</name>
    <value>foobar.com</value>
  </property>

  <property>
    <description>
      The hostname of the RM with the specified ID.
    </description>
    <name>yarn.resourcemanager.hostname.yarn-foobar-samza-rm-2</name>
    <value>foobar.com</value>
  </property>

  <property>
    <description>
      Host:Port of the ZooKeeper server to be used by the RM. This must be
      supplied when using the ZooKeeper based implementation of the RM state
      store and/or embedded automatic failover in a HA setting.
    </description>
    <name>yarn.resourcemanager.zk-address</name>
    <value>zk-foobar.com:1234</value>
  </property>

  <property>
    <description>
      Full path of the ZooKeeper znode where RM state will be stored. This
      must be supplied when using
      org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
      as the value for yarn.resourcemanager.store.class.
      Due to YARN-3077, this path must be manually created.
    </description>
    <name>yarn.resourcemanager.zk-state-store.parent-path</name>
    <value>/foobar/yarn-foobar-samza</value>
  </property>

  <property>
    <description>
      Full path of the ZooKeeper znode where RM state will be stored. This
      must be supplied when using
      org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore
      as the value for yarn.resourcemanager.store.class.
      Due to YARN-3077, this path must be manually created.
    </description>
    <name>yarn.resourcemanager.zk-state-store.parent-path</name>
    <value>/foobar/yarn-foobar-samza/state</value>
  </property>

  <property>
    <description>
      The base znode path to use for storing leader information, when using
      ZooKeeper based leader election.
    </description>
    <name>yarn.resourcemanager.ha.automatic-failover.zk-base-path</name>
    <value>/foobar/yarn-foobar-samza/election</value>
  </property>

  <property>
    <description>
      The class to use as the persistent store. If
      org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore is
      used, the store is implicitly fenced; meaning a single ResourceManager is
      able to use the store at any point in time. More details on this implicit
      fencing, along with setting up appropriate ACLs is discussed under
      yarn.resourcemanager.zk-state-store.root-node.acl.
    </description>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
  </property>

  <property>
    <description>
      Enable automatic failover. By default, it is enabled only when HA is
      enabled.
    </description>
    <name>yarn.resourcemanager.ha.automatic-failover.enabled</name>
    <value>true</value>
  </property>

  <property>
    <description>
      Enable embedded automatic failover. By default, it is enabled only when
      HA is enabled. The embedded elector relies on the RM state store to
      handle fencing, and is primarily intended to be used in conjunction with
      ZKRMStateStore.
    </description>
    <name>yarn.resourcemanager.ha.automatic-failover.embedded</name>
    <value>true</value>
  </property>

  <property>
    <description>Who will execute (launch) the containers.</description>
    <name>yarn.nodemanager.container-executor.class</name>
    <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
  </property>

  <property>
    <description>
      Number of seconds after an application finishes before the nodemanager's
      DeletionService will delete the application's localized file directory
      and log directory.

      To diagnose Yarn application problems, set this property's value large
      enough (for example, to 600 = 10 minutes) to permit examination of these
      directories. After changing the property's value, you must restart the
      nodemanager in order for it to have an effect.

      The roots of Yarn applications' work directories are configurable with
      the yarn.nodemanager.local-dirs property (see below), and the roots
      of the Yarn applications' log directories are configurable with the
      yarn.nodemanager.log-dirs property (see also below).
    </description>
    <name>yarn.nodemanager.delete.debug-delay-sec</name>
    <value>64800</value>
  </property>

  <property>
    <description>It limits the maximum number of files which will be localized
      in a single local directory. If the limit is reached then sub-directories
      will be created and new files will be localized in them. If it is set to
      a value less than or equal to 36 [which are sub-directories (0-9 and then
      a-z)] then NodeManager will fail to start. For example; [for public
      cache] if this is configured with a value of 40 ( 4 files +
      36 sub-directories) and the local-dir is "/tmp/local-dir1" then it will
      allow 4 files to be created directly inside "/tmp/local-dir1/filecache".
      For files that are localized further it will create a sub-directory "0"
      inside "/tmp/local-dir1/filecache" and will localize files inside it
      until it becomes full. If a file is removed from a sub-directory that
      is marked full, then that sub-directory will be used back again to
      localize files.
   </description>
    <name>yarn.nodemanager.local-cache.max-files-per-directory</name>
    <value>8192</value>
  </property>

  <property>
    <description>
      Where to store container logs. An application's localized log directory
      will be found in yarn.nodemanager.log-dirs/application_appid.
      Individual containers' log directories will be below this, in directories
      named container_contid. Each container directory will contain the files
      stderr, stdout, and syslog generated by that container.
    </description>
    <name>yarn.nodemanager.log-dirs</name>
    <value>/foobar/logs/userlogs</value>
  </property>

  <property>
    <description>Whether to enable log aggregation</description>
    <name>yarn.log-aggregation-enable</name>
    <value>false</value>
  </property>

  <property>
    <description>How long to keep aggregation logs before deleting them. -1
    disables. Be careful: set this too small and you will spam the name
    node.</description>
    <name>yarn.log-aggregation.retain-seconds</name>
    <value>-1</value>
  </property>

  <property>
    <description>How long to wait between aggregated log retention checks.
    If set to 0 or a negative value then the value is computed as one-tenth
    of the aggregated log retention time. Be careful: set this too small and
    you will spam the name node.</description>
    <name>yarn.log-aggregation.retain-check-interval-seconds</name>
    <value>-1</value>
  </property>

  <property>
    <description>Time in seconds to retain user logs. Only applicable if
    log aggregation is disabled
    </description>
    <name>yarn.nodemanager.log.retain-seconds</name>
    <value>604800</value>
  </property>

  <property>
    <description>Where to aggregate logs to.</description>
    <name>yarn.nodemanager.remote-app-log-dir</name>
    <value>/foobar/logs</value>
  </property>

  <property>
    <description>The remote log dir will be created at
      {yarn.nodemanager.remote-app-log-dir}/{user}/{thisParam}
    </description>
    <name>yarn.nodemanager.remote-app-log-dir-suffix</name>
    <value>logs</value>
  </property>

  <property>
    <description>Amount of physical memory, in MB, that can be allocated
    for containers.</description>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>57864</value>
  </property>

  <property>
    <description>Whether physical memory limits will be enforced for
    containers.</description>
    <name>yarn.nodemanager.pmem-check-enabled</name>
    <value>true</value>
  </property>

  <property>
    <description>Whether virtual memory limits will be enforced for
    containers.</description>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>true</value>
  </property>

  <property>
    <description>Ratio between virtual memory to physical memory when
    setting memory limits for containers. Container allocations are
    expressed in terms of physical memory, and virtual memory usage
    is allowed to exceed this allocation by this ratio.
    </description>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>8</value>
  </property>

  <property>
    <description>Number of CPU cores that can be allocated
    for containers.</description>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>96</value>
  </property>

  <property>
    <description>How often to monitor containers.</description>
    <name>yarn.nodemanager.container-monitor.interval-ms</name>
    <value>3000</value>
  </property>

  <property>
    <description>T-file compression types used to compress aggregated
    logs.</description>
    <name>yarn.nodemanager.log-aggregation.compression-type</name>
    <value>none</value>
  </property>

  <property>
    <description>No. of ms to wait between sending a SIGTERM and SIGKILL to a
    container</description>
    <name>yarn.nodemanager.sleep-delay-before-sigkill.ms</name>
    <value>1000</value>
  </property>

  <property>
    <description>Max time to wait for a process to come up when trying to
    clean up a container</description>
    <name>yarn.nodemanager.process-kill-wait.ms</name>
    <value>10000</value>
  </property>

  <property>
    <description>Max number of threads in NMClientAsync to process container
    management events</description>
    <name>yarn.client.nodemanager-client-async.thread-pool-max-size</name>
    <value>500</value>
  </property>

  <property>
    <description>
      List of directories to store localized files in. An application's
      localized file directory will be found in:
      yarn.nodemanager.local-dirs/usercache/user/appcache/application_appid.
      Individual containers' work directories, called container_contid, will be
      subdirectories of this.
    </description>
    <name>yarn.nodemanager.local-dirs</name>
    <value>/foobar</value>
  </property>

  <property>
    <description>
      When HA is enabled, the class to be used by Clients, AMs and NMs to
      failover to the Active RM. It should extend
      org.apache.hadoop.yarn.client.RMFailoverProxyProvider. This is an optional
      configuration. The default value is
      “org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider”
    </description>
    <name>yarn.client.failover-proxy-provider</name>
    <value>org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider</value>
  </property>

  <property>
    <description>The class which should help the LCE handle
    resources.</description>
    <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
    <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
  </property>

  <property>
    <description>The cgroups hierarchy under which to place YARN processes
    (cannot contain commas). If
    yarn.nodemanager.linux-container-executor.cgroups.mount
    is false (that is, if cgroups have been pre-configured), then this cgroups
    hierarchy must already exist and be writable by the NodeManager user,
    otherwise the NodeManager may fail. Only used when the LCE resources
    handler is set to the CgroupsLCEResourcesHandler.</description>
    <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
    <value>/restrain.slice</value>
  </property>

  <property>
    <description>Whether the LCE should attempt to mount cgroups if not found.
    Only used when the LCE resources handler is set to the
    CgroupsLCEResourcesHandler.</description>
    <name>yarn.nodemanager.linux-container-executor.cgroups.mount</name>
    <value>false</value>
  </property>

  <property>
    <description>Where the LCE should attempt to mount cgroups if not found.
    Common locations include /sys/fs/cgroup and /cgroup; the default location
    can vary depending on the Linux distribution in use. This path must exist
    before the NodeManager is launched. Only used when the LCE resources
    handler is set to the CgroupsLCEResourcesHandler, and
    yarn.nodemanager.linux-container-executor.cgroups.mount
    is true.</description>
    <name>yarn.nodemanager.linux-container-executor.cgroups.mount-path</name>
    <value>/cgroup</value>
  </property>

  <property>
    <description>This determines which of the two modes that LCE should use on
    a non-secure cluster. If this value is set to true, then all containers
    will be launched as the user specified in
    yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user.
    If this value is set to false, then containers will run as the user who
    submitted the application.</description>
    <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users</name>
    <value>true</value>
  </property>

  <property>
    <description>The UNIX user that containers will run as when
    Linux-container-executor is used in nonsecure mode (a use case for this is
    using cgroups) if the
    yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users
    is set to true.</description>
    <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.local-user</name>
    <value>foobar</value>
  </property>

  <property>
    <description>This flag determines whether apps should run with strict
    resource limits or be allowed to consume spare resources if they need them.
    For example, turning the flag on will restrict apps to use only their share
    of CPU, even if the node has spare CPU cycles. The default value is false
    i.e. use available resources. Please note that turning this flag on may
    reduce job throughput on the cluster.</description>
    <name>yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage</name>
    <value>false</value>
  </property>

  <property>
    <description>The Unix group of the NodeManager. It should match the
    setting in “container-executor.cfg”. This configuration is required
    for validating the secure access of the container-executor
    binary.</description>
    <name>yarn.nodemanager.linux-container-executor.group</name>
    <value>foobar</value>
  </property>

  <property>
    <description>This setting lets you limit the cpu usage of all YARN
    containers. It sets a hard upper limit on the cumulative CPU usage of the
    containers. For example, if set to 60, the combined CPU usage of all YARN
    containers will not exceed 60%.</description>
    <name>yarn.nodemanager.resource.percentage-physical-cpu-limit</name>
    <value>90</value>
  </property>

  <property>
    <description>The hostname of the Timeline service web
    application.</description>
    <name>yarn.timeline-service.hostname</name>
    <value>foobar.com</value>
  </property>

  <property>
    <description>Handler thread count to serve the client RPC
    requests.</description>
    <name>yarn.timeline-service.handler-thread-count</name>
    <value>10</value>
  </property>

  <property>
    <description>Enables cross-origin support (CORS) for web services where
    cross-origin web response headers are needed. For example, javascript making
    a web services request to the timeline server.</description>
    <name>yarn.timeline-service.http-cross-origin.enabled</name>
    <value>true</value>
  </property>

  <property>
    <description>Comma separated list of origins that are allowed for web
    services needing cross-origin (CORS) support. Wildcards (*) and patterns
    allowed</description>
    <name>yarn.timeline-service.http-cross-origin.allowed-origins</name>
    <value>*</value>
  </property>

  <property>
    <description>Comma separated list of methods that are allowed for web
    services needing cross-origin (CORS) support.</description>
    <name>yarn.timeline-service.http-cross-origin.allowed-methods</name>
    <value>GET,POST,HEAD</value>
  </property>

  <property>
    <description>Comma separated list of headers that are allowed for web
    services needing cross-origin (CORS) support.</description>
    <name>yarn.timeline-service.http-cross-origin.allowed-headers</name>
    <value>X-Requested-With,Content-Type,Accept,Origin</value>
  </property>

  <property>
    <description>The number of seconds a pre-flighted request can be cached
    for web services needing cross-origin (CORS) support.</description>
    <name>yarn.timeline-service.http-cross-origin.max-age</name>
    <value>1800</value>
  </property>

  <property>
    <description>Indicate to ResourceManager as well as clients whether
    history-service is enabled or not. If enabled, ResourceManager starts
    recording historical data that Timeline service can consume. Similarly,
    clients can redirect to the history service when applications
    finish if this is enabled.</description>
    <name>yarn.timeline-service.generic-application-history.enabled</name>
    <value>true</value>
  </property>

  <property>
    <description>Store class name for history store, defaulting to file system
    store</description>
    <name>yarn.timeline-service.generic-application-history.store-class</name>
    <value>org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore</value>
  </property>

  <property>
    <description>Indicate to clients whether Timeline service is enabled or not.
    If enabled, the TimelineClient library used by end-users will post entities
    and events to the Timeline server.</description>
    <name>yarn.timeline-service.enabled</name>
    <value>true</value>
  </property>

  <property>
    <description>Store class name for timeline store.</description>
    <name>yarn.timeline-service.store-class</name>
    <value>org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore</value>
  </property>

  <property>
    <description>Enable age off of timeline store data.</description>
    <name>yarn.timeline-service.ttl-enable</name>
    <value>true</value>
  </property>

  <property>
    <description>Time to live for timeline store data in
    milliseconds.</description>
    <name>yarn.timeline-service.ttl-ms</name>
    <value>604800000</value>
  </property>

</configuration>
{noformat}
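
As a quick sanity check on a file like the one above, here is a minimal sketch (not part of Samza or Hadoop) that parses a yarn-site.xml with Python's standard library, reports property names that are defined more than once, and does the memory arithmetic for the scheduler limits. The helper names load_properties and memory_summary and the trimmed SAMPLE document are made up for illustration. It assumes that when a name is repeated the last definition takes effect, which I believe matches how Hadoop loads a single resource, but please verify that against your Hadoop version.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Trimmed copy of a few properties from the example config (illustration only).
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>128</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>19327</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>57864</value>
  </property>
  <property>
    <name>yarn.nodemanager.vmem-pmem-ratio</name>
    <value>8</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-state-store.parent-path</name>
    <value>/foobar/yarn-foobar-samza</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-state-store.parent-path</name>
    <value>/foobar/yarn-foobar-samza/state</value>
  </property>
</configuration>
"""


def load_properties(xml_text):
    """Return (effective, duplicates) for a Hadoop-style configuration document."""
    root = ET.fromstring(xml_text)
    effective = {}
    counts = Counter()
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if not name:
            continue  # skip malformed entries without a name
        value = (prop.findtext("value") or "").strip()
        counts[name] += 1
        # Assumption: a later definition of the same name overrides an earlier one.
        effective[name] = value
    duplicates = sorted(name for name, count in counts.items() if count > 1)
    return effective, duplicates


def memory_summary(conf):
    """Relate the scheduler's per-container limits to the NodeManager's memory."""
    min_mb = int(conf["yarn.scheduler.minimum-allocation-mb"])
    max_mb = int(conf["yarn.scheduler.maximum-allocation-mb"])
    node_mb = int(conf["yarn.nodemanager.resource.memory-mb"])
    ratio = float(conf["yarn.nodemanager.vmem-pmem-ratio"])
    return {
        "min_not_above_max": min_mb <= max_mb,
        "max_fits_on_node": max_mb <= node_mb,
        "largest_containers_per_node": node_mb // max_mb,
        "vmem_limit_mb_for_max": max_mb * ratio,
    }


if __name__ == "__main__":
    conf, dups = load_properties(SAMPLE)
    print("duplicate property names:", dups)
    print("memory summary:", memory_summary(conf))
```

With the values from the example, a node advertising 57864 MB can hold two containers of the 19327 MB maximum size, and a vmem-pmem ratio of 8 puts the virtual memory ceiling for such a container at 154616 MB.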

> Upgrade Samza to YARN 2.6.0
> ---------------------------
>
>                 Key: SAMZA-563
>                 URL: https://issues.apache.org/jira/browse/SAMZA-563
>             Project: Samza
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 0.9.0
>            Reporter: Chris Riccomini
>            Assignee: Aleksandar Pejakovic
>         Attachments: HELLO-SAMZA-563-2.6.0.patch, HELLO-SAMZA-563.0.patch, 
> SAMZA-563-2.6.0.patch, SAMZA-563.0.patch, SAMZA-563.2.patch
>
>
> Samza is currently running on YARN 2.4.0. We should upgrade it to YARN 2.6.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
