Re: Request double-check on Ambari config logic (ES network_host)

Matt Foley Tue, 02 May 2017 22:50:46 -0700

Okay, several items that merit discussion:

Fact A. Experiment shows that the contents of the <value> fields in 
elastic-site.xml, and hence the values in Ambari GUI config fields, are just 
used as big unquoted Unicode character sequences, including any quote marks, 
square brackets or other punctuation, until they are written into the yaml.j2 
template by the {{ }} operator.  Thus, the value:
    ["_eth0_","_lo_"]
is a 17-character Unicode string.  Yaml, of course, actually parses the result.
This is actually nice: it makes it easy to understand and manipulate the 
textual content of the field.
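
A minimal sketch of Fact A, using only the Python standard library. Here
string.Template stands in for the Jinja {{ }} operator, and the template line
is a hypothetical one-line excerpt, not the real yaml.j2 content. The point is
only that the value stays an opaque string until it is pasted into the
template text.

```python
from string import Template

# The value exactly as typed into the Ambari config field.
# Nothing has parsed the brackets or quote marks yet; it is just text.
network_host = '["_eth0_","_lo_"]'

# Hypothetical one-line stand-in for the yaml.j2 template.
# $network_host plays the role of Jinja's {{ network_host }} here.
template_line = Template("network.host: $network_host")

# Substitution pastes the characters in verbatim.
rendered = template_line.substitute(network_host=network_host)

print(type(network_host).__name__, len(network_host))
print(rendered)
```

Only the consumer of the rendered line (Yaml) assigns meaning to the brackets
and quote marks.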


Fact B. In the Hadoop world, config parameters that are lists are usually 
single strings containing a sequence of unquoted comma-delimited substrings 
with no blank spaces.  The substring elements of the list are forbidden to have 
commas or anything else that would disrupt fairly obvious parsing.  Parsing is 
done by apache commons code or plain old Java.  Users are USED to working with 
these kinds of config params in Ambari.
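
As a rough illustration of that "Hadoop style" (a sketch only, not Hadoop's
actual parsing code; the function name and host names are hypothetical):

```python
def parse_hadoop_list(value):
    """Split a Hadoop-style list parameter into its elements.

    The value is one string of comma-delimited substrings. No quoting or
    escaping is recognized, which is why elements may not contain commas.
    """
    return [item for item in value.split(",") if item]


hosts = parse_hadoop_list("node1.example.com,node2.example.com")
print(hosts)
```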

But in Elasticsearch, and some other Metron components, the parsing is done by 
Yaml.  This means:
-    To be a list, square brackets must be provided – either in the value, the 
python processing, or the template.  If only one value is provided it does not 
have to be in a list.
-    List elements want to be delimited by comma-space, not just comma 
(although it’s not clear whether this actually causes errors with non-numeric 
list elements)
-    Quote marks around string list elements are optional except when 
necessary.  This greatly increases the opportunity for confusion and error.
-    Colon is a special character (related to dictionary parsing), so if you 
need a colon in a string, the string needs quote marks.  “_local_” doesn’t need 
quote marks; “_local:ipv4_” does require quote marks.  Character sequences that 
would mis-parse as poorly formed numbers also need quote marks: “0.0.0.0”.
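
The quoting rules above can be approximated with a deliberately conservative
helper. This is a sketch under assumptions: the function names are
hypothetical, and it quotes more often than Yaml strictly requires rather than
reproducing the Yaml specification.

```python
import re

# Characters treated as risky in an unquoted Yaml list element.
# Conservative on purpose: when in doubt, quote.
_SPECIAL = set(":[]{},#&*!|>'\"%@`")

# Digits and dots only, e.g. 0.0.0.0, which could be mistaken for a number.
_NUMBER_LIKE = re.compile(r"^[-+]?[0-9][0-9.]*$")


def needs_quotes(element):
    """Return True if the element should be quoted to be safe."""
    if not element or element != element.strip():
        return True
    if any(ch in _SPECIAL for ch in element):
        return True
    return bool(_NUMBER_LIKE.match(element))


def yaml_element(element):
    """Return the element, double-quoted only when needs_quotes() says so."""
    if needs_quotes(element):
        escaped = element.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'
    return element


for sample in ("_local_", "_local:ipv4_", "0.0.0.0"):
    print(yaml_element(sample))
```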

Fact C. The “network.host” Elasticsearch parameter is a cheat, both way more 
powerful and way more limited than one might expect.
It is a cheat because it masks two underlying parameters: network.bind_host and 
network.publish_host.  This is all documented at 
https://www.elastic.co/guide/en/elasticsearch/reference/2.3/modules-network.html
 and implemented in 
https://github.com/elastic/elasticsearch/blob/2.3/core/src/main/java/org/elasticsearch/common/network/NetworkService.java
 (methods resolveBindHostAddresses() and resolvePublishHostAddresses()). 
-    network.bind_host is the set of addresses Elasticsearch “binds to” (listens 
on). Supposedly it will actually bind to multiple network addresses if 
available and specified.  Whatever set of specifiers you give network.host gets 
expanded into a list of actual bind addresses.  If you give it the wildcard 
value (“0.0.0.0” for ipv4), it will bind to all available addresses.
-    network.publish_host is the address Elasticsearch “publishes” for clients 
and other servers to connect to. It will publish only one address.  If you give 
it a set of addresses, it picks the most “desirable” of the set – it ensures 
the address actually is accessible, and it prefers ipv4 (or 6, depending on 
another config), then global, then site-local, then link-local, then loopback. 
Within each category it orders by numeric magnitude of the IP address, which is 
hardly meaningful.  This means the published address can be wrong on a 
multi-homed server or VM, if you don’t appropriately constrain it.
-    The parameter values can be network addresses, network interface names, 
host names (to be dereferenced via DNS), “special” names denoting predefined 
sets of addresses, and combinations of the above.
-    Wildcard and loopback addresses are allowed.  
-    If the wildcard is provided it must be the ONLY value provided (list of 
length == 1), or ES will throw an error.
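
A rough sketch of the publish_host selection order described above, using
Python's standard ipaddress module. This is not the NetworkService.java code:
the function names are hypothetical, is_private is used as an approximation of
"site-local", ascending numeric order within a category is an assumption, and
the accessibility check is omitted.

```python
import ipaddress


def scope_rank(addr):
    """Lower rank is more desirable: global, site-local, link-local, loopback."""
    if addr.is_loopback:
        return 3
    if addr.is_link_local:
        return 2
    if addr.is_private:
        return 1
    return 0


def pick_publish_address(candidates, prefer_ipv4=True):
    """Pick the single address that would be published from a candidate set."""
    addrs = [ipaddress.ip_address(c) for c in candidates]

    def sort_key(addr):
        # Preferred address family first, then scope, then numeric value.
        family_rank = 0 if (addr.version == 4) == prefer_ipv4 else 1
        return (family_rank, scope_rank(addr), int(addr))

    return str(min(addrs, key=sort_key))


# A multi-homed host with two site-local addresses: the tie is broken purely
# by numeric magnitude, not by which network the clients can actually reach.
print(pick_publish_address(["127.0.0.1", "192.168.1.10", "10.0.0.5"]))
```

In this sketch nothing about the winning address says it is the one the rest
of the cluster can reach, which is the multi-homing concern.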

Discussion item 1:  If you use network.host, the same list of addresses gets 
sent to both network.bind_host and network.publish_host.  The algorithm for 
picking the single publish_host address is not good enough, at least in ES 2.3, 
to give certainty that the right address will be published on multi-homed 
servers or VMs (although on non-multi-homed ones it should generally work fine).

It seems to me that specifying exactly one of _local_, _site_, or _global_ will 
usually give the right result, but that too can fail if the server has multiple 
addresses within the same category.

I think network.bind_host and network.publish_host should be separately 
configured, as they are with Hadoop.
There’s an article here: 
https://community.hortonworks.com/content/kbentry/24277/parameters-for-multi-homing.html
that discusses these issues at some length, and clarifies why they must be 
separately configured.

What do you-all think?

Discussion item 2:  While it’s fine to use 0.0.0.0 for the bind address, it 
gives no guidance at all as to the needed publish_host value. Using _local_ for 
QuickDev and single-node deployments, and _site_ for FullDev deployments and 
all cluster deployments, is probably a reasonable choice for publish_host.
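
One way the proposal in items 1 and 2 might look in the python processing is
sketched below. The function name and the single_node flag are hypothetical;
only the two parameter names and the suggested values come from the discussion
above.

```python
def network_settings(single_node):
    """Return the two yaml lines for separately configured bind/publish hosts.

    single_node: True for QuickDev or single-node deployments,
                 False for FullDev and cluster deployments.
    """
    publish_host = "_local_" if single_node else "_site_"
    return [
        # Listen on every interface; quoted so Yaml does not treat it as a number.
        'network.bind_host: "0.0.0.0"',
        # Publish exactly one category of address.
        "network.publish_host: " + publish_host,
    ]


print("\n".join(network_settings(single_node=False)))
```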

What do you-all think?

Discussion item 3: Should we attempt to further the “hadoop style” of config 
parameter, and silently add the square brackets and perhaps substring quotes in 
python processing?  Or should we say users need to understand ES configuration, 
and tell them to put the list in square brackets themselves, if they need a 
list entry in this parameter, per 
https://www.elastic.co/guide/en/elasticsearch/reference/2.3/modules-network.html
 ?
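
For the first option, a minimal sketch of what "silently adding the square
brackets and substring quotes" could look like. The function name is
hypothetical, the code only recognizes the ipv4 wildcard, and it always quotes
every element rather than deciding case by case.

```python
def to_yaml_list(value):
    """Turn a Hadoop-style or already-bracketed value into a Yaml flow list."""
    text = value.strip()
    # Strip one pair of redundant square brackets if the user supplied them.
    if text.startswith("[") and text.endswith("]"):
        text = text[1:-1]

    elements = []
    for raw in text.split(","):
        # Drop surrounding blanks and any quote marks the user typed.
        item = raw.strip().strip('"').strip("'").strip()
        if item:
            elements.append(item)

    # Per Fact C, the wildcard must be the only value in the list.
    if "0.0.0.0" in elements and len(elements) > 1:
        raise ValueError("wildcard 0.0.0.0 must be the only value")

    # Always quote: colons and number-like values then need no special cases.
    return "[ " + ", ".join('"%s"' % e for e in elements) + " ]"


print(to_yaml_list("_lo:ipv4_,_eth0:ipv4_"))
print(to_yaml_list('["_local_", "_site_"]'))
```

The trade-off is that users no longer see exactly what lands in the yaml file,
which is the concern behind the second option.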

Please share your thoughts,
Thanks,
--Matt


On 5/2/17, 9:57 PM, "Matt Foley" <mfo...@hortonworks.com> wrote:

    Hi Otto,
    This event derives from this line of code:
    https://github.com/elastic/elasticsearch/blob/2.3/core/src/main/java/org/elasticsearch/action/support/master/TransportMasterNodeAction.java#L148
    which suggests that a cluster action has been requested on a local (loopback)
    address.  This is not surprising given what I’ve learned about the semantics
    of network.host with wildcard address.
    See next message, item C.  Basically, while the wildcard causes ES to “listen”
    on all IP addresses, it only *publishes* one, and on a multi-homed server it
    can be the wrong one.  I can’t be certain this causes what you’re seeing, but
    it seems feasible.
    
    From: Otto Fowler <ottobackwa...@gmail.com>
    Date: Tuesday, May 2, 2017 at 8:30 PM
    To: "d...@metron.incubator.apache.org" <d...@metron.incubator.apache.org>, 
Matt Foley <mfo...@hortonworks.com>, "dev@metron.apache.org" 
<dev@metron.apache.org>, "zeo...@gmail.com" <zeo...@gmail.com>
    Subject: Re: Request double-check on Ambari config logic (ES network_host)
    
    OK.
    I tried it using this method, and master ( adding [] ).  In both cases, I can
    hit 9200 from other machines, but in both cases I’m getting ES master errors:
    
    ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];]
    at org.elasticsearch.cluster.block.ClusterBlocks.indexBlockedException(ClusterBlocks.java:174)
    at org.elasticsearch.action.admin.indices.create.TransportCreateIndexAction.checkBlock(TransportCreateIndexAction.java:66)
    at org.elasticsearch.action.admin.indices.create.TransportCreateIndexAction.checkBlock(TransportCreateIndexAction.java:41)
    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.doStart(TransportMasterNodeAction.java:148)
    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.start(TransportMasterNodeAction.java:140)
    at org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:107)
    at org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:51)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:137)
    at org.elasticsearch.action.index.TransportIndexAction.doExecute(TransportIndexAction.java:98)
    at org.elasticsearch.action.index.TransportIndexAction.doExecute(TransportIndexAction.java:66)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:137)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:85)
    at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:58)
    at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:359)
    at org.elasticsearch.client.FilterClient.doExecute(FilterClient.java:52)
    at org.elasticsearch.rest.BaseRestHandler$HeadersAndContextCopyClient.doExecute(BaseRestHandler.java:83)
    at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:359)
    at org.elasticsearch.client.support.AbstractClient.index(AbstractClient.java:371)
    at org.elasticsearch.rest.action.index.RestIndexAction.handleRequest(RestIndexAction.java:102)
    at org.elasticsearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:54)
    at org.elasticsearch.rest.RestController.executeHandler(RestController.java:205)
    at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:166)
    at org.elasticsearch.http.HttpServer.internalDispatchRequest(HttpServer.java:128)
    at org.elasticsearch.http.HttpServer$Dispatcher.dispatchRequest(HttpServer.java:86)
    at org.elasticsearch.http.netty.NettyHttpServerTransport.dispatchRequest(NettyHttpServ
    
    and kibana is not good.
    
    not sure what that error means.
    I have 5 nodes, and put es master on #5, with #3,4 as datanodes.
    
    Sorry, but I don’t think my setup is going to be much help at this point.
    
    
    
    
    On May 2, 2017 at 17:19:43, Matt Foley (mfo...@hortonworks.com) wrote:
    The default will now be “0.0.0.0”, and not eth0. And this will work if
    suggestions from various community members and a suggestion in the old 1.x
    documentation for ES are correct. The 2.x documentation (we specify ES 2.3)
    doesn’t mention “0.0.0.0”; I think it’s likely to still work, but it needs
    testing.
    
    Thanks,
    --Matt
    
    From: Otto Fowler <ottobackwa...@gmail.com>
    Date: Tuesday, May 2, 2017 at 11:27 AM
    To: "d...@metron.incubator.apache.org" <d...@metron.incubator.apache.org>,
    Matt Foley <mfo...@hortonworks.com>, "dev@metron.apache.org"
    <dev@metron.apache.org>, "zeo...@gmail.com" <zeo...@gmail.com>
    Subject: Re: Request double-check on Ambari config logic (ES network_host)
    
    Are you saying that the defaults should work now?
    Or they should work, but I still need to change the interface from eth0?
    
    
    
    
    On May 2, 2017 at 13:36:11, Matt Foley (mfo...@hortonworks.com) wrote:
    Hi Otto,
    The basic change to use “0.0.0.0” as the default binding, and put the square
    brackets in the template text instead of the parameter value, is now
    available in
    https://github.com/mattf-horton/incubator-metron branch METRON-905 commit
    e879719a0c3fb
    
    I’m having some trouble with my test env, so if you wanted to give it a try,
    that would be great.
    If the “0.0.0.0” doesn’t work, then we should use
    "_local_", "_site_"
    that being the ES special values that mean approximately the same.
    
    I’m going to have to do trial-and-error to determine the exact behavior of
    multi-item lists, and then write the python code to strip redundant square
    brackets if included in the parameter value.
    Thanks,
    --Matt
    
    
    On 5/2/17, 6:44 AM, "Otto Fowler" <ottobackwa...@gmail.com> wrote:
    
    I am working on a centos 7 cluster deploy for testing the steps.
    I have this issue ( along with the wrong interface name ) and can test when
    you have it.
    
    An eta would help?
    
    
    On May 2, 2017 at 09:14:10, zeo...@gmail.com (zeo...@gmail.com) wrote:
    
    Are you working on this one? The JIRA doesn't look like it's currently
    assigned. Thanks,
    
    Jon
    
    On Mon, May 1, 2017 at 6:40 PM Matt Foley <mfo...@hortonworks.com> wrote:
    
    > Ah, I see I mis-read METRON-897, and Nick specifically says
    > "lo:ipv4","eth0:ipv4" did not work for him, but
    > ["_lo:ipv4_","_eth0:ipv4_"] did work.
    >
    > So I went back and dug a little deeper, and realized that in the
    > environment where "lo:ipv4","eth0:ipv4" worked for me, I had modified the
    > yaml.j2 template to include the square brackets.
    >
    > So the below theory is wrong. Back to the drawing board.
    > Thanks,
    > --Matt
    >
    > On 5/1/17, 3:08 PM, "Matt Foley" <ma...@apache.org> wrote:
    >
    > Hi, there have been widely varying statements about what needs to be
    > in the Elasticsearch config parameter “network_host”. I think I may have
    > a rationale for what works and what doesn’t, but I’d like your input or
    > correction.
    >
    > I am focusing on what worked in terms of punctuation (quotes and
    > square brackets) with the old _lo:ip4_,_eth0:ip4_. I would like to ignore
    > for the moment, please, whether eth0 was the correct name for a given
    > env, and whether we can use 0.0.0.0. Instead, for systems where eth0 WAS the
    > correct name, I’d like to understand what worked and why.
    >
    > It’s complicated because the value starts out in xml, is read into
    > python, printed by jinja, then consumed by yaml.
    >
    > I think there were two constructs that actually worked for this
    > param. Please say whether this is consistent or inconsistent with your
    > experience:
    >
    > "_lo:ip4_","_eth0:ip4_"
    > This worked for me. I think this was read from XML into python as a
    > list of strings, then output in jinja ‘print statement‘
    > {{ network_host }} as a python literal list with form:
    > [ "_lo:ip4_", "_eth0:ip4_" ]
    > In other words, the print statement for a python list object injected
    > the needed square brackets.
    >
    > and
    > "[ _lo:ip4_, _eth0:ip4_ ]"
    > Nick and Anand, please confirm if this is the form that worked for
    > you. I think this was read from XML into python as a single string, and
    > output in the same jinja print statement as:
    > [ _lo:ip4_, _eth0:ip4_ ]
    > because the print statement for a python string object does not
    > produce quote marks.
    >
    > In either case, yaml (the consumer of the jinja output) saw what it
    > interprets as a list of strings (since quotes are optional for yaml
    > strings).
    >
    > What didn’t work was:
    >
    > * "_lo:ip4_, _eth0:ip4_"
    > This would be read in and output as a single string, and no square
    > brackets would ever be introduced.
    >
    > * _lo:ip4_, _eth0:ip4_ or [ _lo:ip4_, _eth0:ip4_ ]
    > (without quotes) I think the unquoted colons messed up the python
    > parsing
    >
    > Finally, I don’t know whether
    > * [ "_lo:ip4_", "_eth0:ip4_" ]
    > worked or not, I’m not sure anyone ever tried it. By the above logic
    > it probably should work.
    >
    > Please give me your input if you have touched on these issues.
    > Thanks,
    > --Matt
    >
    >
    >
    >
    >
    >
    > --
    
    Jon
