Sounds good. Thanks.

On Wed, Oct 21, 2015 at 10:31 AM, Andrew Lee <[email protected]> wrote:
> Hi Xuefu,
>
> https://issues.apache.org/jira/browse/HIVE-12222 created.
>
> Please advise if the subject and the fields are appropriate, and feel free
> to update them to make it more standard for the community. I'll follow up
> in that JIRA ticket for discussion, thanks.
>
> ________________________________________
> From: Andrew Lee
> Sent: Wednesday, October 21, 2015 10:25 AM
> To: [email protected]
> Subject: Re: Hard Coded 0 to assign RPC Server port number when hive.execution.engine=spark
>
> Hi Xuefu,
>
> Thanks, I'll create a JIRA. By the way, since HiveCLI will be replaced by
> Beeline or another design later, I'm hoping the same philosophy can be
> considered if another CLI uses RpcServer as well, or shares the same
> source code at some point.
>
> Shall the Issue Type of the JIRA ticket be "Improvement" or "New Feature"?
>
> ________________________________________
> From: Xuefu Zhang <[email protected]>
> Sent: Tuesday, October 20, 2015 6:39 PM
> To: [email protected]
> Subject: Re: Hard Coded 0 to assign RPC Server port number when hive.execution.engine=spark
>
> Thanks, Andrew! You have a point. However, we're trying to sunset Hive CLI.
> In the meantime, I guess it doesn't hurt to give admins more control over
> the ports to be used. Please put your proposal in a JIRA and we can go
> from there.
>
> --Xuefu
>
> On Tue, Oct 20, 2015 at 7:54 AM, Andrew Lee <[email protected]> wrote:
>
> > Hi Xuefu,
> >
> > Two main reasons.
> >
> > - Most users (from what I see and encounter) use HiveCLI as a
> > command-line tool, and in order to use it, they need to log in to the
> > edge node (via SSH). Now here comes the interesting part. It may or may
> > not be true everywhere, but this is what I observe and encounter from
> > time to time: many users will abuse the resources on that edge node
> > (increasing HADOOP_HEAPSIZE, dumping output to local disk, running huge
> > Python workflows, etc.), and this may cause the HS2 process to run into
> > OOME, choke and die, etc., plus various resource issues, including
> > others like login, etc.
> >
> > - Analysts connect to Hive via HS2 + ODBC, so HS2 needs to be highly
> > available. This makes it sensible to run it on a gateway node or a
> > service node, separated from the HiveCLI. The logs are located in a
> > different location, and monitoring and auditing are easier when HS2
> > runs with a daemon user account, etc., so we don't want users to run
> > HiveCLI where HS2 is running. It's better to isolate the resources this
> > way to avoid any memory, file handle, or disk space issues.
> >
> > From a security standpoint:
> >
> > - Since users can log in to the edge node (via SSH), the security on
> > the edge node needs to be fortified and enhanced. Therefore, all the
> > firewall rules and auditing come in.
> >
> > - Regulation/compliance for auditing is another requirement to monitor
> > all traffic. Specifying ports and locking down the ports makes this
> > easier, since we can focus on a range to monitor and audit.
> >
> > Hope this explains the reason why we are asking for this feature.
> >
> > ________________________________________
> > From: Xuefu Zhang <[email protected]>
> > Sent: Monday, October 19, 2015 9:37 PM
> > To: [email protected]
> > Subject: Re: Hard Coded 0 to assign RPC Server port number when hive.execution.engine=spark
> >
> > Hi Andrew,
> >
> > I understand your policy on edge nodes. However, I'm wondering why you
> > cannot require that Hive CLI run only on gateway nodes, similar to HS2?
> > In essence, Hive CLI is a client with an embedded Hive server, so it
> > seems reasonable to have a similar requirement for it as for HS2.
> >
> > I'm not arguing against your request. Rather, I'm interested in the
> > rationale behind your policy.
> >
> > Thanks,
> > Xuefu
> >
> > On Mon, Oct 19, 2015 at 9:12 PM, Andrew Lee <[email protected]> wrote:
> >
> > > Hi Xuefu,
> > >
> > > I agree for HS2, since HS2 usually runs on a gateway or service node
> > > inside the cluster environment.
> > > In my case, it is actually additional security. A separate edge node
> > > (not running HS2; HS2 runs on another box) is used for HiveCLI. We
> > > don't allow data/worker nodes to talk to the edge node on random
> > > ports. All ports must be registered or explicitly specified and
> > > monitored. That's why I am asking for this feature. Otherwise, opening
> > > up 1024-65535 from the data/worker nodes to the edge node is actually
> > > a bad idea and bad practice for network security. :(
> > >
> > > ________________________________________
> > > From: Xuefu Zhang <[email protected]>
> > > Sent: Monday, October 19, 2015 1:12 PM
> > > To: [email protected]
> > > Subject: Re: Hard Coded 0 to assign RPC Server port number when hive.execution.engine=spark
> > >
> > > Hi Andrew,
> > >
> > > RpcServer is an instance launched for each user session. In the case
> > > of Hive CLI, which is for a single user, what you said makes sense and
> > > the port number can be configurable. In the context of HS2, however,
> > > there are multiple user sessions and the total is unknown in advance.
> > > While a +1 scheme works, there can still be a band of ports that might
> > > eventually be opened.
> > >
> > > From a different perspective, we expect that either Hive CLI or HS2
> > > resides on a gateway node, which is in the same network as the
> > > data/worker nodes. In this configuration, the firewall issue you
> > > mentioned doesn't apply. Such a configuration is what we usually see
> > > with our enterprise customers, and it is what we recommend. I'm not
> > > sure why you would want your Hive users to launch Hive CLI anywhere
> > > outside your cluster, which doesn't seem secure if security is your
> > > concern.
> > >
> > > Thanks,
> > > Xuefu
> > >
> > > On Mon, Oct 19, 2015 at 7:20 AM, Andrew Lee <[email protected]> wrote:
> > >
> > > > Hi All,
> > > >
> > > > I notice that in
> > > >
> > > > ./spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java
> > > >
> > > > the port number is assigned with 0, which means it will be a random
> > > > port every time the RPC server is created to talk to Spark in the
> > > > same session.
> > > >
> > > > Any reason why this port number is not a property to be configured,
> > > > following the same rule of +1 if the port is taken, just like
> > > > Spark's configuration for the Spark driver, etc.? Because of this,
> > > > it is causing problems when configuring a firewall between the
> > > > HiveCLI RPC server and Spark, due to the unpredictable port numbers
> > > > here. In other words, users need to open the whole range of Hive
> > > > ports from the data nodes => HiveCLI (edge node).
> > > >
> > > >     this.channel = new ServerBootstrap()
> > > >         .group(group)
> > > >         .channel(NioServerSocketChannel.class)
> > > >         .childHandler(new ChannelInitializer<SocketChannel>() {
> > > >           @Override
> > > >           public void initChannel(SocketChannel ch) throws Exception {
> > > >             SaslServerHandler saslHandler = new SaslServerHandler(config);
> > > >             final Rpc newRpc = Rpc.createServer(saslHandler, config, ch, group);
> > > >             saslHandler.rpc = newRpc;
> > > >
> > > >             Runnable cancelTask = new Runnable() {
> > > >               @Override
> > > >               public void run() {
> > > >                 LOG.warn("Timed out waiting for hello from client.");
> > > >                 newRpc.close();
> > > >               }
> > > >             };
> > > >             saslHandler.cancelTask = group.schedule(cancelTask,
> > > >                 RpcServer.this.config.getServerConnectTimeoutMs(),
> > > >                 TimeUnit.MILLISECONDS);
> > > >           }
> > > >         })
> > > >         .option(ChannelOption.SO_BACKLOG, 1)
> > > >         .option(ChannelOption.SO_REUSEADDR, true)
> > > >         .childOption(ChannelOption.SO_KEEPALIVE, true)
> > > >         .bind(0)
> > > >         .sync()
> > > >         .channel();
> > > >     this.port = ((InetSocketAddress) channel.localAddress()).getPort();
> > > >
> > > > Appreciate any feedback, and whether a JIRA is required to keep
> > > > track of this conversation. Thanks.
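[Editor's note] The "+1 if the port is taken" scheme proposed in the thread can be sketched in plain Java. This is a minimal illustration using `java.net.ServerSocket` rather than Hive's Netty-based `RpcServer`; the names `bindWithRetry`, `basePort`, and `maxRetries` are hypothetical and are not actual Hive or Spark configuration properties.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical sketch of binding to a configured base port and falling
// back to port+1 when the port is taken, instead of bind(0), which asks
// the OS for a random ephemeral port.
public class PortRetryBind {

    static ServerSocket bindWithRetry(int basePort, int maxRetries) throws IOException {
        IOException lastFailure = null;
        for (int port = basePort; port <= basePort + maxRetries; port++) {
            try {
                ServerSocket server = new ServerSocket();
                server.setReuseAddress(true);
                server.bind(new InetSocketAddress(port));
                return server; // bound inside the admin-approved range
            } catch (IOException e) {
                lastFailure = e; // port taken; try the next one
            }
        }
        throw lastFailure; // the whole range was exhausted
    }

    public static void main(String[] args) throws IOException {
        // Occupy the base port, then show the fallback landing on a later port.
        ServerSocket taken = new ServerSocket();
        taken.setReuseAddress(true);
        taken.bind(new InetSocketAddress(35601));

        ServerSocket bound = bindWithRetry(35601, 10);
        System.out.println("bound to port " + bound.getLocalPort());

        bound.close();
        taken.close();
    }
}
```

With a scheme like this, a firewall only needs to allow `basePort` through `basePort + maxRetries` between the data/worker nodes and the edge node, instead of the entire ephemeral range, which is the operational concern raised in the thread.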
