Sounds good. Thanks.

On Wed, Oct 21, 2015 at 10:31 AM, Andrew Lee <[email protected]> wrote:
> Hi Xuefu,
>
> https://issues.apache.org/jira/browse/HIVE-12222 created.
>
> Please advise if the subject and the fields are appropriate, and feel free
> to update them to make it more standard for the community. I'll follow up
> in that JIRA ticket for discussion, thanks.
>
> ________________________________________
> From: Andrew Lee
> Sent: Wednesday, October 21, 2015 10:25 AM
> To: [email protected]
> Subject: Re: Hard Coded 0 to assign RPC Server port number when hive.execution.engine=spark
>
> Hi Xuefu,
>
> Thanks, I'll create a JIRA. By the way, since HiveCLI will be replaced by
> Beeline or another design later, I'm hoping the same philosophy can be
> considered if another CLI uses RpcServer as well, or shares the same
> source code at some point.
>
> Shall the Issue Type of the JIRA ticket be "Improvement" or "New Feature"?
>
> ________________________________________
> From: Xuefu Zhang <[email protected]>
> Sent: Tuesday, October 20, 2015 6:39 PM
> To: [email protected]
> Subject: Re: Hard Coded 0 to assign RPC Server port number when hive.execution.engine=spark
>
> Thanks, Andrew! You have a point. However, we're trying to sunset Hive CLI.
> In the meantime, I guess it doesn't hurt to give admins more control over
> the ports to be used. Please put your proposal in a JIRA and we can go
> from there.
>
> --Xuefu
>
> On Tue, Oct 20, 2015 at 7:54 AM, Andrew Lee <[email protected]> wrote:
>
> > Hi Xuefu,
> >
> > Two main reasons.
> >
> > - Most users (from what I see and encounter) use HiveCLI as a
> > command-line tool, and in order to use it, they need to log in to the
> > edge node (via SSH). Now here comes the interesting part. It may or may
> > not be true everywhere, but this is what I observe and encounter from
> > time to time: many users will abuse the resources on that edge node
> > (increasing HADOOP_HEAPSIZE, dumping output to local disk, running huge
> > Python workflows, etc.), and this may cause the HS2 process to run into
> > OOME, choke and die, etc., plus various resource issues, including
> > others like login, etc.
> >
> > - Analysts connect to Hive via HS2 + ODBC, so HS2 needs to be highly
> > available. This makes it sensible to run it on a gateway node or a
> > service node, separated from the HiveCLI. The logs are located in a
> > different location, and monitoring and auditing are easier when HS2
> > runs with a daemon user account, etc., so we don't want users to run
> > HiveCLI where HS2 is running. It's better to isolate the resources this
> > way to avoid any memory, file handle, or disk space issues.
> >
> > From a security standpoint:
> >
> > - Since users can log in to the edge node (via SSH), the security on
> > the edge node needs to be fortified and enhanced. Therefore, all the
> > firewall rules and auditing come in.
> >
> > - Regulation/compliance for auditing is another requirement to monitor
> > all traffic. Specifying ports and locking down the ports makes this
> > easier, since we can focus on a range to monitor and audit.
> >
> > Hope this explains the reason why we are asking for this feature.
> >
> > ________________________________________
> > From: Xuefu Zhang <[email protected]>
> > Sent: Monday, October 19, 2015 9:37 PM
> > To: [email protected]
> > Subject: Re: Hard Coded 0 to assign RPC Server port number when hive.execution.engine=spark
> >
> > Hi Andrew,
> >
> > I understand your policy on edge nodes. However, I'm wondering why you
> > cannot require that Hive CLI run only on gateway nodes, similar to HS2?
> > In essence, Hive CLI is a client with an embedded Hive server, so it
> > seems reasonable to have a similar requirement for it as for HS2.
> >
> > I'm not arguing against your request. Rather, I'm interested in the
> > rationale behind your policy.
> >
> > Thanks,
> > Xuefu
> >
> > On Mon, Oct 19, 2015 at 9:12 PM, Andrew Lee <[email protected]> wrote:
> >
> > > Hi Xuefu,
> > >
> > > I agree for HS2, since HS2 usually runs on a gateway or service node
> > > inside the cluster environment.
> > > In my case, it is actually additional security. A separate edge node
> > > (not running HS2; HS2 runs on another box) is used for HiveCLI. We
> > > don't allow data/worker nodes to talk to the edge node on random
> > > ports. All ports must be registered or explicitly specified and
> > > monitored. That's why I am asking for this feature. Otherwise, opening
> > > up 1024-65535 from the data/worker nodes to the edge node is actually
> > > a bad idea and bad practice for network security. :(
> > >
> > > ________________________________________
> > > From: Xuefu Zhang <[email protected]>
> > > Sent: Monday, October 19, 2015 1:12 PM
> > > To: [email protected]
> > > Subject: Re: Hard Coded 0 to assign RPC Server port number when hive.execution.engine=spark
> > >
> > > Hi Andrew,
> > >
> > > RpcServer is an instance launched for each user session. In the case
> > > of Hive CLI, which is for a single user, what you said makes sense and
> > > the port number can be configurable. In the context of HS2, however,
> > > there are multiple user sessions and the total is unknown in advance.
> > > While a +1 scheme works, there can still be a band of ports that might
> > > eventually be opened.
> > >
> > > From a different perspective, we expect that either Hive CLI or HS2
> > > resides on a gateway node, which is in the same network as the
> > > data/worker nodes. In this configuration, the firewall issue you
> > > mentioned doesn't apply. Such a configuration is what we usually see
> > > with our enterprise customers, and it is what we recommend. I'm not
> > > sure why you would want your Hive users to launch Hive CLI anywhere
> > > outside your cluster, which doesn't seem secure if security is your
> > > concern.
> > >
> > > Thanks,
> > > Xuefu
> > >
> > > On Mon, Oct 19, 2015 at 7:20 AM, Andrew Lee <[email protected]> wrote:
> > >
> > > > Hi All,
> > > >
> > > > I notice that in
> > > >
> > > > ./spark-client/src/main/java/org/apache/hive/spark/client/rpc/RpcServer.java
> > > >
> > > > the port number is assigned with 0, which means it will be a random
> > > > port every time the RPC server is created to talk to Spark in the
> > > > same session.
> > > >
> > > > Any reason why this port number is not a property to be configured,
> > > > following the same rule of +1 if the port is taken, just like
> > > > Spark's configuration for the Spark driver, etc.? Because of this,
> > > > it is causing problems when configuring a firewall between the
> > > > HiveCLI RPC server and Spark, due to the unpredictable port numbers
> > > > here. In other words, users need to open the whole range of Hive
> > > > ports from the data nodes => HiveCLI (edge node).
> > > >
> > > >     this.channel = new ServerBootstrap()
> > > >         .group(group)
> > > >         .channel(NioServerSocketChannel.class)
> > > >         .childHandler(new ChannelInitializer<SocketChannel>() {
> > > >           @Override
> > > >           public void initChannel(SocketChannel ch) throws Exception {
> > > >             SaslServerHandler saslHandler = new SaslServerHandler(config);
> > > >             final Rpc newRpc = Rpc.createServer(saslHandler, config, ch, group);
> > > >             saslHandler.rpc = newRpc;
> > > >
> > > >             Runnable cancelTask = new Runnable() {
> > > >               @Override
> > > >               public void run() {
> > > >                 LOG.warn("Timed out waiting for hello from client.");
> > > >                 newRpc.close();
> > > >               }
> > > >             };
> > > >             saslHandler.cancelTask = group.schedule(cancelTask,
> > > >                 RpcServer.this.config.getServerConnectTimeoutMs(),
> > > >                 TimeUnit.MILLISECONDS);
> > > >           }
> > > >         })
> > > >         .option(ChannelOption.SO_BACKLOG, 1)
> > > >         .option(ChannelOption.SO_REUSEADDR, true)
> > > >         .childOption(ChannelOption.SO_KEEPALIVE, true)
> > > >         .bind(0)
> > > >         .sync()
> > > >         .channel();
> > > >     this.port = ((InetSocketAddress) channel.localAddress()).getPort();
> > > >
> > > > Appreciate any feedback, and whether a JIRA is required to keep
> > > > track of this conversation. Thanks.
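[Editor's note] The "+1 if the port is taken" scheme proposed in the thread can be sketched in plain Java. This is a minimal illustration using `java.net.ServerSocket` rather than Hive's Netty-based `RpcServer`; the names `bindWithRetry`, `basePort`, and `maxRetries` are hypothetical and are not actual Hive or Spark configuration properties.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

// Hypothetical sketch of binding to a configured base port and falling
// back to port+1 when the port is taken, instead of bind(0), which asks
// the OS for a random ephemeral port.
public class PortRetryBind {

    static ServerSocket bindWithRetry(int basePort, int maxRetries) throws IOException {
        IOException lastFailure = null;
        for (int port = basePort; port <= basePort + maxRetries; port++) {
            try {
                ServerSocket server = new ServerSocket();
                server.setReuseAddress(true);
                server.bind(new InetSocketAddress(port));
                return server; // bound inside the admin-approved range
            } catch (IOException e) {
                lastFailure = e; // port taken; try the next one
            }
        }
        throw lastFailure; // the whole range was exhausted
    }

    public static void main(String[] args) throws IOException {
        // Occupy the base port, then show the fallback landing on a later port.
        ServerSocket taken = new ServerSocket();
        taken.setReuseAddress(true);
        taken.bind(new InetSocketAddress(35601));

        ServerSocket bound = bindWithRetry(35601, 10);
        System.out.println("bound to port " + bound.getLocalPort());

        bound.close();
        taken.close();
    }
}
```

With a scheme like this, a firewall only needs to allow `basePort` through `basePort + maxRetries` between the data/worker nodes and the edge node, instead of the entire ephemeral range, which is the operational concern raised in the thread.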
