[jira] [Created] (HIVE-26496) FetchOperator scans delete_delta folders multiple times causing slowness

2022-08-24 Thread Rajesh Balamohan (Jira)
Rajesh Balamohan created HIVE-26496:
---

 Summary: FetchOperator scans delete_delta folders multiple times 
causing slowness
 Key: HIVE-26496
 URL: https://issues.apache.org/jira/browse/HIVE-26496
 Project: Hive
  Issue Type: Bug
  Components: HiveServer2
Reporter: Rajesh Balamohan


FetchOperator scans far more files/directories than needed.

For example, here is the layout of a table that had a set of updates and deletes. 
A set of "delta" and "delete_delta" folders has been created.
{noformat}
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/base_001
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_002_002_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_003_003_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_004_004_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_005_005_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_006_006_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_007_007_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_008_008_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_009_009_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_010_010_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_011_011_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_012_012_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_013_013_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_014_014_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_015_015_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_016_016_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_017_017_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_018_018_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_019_019_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_020_020_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_021_021_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delete_delta_022_022_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_002_002_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_003_003_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_004_004_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_005_005_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_006_006_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_007_007_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_008_008_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_009_009_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_010_010_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_011_011_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_012_012_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_013_013_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_014_014_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_015_015_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_016_016_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_017_017_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_018_018_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_019_019_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_020_020_
s3a://bucket-name/warehouse/tablespace/managed/hive/test.db/date_dim/delta_021_021_

{noformat}
 

When a user runs *{color:#0747a6}{{select * from date_dim}}{color}* from Beeline, 
FetchOperator tries to compute splits in "date_dim". It lists the "base" and "delta" 
folders and computes 2
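
A minimal sketch of the general idea, assuming the extra scans come from re-listing 
the same table directory for every split computation (the class and method here are 
hypothetical, not the actual FetchOperator code):
{code:java}
// Hypothetical illustration: cache each directory listing so delta/delete_delta
// folders are listed at most once per fetch, instead of once per split computation.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CachedDirListing {
  private final Map<Path, FileStatus[]> cache = new HashMap<>();

  // Returns the cached listing if the directory was already scanned,
  // otherwise performs a single remote listStatus call and caches it.
  FileStatus[] list(FileSystem fs, Path dir) throws IOException {
    FileStatus[] statuses = cache.get(dir);
    if (statuses == null) {
      statuses = fs.listStatus(dir);
      cache.put(dir, statuses);
    }
    return statuses;
  }
}
{code}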

[jira] [Created] (HIVE-26495) MSCK repair perf issue HMSChecker ThreadPool is blocked at fs.listStatus

2022-08-24 Thread Naresh P R (Jira)
Naresh P R created HIVE-26495:
-

 Summary: MSCK repair perf issue HMSChecker ThreadPool is blocked 
at fs.listStatus
 Key: HIVE-26495
 URL: https://issues.apache.org/jira/browse/HIVE-26495
 Project: Hive
  Issue Type: New Feature
Reporter: Naresh P R
Assignee: Naresh P R


With hive.metastore.fshandler.threads = 15, all 15 *MSCK-GetPaths-xx* threads are 
blocked at the following trace.
{code:java}
"MSCK-GetPaths-11" #12345 daemon prio=5 os_prio=0 tid= nid= waiting on 
condition [0x7f9f099a6000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x0003f92d1668> (a 
java.util.concurrent.CompletableFuture$Signaller)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at 
java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
    at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
...
at org.apache.hadoop.fs.s3a.S3AFileSystem.listStatus(S3AFileSystem.java:3230)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1953)
    at org.apache.hadoop.fs.FileSystem.listStatus(FileSystem.java:1995)
    at 
org.apache.hadoop.hive.metastore.HiveMetaStoreChecker$PathDepthInfoCallable.processPathDepthInfo(HiveMetaStoreChecker.java:550)
    at 
org.apache.hadoop.hive.metastore.HiveMetaStoreChecker$PathDepthInfoCallable.call(HiveMetaStoreChecker.java:543)
    at 
org.apache.hadoop.hive.metastore.HiveMetaStoreChecker$PathDepthInfoCallable.call(HiveMetaStoreChecker.java:525)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750){code}
We should take advantage of the non-blocking listStatusIterator instead of 
listStatus, which is a blocking call.
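
A minimal sketch of what that could look like, with hypothetical class and method 
names rather than the actual HiveMetaStoreChecker code:
{code:java}
// Hypothetical sketch: iterate directory entries incrementally with
// listStatusIterator instead of materializing the whole listing up front
// with the blocking listStatus call.
import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class IncrementalListingSketch {
  static void processDirectory(FileSystem fs, Path dir) throws IOException {
    // Entries are consumed as they are returned, rather than only after the
    // entire directory has been listed.
    RemoteIterator<FileStatus> it = fs.listStatusIterator(dir);
    while (it.hasNext()) {
      FileStatus status = it.next();
      if (status.isDirectory()) {
        // e.g. enqueue the sub-directory for the next depth level
      } else {
        // e.g. record the file for the MSCK comparison
      }
    }
  }
}
{code}
With the S3A connector, results should then be consumed page by page as the listing 
progresses, instead of parking the worker thread until the full listing returns.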



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Should we consider Spark3 support for Hive on Spark

2022-08-24 Thread Owen O'Malley
Hive on Spark is not recommended. The recommended path is to use either Tez
or LLAP. If you are already using Spark 3, it would be far easier to use
Spark SQL.

.. Owen

On Wed, Aug 24, 2022 at 3:46 AM Fred Bai  wrote:

> Hi everyone:
>
> Do we still have any support for Hive on Spark? I need Hive on Spark, but my
> Spark version is 3.x.
>
> I found that Hive is incompatible with Spark 3, and I had to modify a lot of
> code to make it compatible.
>
> Has Hive on Spark been deprecated?
>
> Also, Hive on Spark is very slow when the job executes.
>


Re: gRPC Support in Hive Metastore

2022-08-24 Thread Rascal Wu
It's nice to hear about gRPC support in Hive Metastore again. Thanks for
contributing back to the community.

It seems that there is an old design doc about gRPC support in Hive
Metastore:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=158869886
Could we continue from this doc?


Best Regards,
Rascal Wu

Stamatis Zampetakis  于2022年8月24日周三 21:24写道:

> Hi Rohan and team,
>
> The work sounds exciting; thanks for considering contributing back to the
> community.
>
> The design document didn't arrive because attachments are not allowed on
> many Apache lists.
>
> Maybe as a first step it would be nice to share a link to a Google doc
> where people can add comments and possibly provide some feedback on it.
> Then, I guess it makes sense to put it in the wiki
> under the respective section (design documents) and/or upload it to the
> JIRA case.
>
> I am not very familiar with the area, so I am not sure I can help much in pushing
> this forward, but I am definitely interested in learning more about this work.
>
> Best,
> Stamatis
>
> On Tue, Aug 23, 2022, 2:05 AM Cameron Moberg 
> wrote:
>
> >
> > *Sending on behalf of Rohan where policies don't allow sending outside of
> > our domain for interns:*
> > Hello -
> >
> > During my internship I’ve been working on gRPC native support in the
> > standalone hive metastore as it comes with a variety of benefits. As a
> > proof of concept, my team, Dataproc Metastore on GCP currently uses a
> > client side proxy to translate Thrift requests to gRPC coupled with a
> > server side proxy to translate the gRPC requests back to Thrift. The
> > process is repeated in reverse to deliver the server response to the
> > client. While this approach has been successful, native gRPC support has
> > several cloud-centric advantages over the current configuration:
> >
> >- enables streaming support
> >- allows for native integrations in Hive ecosystem for various query
> >engines like Impala, Spark SQL, and Trino to take advantage of
> streaming
> >(eventually)
> >- has support for custom interceptors for more fine-grained control
> >over the server action
> >- built on HTTP/2 protocol
> >
> > I’ve opened a PR here  (just
> > fyi, no rush), some background – this proto3 definition has been
> refactored
> > to take a MethodNameRequest and MethodNameResponse to stop any future
> > backwards incompatibilities. Unfortunately, the other metastore.proto
> which
> > has SplitInfo uses a `required` field setting, which makes upgrading it
> not
> > feasible since moving away from `required` will change the SerDe of
> proto,
> > potentially a breaking change depending on clients.
> > While this is the last week of my internship, my hosts
> cjmob...@google.com
> >  and hchinch...@google.com will continue to develop in this area with
> > further implementation building on the proto.
> >
> > Attached is the full design doc. I’m not sure how I’m supposed to share
> > documents like this, so I can re-upload it somewhere or convert it to the
> > wiki.
> >
> > Comments are of course appreciated!
> >
> > Thank you,
> > Rohan Sonecha
> >
>


Re: gRPC Support in Hive Metastore

2022-08-24 Thread Stamatis Zampetakis
Hi Rohan and team,

The work sounds exciting; thanks for considering contributing back to the
community.

The design document didn't arrive because attachments are not allowed on
many Apache lists.

Maybe as a first step it would be nice to share a link to a Google doc
where people can add comments and possibly provide some feedback on it.
Then, I guess it makes sense to put it in the wiki
under the respective section (design documents) and/or upload it to the
JIRA case.

I am not very familiar with the area, so I am not sure I can help much in pushing
this forward, but I am definitely interested in learning more about this work.

Best,
Stamatis

On Tue, Aug 23, 2022, 2:05 AM Cameron Moberg 
wrote:

>
> *Sending on behalf of Rohan where policies don't allow sending outside of
> our domain for interns:*
> Hello -
>
> During my internship I’ve been working on gRPC native support in the
> standalone hive metastore as it comes with a variety of benefits. As a
> proof of concept, my team, Dataproc Metastore on GCP currently uses a
> client side proxy to translate Thrift requests to gRPC coupled with a
> server side proxy to translate the gRPC requests back to Thrift. The
> process is repeated in reverse to deliver the server response to the
> client. While this approach has been successful, native gRPC support has
> several cloud-centric advantages over the current configuration:
>
>- enables streaming support
>- allows for native integrations in Hive ecosystem for various query
>engines like Impala, Spark SQL, and Trino to take advantage of streaming
>(eventually)
>- has support for custom interceptors for more fine-grained control
>over the server action
>- built on HTTP/2 protocol
>
> I’ve opened a PR here  (just
> fyi, no rush), some background – this proto3 definition has been refactored
> to take a MethodNameRequest and MethodNameResponse to stop any future
> backwards incompatibilities. Unfortunately, the other metastore.proto which
> has SplitInfo uses a `required` field setting, which makes upgrading it not
> feasible since moving away from `required` will change the SerDe of proto,
> potentially a breaking change depending on clients.
> While this is the last week of my internship, my hosts cjmob...@google.com
>  and hchinch...@google.com will continue to develop in this area with
> further implementation building on the proto.
>
> Attached is the full design doc. I’m not sure how I’m supposed to share
> documents like this, so I can re-upload it somewhere or convert it to the
> wiki.
>
> Comments are of course appreciated!
>
> Thank you,
> Rohan Sonecha
>


[jira] [Created] (HIVE-26494) Fix flaky test TestJdbcWithMiniHS2 testHttpRetryOnServerIdleTimeout

2022-08-24 Thread Zhihua Deng (Jira)
Zhihua Deng created HIVE-26494:
--

 Summary: Fix flaky test TestJdbcWithMiniHS2 
testHttpRetryOnServerIdleTimeout
 Key: HIVE-26494
 URL: https://issues.apache.org/jira/browse/HIVE-26494
 Project: Hive
  Issue Type: Test
Reporter: Zhihua Deng


The TestJdbcWithMiniHS2#testHttpRetryOnServerIdleTimeout fails on master:

[http://ci.hive.apache.org/blue/organizations/jenkins/hive-precommit/detail/master/1362/tests]

It can be fixed by setting hive.server2.thrift.http.max.idle.time to a value 
larger than 5ms.
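
A minimal sketch of that kind of fix, assuming the test builds its server 
configuration from a HiveConf instance (the class name and the 30-second value are 
illustrative only, not the actual test code):
{code:java}
// Hypothetical sketch: raise the HTTP idle timeout used by the MiniHS2-based test
// so the server does not drop idle connections after only 5 ms between retries.
// The config key comes from the issue description; the value is an example.
import org.apache.hadoop.hive.conf.HiveConf;

public class IdleTimeoutConfSketch {
  static HiveConf buildConf() {
    HiveConf conf = new HiveConf();
    // Previously set to 5ms in the flaky test; a larger value avoids
    // spurious connection drops while the retry logic is exercised.
    conf.set("hive.server2.thrift.http.max.idle.time", "30s");
    return conf;
  }
}
{code}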

Flaky check: http://ci.hive.apache.org/job/hive-flaky-check/585/



--
This message was sent by Atlassian Jira
(v8.20.10#820010)