Re: CI Jenkins upgrade

2024-07-10 Thread Okumin
Hi Stamatis,

Thanks for the update. The existence of a reliable CI system is one of
the biggest reasons why we prefer to always send our patches to the
upstream and keep using as new a version as possible.

Regards,
Okumin

On Thu, Jul 4, 2024 at 7:15 PM Stamatis Zampetakis  wrote:
>
> The upgrade completed successfully and https://ci.hive.apache.org/ is
> now running with Jenkins 2.452.2. Please let me know if you encounter
> any issues.
>
> Best,
> Stamatis
>
> On Thu, Jul 4, 2024 at 11:08 AM Stamatis Zampetakis  wrote:
> >
> > The upgrade will start in a few minutes. To avoid erratic behavior
> > please avoid doing Jenkins related operations during the upgrade
> > interval.
> >
> > I will send another email once the upgrade is done.
> >
> > Best,
> > Stamatis
> >
> > On Wed, Jul 3, 2024 at 2:31 PM Stamatis Zampetakis  
> > wrote:
> > >
> > > Hi all,
> > >
> > > I plan to proceed with the Jenkins upgrade tomorrow July 4, 2024
> > > starting at 09:00 UTC.
> > >
> > > The https://ci.hive.apache.org/ will be unavailable for at least a few
> > > minutes and if all goes well, less than an hour.
> > >
> > > If there are any reasons to postpone the upgrade please let me know.
> > >
> > > Best,
> > > Stamatis
> > >
> > > On Tue, Jun 25, 2024 at 10:31 AM Stamatis Zampetakis  
> > > wrote:
> > > >
> > > > Hey team,
> > > >
> > > > As part of HIVE-28339, we are planning to upgrade the Jenkins version
> > > > that is used to run our pre-commit tests.
> > > >
> > > > The Jenkins instance is running on a Kubernetes cluster deployed on
> > > > Google Cloud.
> > > >
> > > > I am willing to push HIVE-28339 forward but I don't have any
> > > > experience in Jenkins upgrades. If someone else has done this in the
> > > > past and has any insights about the process feel free to comment under
> > > > the ticket otherwise I will hone my devops skills :)
> > > >
> > > > Probably there will be some downtime during the upgrade but we will
> > > > share more about this aspect once we have a more concrete plan.
> > > >
> > > > Best,
> > > > Stamatis


Re: Re: Next Hive 4.0.1 minor release

2024-06-26 Thread Okumin
Hi,

I'm posting some more suggestions.

# We may label HIVE-28352 as "hive-4.0.1-must"
I found that users were unable to upgrade some system tables from Hive
3 through schematool. It would be better to fix.
https://issues.apache.org/jira/browse/HIVE-28352

# We may drop HIVE-24167
I am the ticket owner, and we are struggling to figure out how to
resolve the problem. We can potentially skip it as branch-4.0 includes
the following workaround.
https://issues.apache.org/jira/browse/HIVE-27856

Thanks,
Okumin

On Wed, May 15, 2024 at 5:18 PM dengzhhu653  wrote:
>
>
> Thank you for your feedback and testing, Okumin! I've tagged them with 
> "hive-4.0.1-must".
>
>
> Thanks,
>
> Zhihua
>
> At 2024-05-15 10:43:59, "Okumin"  wrote:
> >Hi Zhihua,
> >
> >Thanks for driving the next release. We are actively testing 4.0.0 and
> >would like to give some suggestions.
> >
> ># HIVE-27847: Prevent query Failures on Numeric <-> Timestamp
> >We hit the issue when we ran Hive 4 with the option. I believe it is
> >worth resolving for those who want to try Hive 4, keeping
> >compatibilities with a previous version.
> >https://issues.apache.org/jira/browse/HIVE-27847
> >
> ># HIVE-28098: Fails to copy empty column statistics of materialized CTE
> >This follows up on HIVE-28080, but the current 4.0.0 includes only
> >HIVE-28080. The reasonable option to me is to revert HIVE-28080 or
> >cherry-pick HIVE-28098, all or nothing.
> >https://issues.apache.org/jira/browse/HIVE-28098
> >
> >Thanks,
> >Okumin
> >
> >On Sat, May 11, 2024 at 9:45 AM dengzhhu653  wrote:
> >>
> >> Hello Community,
> >>
> >>
> >> As you have noticed, we are going to propose the next 4.0.1 release on top 
> >> of 4.0.0, with some
> >>
> >> critical bug fixes and improvements [1]. As of now we are putting the 
> >> label "hive-4.0.1-must" on the tickets
> >>
> >> and we plan to make sure those get c-picked to branch-4.0 [2]. Please 
> >> suggest other important fixes that can be
> >>
> >> included in this release if any.
> >>
> >>
> >> We will get this minor release out as soon as possible once all the 
> >> tickets marked with "hive-4.0.1-must" get resolved and tested.
> >>
> >>
> >> [1] https://lists.apache.org/thread/rkw2toj5d74t8n5jvnkrfw77hyzn7qh3
> >>
> >> [2] 
> >> https://issues.apache.org/jira/browse/HIVE-28204?jql=labels%20%3D%20hive-4.0.1-must
> >>
> >>
> >> Thanks,
> >>
> >> Zhihua


Re: Next Hive 4.0.1 minor release

2024-05-14 Thread Okumin
Hi Zhihua,

Thanks for driving the next release. We are actively testing 4.0.0 and
would like to give some suggestions.

# HIVE-27847: Prevent query Failures on Numeric <-> Timestamp
We hit the issue when we ran Hive 4 with the option. I believe it is
worth resolving for those who want to try Hive 4, keeping
compatibilities with a previous version.
https://issues.apache.org/jira/browse/HIVE-27847

# HIVE-28098: Fails to copy empty column statistics of materialized CTE
This follows up on HIVE-28080, but the current 4.0.0 includes only
HIVE-28080. The reasonable option to me is to revert HIVE-28080 or
cherry-pick HIVE-28098, all or nothing.
https://issues.apache.org/jira/browse/HIVE-28098

Thanks,
Okumin

On Sat, May 11, 2024 at 9:45 AM dengzhhu653  wrote:
>
> Hello Community,
>
>
> As you have noticed, we are going to propose the next 4.0.1 release on top of 
> 4.0.0, with some
>
> critical bug fixes and improvements [1]. As of now we are putting the label 
> "hive-4.0.1-must" on the tickets
>
> and we plan to make sure those get c-picked to branch-4.0 [2]. Please suggest 
> other important fixes that can be
>
> included in this release if any.
>
>
> We will get this minor release out as soon as possible once all the tickets 
> marked with "hive-4.0.1-must" get resolved and tested.
>
>
> [1] https://lists.apache.org/thread/rkw2toj5d74t8n5jvnkrfw77hyzn7qh3
>
> [2] 
> https://issues.apache.org/jira/browse/HIVE-28204?jql=labels%20%3D%20hive-4.0.1-must
>
>
> Thanks,
>
> Zhihua


Re: [VOTE] Mark Hive 2.x EOL

2024-05-13 Thread Okumin
+1 (non-binding)

I appreciate the community's efforts in maintaining 2.x for so long.

Thanks,
Okumin

On Sat, May 11, 2024 at 1:57 AM Abhishek Gupta  wrote:
>
> Unsubscribe
>
> On Fri, 10 May 2024 at 10:26 PM, Aman Sinha  wrote:
>>
>> +1 (non-binding)
>>
>> On Fri, May 10, 2024 at 7:57 AM Mahesh Raju Somalaraju 
>>  wrote:
>>>
>>> +1(non-binding)
>>>
>>> Thanks
>>> Mahesh Raju S
>>>
>>> On Fri, 10 May 2024, 06:15 Ayush Saxena,  wrote:
>>>>
>>>> Hi All,
>>>> Following the discussion at [1]. Starting the official vote thread to
>>>> mark Hive 2.x release line as EOL.
>>>>
>>>> Marking a release lines as EOL means there won't be any further
>>>> release made for that release line
>>>>
>>>> I will start with my +1
>>>>
>>>> -Ayush
>>>>
>>>>
>>>> [1] https://lists.apache.org/thread/91wk3oy1qo953md7941ojg2q97ofsl2d


Re: [ANNOUNCE] Apache Hive 4.0.0 Released

2024-04-03 Thread Okumin
I'm really excited to see the news! I can easily imagine the
difficulty of testing and shipping Hive 4.0.0 with more than 5k
commits. I'm proud to have witnessed this moment here.

Thank you!

On Wed, Apr 3, 2024 at 3:07 AM Naveen Gangam  wrote:
>
> Thank you for the tremendous amount of work put in by many many folks to make 
> this release happen, including projects hive is dependent upon like tez.
>
> Thank you to all the PMC members, committers and contributors for all the 
> work over the past 5+ years in shaping this release.
>
> THANK YOU!!!
>
> On Sun, Mar 31, 2024 at 8:54 AM Battula, Brahma Reddy  
> wrote:
>>
>> Thank you for your hard work and dedication in releasing Apache Hive version 
>> 4.0.0.
>>
>>
>>
>> Congratulations to the entire team on this achievement. Keep up the great 
>> work!
>>
>>
>>
>> Does this consider as GA.?
>>
>>
>>
>> And Looks we need to update in the following location also.?
>>
>> https://hive.apache.org/general/downloads/
>>
>>
>>
>>
>>
>> From: Denys Kuzmenko 
>> Date: Saturday, March 30, 2024 at 00:07
>> To: u...@hive.apache.org , dev@hive.apache.org 
>> 
>> Subject: [ANNOUNCE] Apache Hive 4.0.0 Released
>>
>> The Apache Hive team is proud to announce the release of Apache Hive
>>
>> version 4.0.0.
>>
>>
>>
>> The Apache Hive (TM) data warehouse software facilitates querying and
>>
>> managing large datasets residing in distributed storage. Built on top
>>
>> of Apache Hadoop (TM), it provides, among others:
>>
>>
>>
>> * Tools to enable easy data extract/transform/load (ETL)
>>
>>
>>
>> * A mechanism to impose structure on a variety of data formats
>>
>>
>>
>> * Access to files stored either directly in Apache HDFS (TM) or in other
>>
>>   data storage systems such as Apache HBase (TM)
>>
>>
>>
>> * Query execution via Apache Hadoop MapReduce, Apache Tez and Apache Spark 
>> frameworks. (MapReduce is deprecated, and Spark has been removed so the text 
>> needs to be modified depending on the release version)
>>
>>
>>
>> For Hive release details and downloads, please visit:
>>
>> https://hive.apache.org/downloads.html
>>
>>
>>
>> Hive 4.0.0 Release Notes are available here:
>>
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12343343=Text=12310843
>>
>>
>>
>> We would like to thank the many contributors who made this release
>>
>> possible.
>>
>>
>>
>> Regards,
>>
>>
>>
>> The Apache Hive Team


Re: Release of Hive 4 and TPC-DS benchmark

2024-01-29 Thread Okumin
Hi,

We deployed a test version of Hive 4 in our production. It is mostly
working well. I appreciate the community's effort.

Let me bump this thread so that community members can share the latest
situation, and we can contribute to the remaining problems.

Currently, we list the following tickets labeled with hive-4.0.0-must[1].
- HIVE-27858: OOM happens when selecting many columns and JOIN
- HIVE-26654: Test with the TPC-DS benchmark
- HIVE-24167: No.14 (Postponed until 4.1.0)
- HIVE-26986: No.71 (No labels)

We have already applied a workaround[2] for HIVE-24167. Also, Denys
says HIVE-26986 is not a blocker of Hive 4.0.0. So, I expect
HIVE-27858 to be the only blocker.

I have some proposals about the next action.
- Can we take over HIVE-27858? We are the reporter of the ticket. It
would be reasonable If Naveen is not working on it yet
- We may review HIVE-27984[3]. As I mentioned in another thread[4], it
would have a fairly big impact on the Hive ecosystem

And I would like someone to take a look at the following PR for
HIVE-24167. It is not a blocker thanks to
`hive.optimize.cte.materialize.threshold=-1`. However, we plan to use
Hive 4 with CTE materialization.
https://github.com/apache/hive/pull/5037

Thanks!

- [1] https://issues.apache.org/jira/issues/?jql=labels%20%3D%20hive-4.0.0-must
- [2] https://github.com/apache/hive/pull/4858
- [3] https://github.com/apache/hive/pull/4984
- [4] https://www.mail-archive.com/dev@hive.apache.org/msg144280.html

Regards,
Okumin

On Thu, Nov 23, 2023 at 11:42 PM Butao Zhang  wrote:
>
> Cool! Thank you Denys for driving this!
>
>
> Thanks,
> Butao Zhang
>  Replied Message 
> | From | Attila Turoczy |
> | Date | 11/23/2023 22:29 |
> | To |  |
> | Subject | Re: Release of Hive 4 and TPC-DS benchmark |
> Excellent new Denys! Hive 4 is here! Can't wait :)
>
> -Attila
>
>
> On Thu, Nov 23, 2023 at 3:20 PM Denys Kuzmenko  wrote:
>
> Update:
> 1. Query 2, 71: Resolved in HIVE-27006 [1];
>
> 2. Query 97: Under review in HIVE-27269 [2];
> Thanks, Seonggon for providing a reproduce qfile.
>
> 3. Query 14: Reported in HIVE-24167 [3];
> set hive.optimize.cte.materialize.threshold to -1 by default in Hive 4 and
> fix it in the next versions
>
> 4. HIVE-26986 [4] is a performance improvement that is nice to have, but
> not a blocker for the release.
>
> Be advised, next week we plan to cut 4.0.0 release branch from master and
> start testing.
>
> Thanks, Denys
>
> [1] https://issues.apache.org/jira/browse/HIVE-27006
> [2] https://issues.apache.org/jira/browse/HIVE-27269
> [3] https://issues.apache.org/jira/browse/HIVE-24167
> [4] https://issues.apache.org/jira/browse/HIVE-26986
>
>


Re: How to use `engine` introduced by HIVE-22046

2024-01-09 Thread Okumin
Hi Butao,

Thanks for reminding us about the issue. +1 for the approach.

Thanks,
Okumin

On Mon, Jan 8, 2024 at 1:38 PM Butao Zhang  wrote:
>
> Hi, dev
> Bump this thread!
> I just filed a ticket to track this incompatibility issues about hms 
> column stats thrift api.  I think we can fix this at its root in HMS side and 
> then any other third components will not suffer from this issue.
> https://issues.apache.org/jira/browse/HIVE-27984
>
>
> Thanks,
> Butao Zhang
> ---- Replied Message 
> | From | Okumin |
> | Date | 8/20/2023 18:37 |
> | To |  |
> | Subject | Re: How to use `engine` introduced by HIVE-22046 |
> Hi Butao,
>
> Thanks for sharing your PR! I didn't find trinodb/trino-hive-apache
> or trinodb/hive-thrift.
>
> As mentioned in the PR, the current Thrift definitions might not be the
> final version, but it sounds reasonable to give information to external
> products since we versioned Hive 4 beta. I'm curious if anyone why we give
> different engine names to Hive and Impala and what are the recommended
> options.
>
> Thanks,
> Okumin
>
>
>
> On Fri, Aug 11, 2023 at 10:39 AM Butao Zhang  wrote:
>
> Hi, Okumin
>
>
> I have encountered this issue before, and the 'validWriteIdList' is also a
> incompatibility parameter. I have submit a PR in trino-hive-apache repo,
> and you can refer to https://github.com/trinodb/trino-hive-apache/pull/43
> .
> IIUC, the 'engine' parameter is used to differentiate between stats
> produced by different engines(Hive), but it seems that
> the downstream engines do not want to adopt the new 'engine'
> parameter.
> At present, if some engines(e.g. Trino) use the customized thrift api to
> interact wiht hms, it must change its thrift file to match the thrift
> definition of hms.
> BTW, maybe we can change hms thrift file to make the 'engine' parameter
> optional and then other customized thrift client will not have
> compatibility issues.
>
> Thanks,
>
> Butao Zhang
>
>  Replied Message 
> | From | Okumin |
> | Date | 8/10/2023 23:41 |
> | To |  |
> | Subject | How to use `engine` introduced by HIVE-22046 |
> Hi Hive developers,
>
> I noticed HIVE-22046 introduced incompatibility to Metastore APIs while I'm
> testing integration between Hive 4 and other software. If I understand
> correctly, clients are currently required to additionally specify the
> engine name when they get or update column statistics.
>
> - https://issues.apache.org/jira/browse/HIVE-22046
> - https://github.com/apache/hive/pull/741
>
> For example, Trino has a feature to use column stats and it fails. Note
> that I am not 100% sure about Trino's implementation or behavior.
>
> ```
> trino> create table hive.default.test_trino (id int);
> Query 20230810_152236_4_t9n6h failed: Required field 'engine' is unset!
> Struct:TableStatsRequest(dbName:default, tblName:test_trino, colNames:[id],
> engine:null)
> ```
>
> I have two questions about this feature.
>
> (1) Should any engine use a unique engine name?
>
> I guess some software can store or use stats compatible with Hive. I wonder
> if it can reuse engine=hive in that case, or should use a different name
> like engine=trino.
>
> I see Impala gives a unique engine name to metastore. Taking a glance,
> Spark is unlikely to be using col stats of Hive directly.
>
> - https://issues.apache.org/jira/browse/IMPALA-8842
>
> (2) Should Hive Metastore use engine=hive as a default value?
>
> If other compatible software can reuse engine=hive, it could be an option
> to accept requests with the old format assuming its engine is "hive" for
> compatibility. Or should they explicitly specify engine=hive when using
> Hive 4?
>
> Regards,
> Okumin
>


Discussion about HIVE-12679 to make IMetaStoreClient pluggable

2023-10-13 Thread Okumin
Hi,

I'm working on introducing a feature to make IMetaStoreClient pluggable.
I'm sending this e-mail to gather opinions in a visible manner because it
has controversial points.

Some Hive users need the feature in order to integrate Hive with a data
catalog other than HMS. Although the original patch was submitted more than
7 years ago and many users have wanted it, it has not been merged yet.
I revived the ticket and PR so that we can maintain or improve it within
the community.

- JIRA: https://issues.apache.org/jira/browse/HIVE-12679
- PR: https://github.com/apache/hive/pull/

I initially created the above PR based on the original design. That's
because I think it is reasonable enough and I can see some users have
already ported the patch for the past 7 years. But there are also other
opinions to suggest other designs. This is a summary for easy catch-up.

https://gist.github.com/okumin/30b058b14db1b099ba37ba7dc257fe8e

If you are interested in this problem and you have any opinions, please put
a comment on the Pull Request.

Regards,
Okumin


Re: How to use `engine` introduced by HIVE-22046

2023-08-20 Thread Okumin
Hi Butao,

Thanks for sharing your PR! I didn't find trinodb/trino-hive-apache
or trinodb/hive-thrift.

As mentioned in the PR, the current Thrift definitions might not be the
final version, but it sounds reasonable to give information to external
products since we versioned Hive 4 beta. I'm curious if anyone why we give
different engine names to Hive and Impala and what are the recommended
options.

Thanks,
Okumin



On Fri, Aug 11, 2023 at 10:39 AM Butao Zhang  wrote:

> Hi, Okumin
>
>
> I have encountered this issue before, and the 'validWriteIdList' is also a
> incompatibility parameter. I have submit a PR in trino-hive-apache repo,
> and you can refer to https://github.com/trinodb/trino-hive-apache/pull/43
> .
> IIUC, the 'engine' parameter is used to differentiate between stats
> produced by different engines(Hive), but it seems that
> the downstream engines do not want to adopt the new 'engine'
> parameter.
> At present, if some engines(e.g. Trino) use the customized thrift api to
> interact wiht hms, it must change its thrift file to match the thrift
> definition of hms.
> BTW, maybe we can change hms thrift file to make the 'engine' parameter
> optional and then other customized thrift client will not have
> compatibility issues.
>
> Thanks,
>
> Butao Zhang
>
>  Replied Message 
> | From | Okumin |
> | Date | 8/10/2023 23:41 |
> | To |  |
> | Subject | How to use `engine` introduced by HIVE-22046 |
> Hi Hive developers,
>
> I noticed HIVE-22046 introduced incompatibility to Metastore APIs while I'm
> testing integration between Hive 4 and other software. If I understand
> correctly, clients are currently required to additionally specify the
> engine name when they get or update column statistics.
>
> - https://issues.apache.org/jira/browse/HIVE-22046
> - https://github.com/apache/hive/pull/741
>
> For example, Trino has a feature to use column stats and it fails. Note
> that I am not 100% sure about Trino's implementation or behavior.
>
> ```
> trino> create table hive.default.test_trino (id int);
> Query 20230810_152236_4_t9n6h failed: Required field 'engine' is unset!
> Struct:TableStatsRequest(dbName:default, tblName:test_trino, colNames:[id],
> engine:null)
> ```
>
> I have two questions about this feature.
>
> (1) Should any engine use a unique engine name?
>
> I guess some software can store or use stats compatible with Hive. I wonder
> if it can reuse engine=hive in that case, or should use a different name
> like engine=trino.
>
> I see Impala gives a unique engine name to metastore. Taking a glance,
> Spark is unlikely to be using col stats of Hive directly.
>
> - https://issues.apache.org/jira/browse/IMPALA-8842
>
> (2) Should Hive Metastore use engine=hive as a default value?
>
> If other compatible software can reuse engine=hive, it could be an option
> to accept requests with the old format assuming its engine is "hive" for
> compatibility. Or should they explicitly specify engine=hive when using
> Hive 4?
>
> Regards,
> Okumin
>


How to use `engine` introduced by HIVE-22046

2023-08-10 Thread Okumin
Hi Hive developers,

I noticed HIVE-22046 introduced incompatibility to Metastore APIs while I'm
testing integration between Hive 4 and other software. If I understand
correctly, clients are currently required to additionally specify the
engine name when they get or update column statistics.

- https://issues.apache.org/jira/browse/HIVE-22046
- https://github.com/apache/hive/pull/741

For example, Trino has a feature to use column stats and it fails. Note
that I am not 100% sure about Trino's implementation or behavior.

```
trino> create table hive.default.test_trino (id int);
Query 20230810_152236_4_t9n6h failed: Required field 'engine' is unset!
Struct:TableStatsRequest(dbName:default, tblName:test_trino, colNames:[id],
engine:null)
```

I have two questions about this feature.

(1) Should any engine use a unique engine name?

I guess some software can store or use stats compatible with Hive. I wonder
if it can reuse engine=hive in that case, or should use a different name
like engine=trino.

I see Impala gives a unique engine name to metastore. Taking a glance,
Spark is unlikely to be using col stats of Hive directly.

- https://issues.apache.org/jira/browse/IMPALA-8842

(2) Should Hive Metastore use engine=hive as a default value?

If other compatible software can reuse engine=hive, it could be an option
to accept requests with the old format assuming its engine is "hive" for
compatibility. Or should they explicitly specify engine=hive when using
Hive 4?

Regards,
Okumin


Re: Possibility of supporting TIME or other standard types

2023-08-09 Thread Okumin
Hi Stamatis,

Thanks for your reply and +1. Anyway, I also guess there is no semantic
reason not to support the type.
I created a ticket and may try to work on it when I have a chance.
https://issues.apache.org/jira/browse/HIVE-27587

Please let me know if anyone else knows any difficulties or historical
reasons why we are not supporting it.

Regards,
Okumin


On Wed, Aug 9, 2023 at 5:50 PM Stamatis Zampetakis 
wrote:

> Hello Okumin,
>
> As you mentioned the TIME datatype is part of the SQL standard and it
> is also supported by many popular DBMS so it definitely makes sense to
> add it to Hive.
> I guess it was not implemented already cause users were able to store
> times using other existing types so it never became a must have.
>
> +1 on adding TIME data type support in Hive.
>
> Best,
> Stamatis
>
> On Fri, Aug 4, 2023 at 12:19 PM Okumin  wrote:
> >
> > Hi everyone,
> >
> > I happened to find some people struggling to store values of the TIME
> type
> > in a Hive table from another query engine. A viable option is to use
> > Iceberg or another format instead of Hive native tables since Hive
> doesn't
> > directly support the type. I agree that it could be one of the right
> > options in this era.
> >
> > On the other hand, I also think it is a valid request to support TIME
> since
> > it is one of the types defined in the SQL standard. I expect it is not a
> > bad offer also for Hive users if they can process Iceberg's TIME as
> Hive's
> > TIME, not STRING or other alternatives. I'd like to hear if we have any
> > reasons not to support it easily.
> >
> > I see there is a related ticket but looks like we have not developed TIME
> > yet.
> > https://issues.apache.org/jira/browse/HIVE-1269
> >
> > Regards,
> > Okumin
>


Possibility of supporting TIME or other standard types

2023-08-04 Thread Okumin
Hi everyone,

I happened to find some people struggling to store values of the TIME type
in a Hive table from another query engine. A viable option is to use
Iceberg or another format instead of Hive native tables since Hive doesn't
directly support the type. I agree that it could be one of the right
options in this era.

On the other hand, I also think it is a valid request to support TIME since
it is one of the types defined in the SQL standard. I expect it is not a
bad offer also for Hive users if they can process Iceberg's TIME as Hive's
TIME, not STRING or other alternatives. I'd like to hear if we have any
reasons not to support it easily.

I see there is a related ticket but looks like we have not developed TIME
yet.
https://issues.apache.org/jira/browse/HIVE-1269

Regards,
Okumin


Re: Introduce Uniffle : A stability solution of Hive's shuffle

2023-07-20 Thread Okumin
Hi Rory,

Let me express my gratitude and positive impression of Uniffle.
Actually, we also feel the necessity of a shuffle service for our Hive
deployment, and I've been watching the project. I will check the
implementation for Tez and send feedback or PRs if I find something.

Regards,
Okumin

On Tue, Jul 11, 2023 at 9:48 PM roryqi  wrote:

> Dear Apache Hive community,
>
>
> We are delighted to announce the support of Tez on Uniffle.  Uniffle havs
> supported Apache Spark, Apache,Hadoop MapReduce and Apache Tez.
>
> Uniffle is a remote shuffle service. In several situations, Uniffle will
> provide great help.
>
>1. If you use AWS spot instances or mix resources, tasks may be
>preempted. It will be great if we store shuffle data in the Uniffle and
> we
>can deploy Uniffle on some stable resource. It will improve the
> stability
>of tasks. If tasks are preempted, we won’t recompute tasks if we store
>shuffle in the Uniffle.
>2. For large shuffle jobs, Uniffle can reduce random IO for the jobs.
>Uniffle can improve the performance of jobs. For 1TB MapReduce
> Terasort, 1w
>map tasks, 1w reduce tasks, job performance will increase 30%.
>
> We also welcome pull requests and are eager to see how you might use
> Uniffle to make Hive more user-friendly. More information, you can access
> https://github.com/apache/incubator-uniffle
>
>
> Best
>
> Rory
>


[jira] [Created] (HIVE-27120) Warn when Authorizer V2 is configured

2023-03-03 Thread okumin (Jira)
okumin created HIVE-27120:
-

 Summary: Warn when Authorizer V2 is configured
 Key: HIVE-27120
 URL: https://issues.apache.org/jira/browse/HIVE-27120
 Project: Hive
  Issue Type: Improvement
Reporter: okumin


SessionState configures an internal param that is not listed in HiveConf. It 
causes WARN.
{code:java}
pod/hive-hiveserver2-7fc4df88b6-dmn8v: 2023-03-01T13:50:38,959  WARN [main] 
conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker 
does not exist {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-26184) COLLECT_SET with GROUP BY is very slow when some keys are highly skewed

2022-04-28 Thread okumin (Jira)
okumin created HIVE-26184:
-

 Summary: COLLECT_SET with GROUP BY is very slow when some keys are 
highly skewed
 Key: HIVE-26184
 URL: https://issues.apache.org/jira/browse/HIVE-26184
 Project: Hive
  Issue Type: Bug
  Components: Hive
Affects Versions: 3.1.3, 2.3.8
Reporter: okumin
Assignee: okumin


I observed some reducers spend 98% of CPU time in invoking 
`java.util.HashMap#clear`.

Looking the detail, I found COLLECT_SET reuses a LinkedHashSet and its `clear` 
can be quite heavy when a relation has a small number of highly skewed keys.

 

To reproduce the issue, first, we will create rows with a skewed key.
{code:java}
INSERT INTO test_collect_set
SELECT '----' AS key, CAST(UUID() AS VARCHAR) 
AS value
FROM table_with_many_rows
LIMIT 10;{code}
Then, we will create many non-skewed rows.
{code:java}
INSERT INTO test_collect_set
SELECT UUID() AS key, UUID() AS value
FROM sample_datasets.nasdaq
LIMIT 500;{code}
We can observe the issue when we aggregate values by `key`.
{code:java}
SELECT key, COLLECT_SET(value) FROM group_by_skew GROUP BY key{code}



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (HIVE-25410) CommonMergeJoinOperator fails when a join key is ARRAY with arbitrary size

2021-07-30 Thread okumin (Jira)
okumin created HIVE-25410:
-

 Summary: CommonMergeJoinOperator fails when a join key is ARRAY 
with arbitrary size
 Key: HIVE-25410
 URL: https://issues.apache.org/jira/browse/HIVE-25410
 Project: Hive
  Issue Type: Bug
  Components: Hive
Reporter: okumin
Assignee: okumin
 Fix For: 4.0.0


Thanks to HIVE-24883, CommonMergeJoinOperator can handle ARRAY or STRUCT types 
as a JOIN key.

There are corner cases where CommonMergeJoinOperator fails with 
`ArrayIndexOutOfBoundsException`.

 

This is a simple case.
{code:java}
SET hive.auto.convert.join=false;
CREATE TABLE table_list_types (id int, key array);
INSERT INTO table_list_types VALUES (1, array(1, 2)), (2, array(1, 2)), (3, 
array(1, 2, 3)), (4, array(1, 2, 3));
SELECT * FROM table_list_types t1 INNER JOIN table_list_types t2 ON t1.key = 
t2.key; {code}
With 69c97c26ac68a245f4d327cc2f7b3a2333f8fa84, the following error happened.
{code:java}
Caused by: java.lang.ArrayIndexOutOfBoundsException: 2
at 
org.apache.hadoop.hive.ql.exec.HiveStructComparator.compare(HiveStructComparator.java:57)
at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.compareKey(CommonMergeJoinOperator.java:629)
at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.compareKeys(CommonMergeJoinOperator.java:597)
at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.processKey(CommonMergeJoinOperator.java:566)
at 
org.apache.hadoop.hive.ql.exec.CommonMergeJoinOperator.process(CommonMergeJoinOperator.java:249)
at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
... 26 more {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-24606) Multi-stage materialized CTEs can lost intermediate data

2021-01-08 Thread okumin (Jira)
okumin created HIVE-24606:
-

 Summary: Multi-stage materialized CTEs can lost intermediate data
 Key: HIVE-24606
 URL: https://issues.apache.org/jira/browse/HIVE-24606
 Project: Hive
  Issue Type: Bug
  Components: Query Planning
Affects Versions: 3.1.2, 2.3.7, 4.0.0
Reporter: okumin
Assignee: okumin


With complex multi-stage CTEs, Hive can start a latter stage before its 
previous stage finishes.
 That's because `SemanticAnalyzer#toRealRootTasks` can fail to resolve 
dependency between multistage materialized CTEs when a non-materialized CTE 
cuts in.
 
[https://github.com/apache/hive/blob/425e1ff7c054f87c4db87e77d004282d529599ae/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L1414]

 

For example, when submitting this query,
{code:sql}
SET hive.optimize.cte.materialize.threshold=2;
SET hive.optimize.cte.materialize.full.aggregate.only=false;

WITH x AS ( SELECT 'x' AS id ), -- not materialized
a1 AS ( SELECT 'a1' AS id ), -- materialized by a2 and the root
a2 AS ( SELECT 'a2 <- ' || id AS id FROM a1) -- materialized by the root
SELECT * FROM a1
UNION ALL
SELECT * FROM x
UNION ALL
SELECT * FROM a2
UNION ALL
SELECT * FROM a2;
{code}
`toRealRootTask` will traverse the CTEs in order of `a1`, `x`, and `a2`. It 
means the dependency between `a1` and `a2` will be ignored and `a2` can start 
without waiting for `a1`. As a result, the above query returns the following 
result.
{code:java}
+-+
| id  |
+-+
| a1  |
| x   |
+-+
{code}
For your information, I ran this test with revision = 
425e1ff7c054f87c4db87e77d004282d529599ae.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-24485) Make the slow-start behavior tunable

2020-12-04 Thread okumin (Jira)
okumin created HIVE-24485:
-

 Summary: Make the slow-start behavior tunable
 Key: HIVE-24485
 URL: https://issues.apache.org/jira/browse/HIVE-24485
 Project: Hive
  Issue Type: Improvement
  Components: Hive, Tez
Affects Versions: 3.1.2, 4.0.0
Reporter: okumin
Assignee: okumin


This ticket would enable users to configure the timing of slow-start with 
`tez.shuffle-vertex-manager.min-src-fraction` and 
`tez.shuffle-vertex-manager.max-src-fraction`.
Hive on Tez currently doesn't honor these parameters and ShuffleVertexManager 
always uses the default value.
We can control the timing to start vertexes the accuracy of estimated input 
size if we can tweak these ones. This is useful when a vertex has tasks that 
process a different amount of data.

 

We can reproduce the issue with this query.
{code:java}
SET hive.tez.auto.reducer.parallelism=true;
SET hive.tez.min.partition.factor=1.0; -- enforce auto-parallelism
SET tez.shuffle-vertex-manager.min-src-fraction=0.55;
SET tez.shuffle-vertex-manager.max-src-fraction=0.95;
CREATE TABLE mofu (name string);
INSERT INTO mofu (name) VALUES ('12345');
SELECT name, count(*) FROM mofu GROUP BY name;{code}
The fractions are ignored.
{code:java}
2020-12-04 11:41:42,484 [INFO] [Dispatcher thread {Central}] 
|vertexmanager.ShuffleVertexManagerBase|: Settings minFrac: 0.25 maxFrac: 0.75 
auto: true desiredTaskIput: 25600
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-24286) Render date and time with progress of Hive on Tez

2020-10-19 Thread okumin (Jira)
okumin created HIVE-24286:
-

 Summary: Render date and time with progress of Hive on Tez
 Key: HIVE-24286
 URL: https://issues.apache.org/jira/browse/HIVE-24286
 Project: Hive
  Issue Type: Improvement
Affects Versions: 4.0.0
Reporter: okumin
Assignee: okumin


Add date/time to each line written by RenderStrategy like MapReduce and Spark.

 
 * 
[https://github.com/apache/hive/blob/31c1658d9884eb4f31b06eaa718dfef8b1d92d22/ql/src/java/org/apache/hadoop/hive/ql/exec/mr/HadoopJobExecHelper.java#L350]
 * 
[https://github.com/apache/hive/blob/31c1658d9884eb4f31b06eaa718dfef8b1d92d22/ql/src/java/org/apache/hadoop/hive/ql/exec/spark/status/RenderStrategy.java#L64-L67]

 

This ticket would add the current time to the head of each line.

 
{code:java}
2020-10-19 13:32:41,162 Map 1: 0/1  Reducer 2: 0/1  
2020-10-19 13:32:44,231 Map 1: 0/1  Reducer 2: 0/1  
2020-10-19 13:32:46,813 Map 1: 0(+1)/1  Reducer 2: 0/1  
2020-10-19 13:32:49,878 Map 1: 0(+1)/1  Reducer 2: 0/1  
2020-10-19 13:32:51,416 Map 1: 1/1  Reducer 2: 0/1  
2020-10-19 13:32:51,936 Map 1: 1/1  Reducer 2: 0(+1)/1  
2020-10-19 13:32:52,877 Map 1: 1/1  Reducer 2: 1/1  
{code}
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-24240) Implement missing features in UDTFStatsRule

2020-10-07 Thread okumin (Jira)
okumin created HIVE-24240:
-

 Summary: Implement missing features in UDTFStatsRule
 Key: HIVE-24240
 URL: https://issues.apache.org/jira/browse/HIVE-24240
 Project: Hive
  Issue Type: Improvement
Affects Versions: 4.0.0
Reporter: okumin
Assignee: okumin


Add the following steps.
 * Handle the case in which the num row will be zero
 * Compute runtime stats in case of a re-execution



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-24203) Implement stats annotation rule for the LateralViewJoinOperator

2020-09-27 Thread okumin (Jira)
okumin created HIVE-24203:
-

 Summary: Implement stats annotation rule for the 
LateralViewJoinOperator
 Key: HIVE-24203
 URL: https://issues.apache.org/jira/browse/HIVE-24203
 Project: Hive
  Issue Type: Improvement
  Components: Physical Optimizer
Affects Versions: 2.3.7, 3.1.2, 4.0.0
Reporter: okumin
Assignee: okumin


StatsRulesProcFactory doesn't have any rules to handle a JOIN by LATERAL VIEW.

This can cause an underestimation in case that UDTF in LATERAL VIEW generates 
multiple rows.

HIVE-20262 has already added the rule for UDTF.

This issue would add the rule for LateralViewJoinOperator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-23831) Enforce auto-parallelism of Tez even if the number of reducers can be 1

2020-07-09 Thread okumin (Jira)
okumin created HIVE-23831:
-

 Summary: Enforce auto-parallelism of Tez even if the number of 
reducers can be 1
 Key: HIVE-23831
 URL: https://issues.apache.org/jira/browse/HIVE-23831
 Project: Hive
  Issue Type: Improvement
  Components: Hive
Reporter: okumin
Assignee: okumin


Hive on Tez disables auto-reduce and uses the maximum number of partitions when 
the minimum number of reducers, `estimated number * min.partition.factor`, is 
less than or equal to 1.

[https://github.com/apache/hive/blob/f98e136bdd5642e3de10d2fd1a4c14d1d6762113/ql/src/java/org/apache/hadoop/hive/ql/parse/GenTezUtils.java#L121-L131]

https://issues.apache.org/jira/browse/HIVE-14200

 

This can make Tez lose chance to optimize # of reducers especially in an 
environment with small `hive.tez.min.partition.factor`.

I'd say it would be fine to introduce an option to enforce auto-reduce.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22618) Fix checkstyle violations for ParseUtils

2019-12-10 Thread okumin (Jira)
okumin created HIVE-22618:
-

 Summary: Fix checkstyle violations for ParseUtils
 Key: HIVE-22618
 URL: https://issues.apache.org/jira/browse/HIVE-22618
 Project: Hive
  Issue Type: Task
  Components: Query Processor
Reporter: okumin
Assignee: okumin


ParseUtils.java has incorrect indents and it causes any changes to violate 
checkstyle.

[https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/ParseUtils.java#L274]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (HIVE-22601) Some columns will be lost when a UDTF has multiple aliases in some cases

2019-12-09 Thread okumin (Jira)
okumin created HIVE-22601:
-

 Summary: Some columns will be lost when a UDTF has multiple 
aliases in some cases
 Key: HIVE-22601
 URL: https://issues.apache.org/jira/browse/HIVE-22601
 Project: Hive
  Issue Type: Bug
  Components: Query Processor
Affects Versions: 3.1.2
Reporter: okumin
Assignee: okumin


Only one column will be retained when putting UDTFs with multiple aliases and a 
top-level UNION together.

For example, the result of the following SQL should have three columns, c1, c2 
and c3.
{code:java}
SELECT stack(1, 'a', 'b', 'c') AS (c1, c2, c3)
UNION ALL
SELECT stack(1, 'd', 'e', 'f') AS (c1, c2, c3);
{code}
However, It's only the c3 column which I can get.
{code:java}
+-+
| _u1.c3  |
+-+
| c   |
| f   |
+-+
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)