Re: [ANNOUNCE] Apache Hive 4.0.0 Released

2024-04-04 Thread Sungwoo Park
Congratulations and huge thanks to the Apache Hive team and contributors for
releasing Hive 4. We have been watching the development of Hive 4 since the
release of Hive 3.1, and it is truly satisfying to witness the resolution of
all the critical issues at last after 5 years. Hive 4 comes with a lot of
great new features, and our initial performance benchmarking indicates a
significant speed improvement over Hive 3.

--- Sungwoo

On Wed, Apr 3, 2024 at 10:30 PM Okumin  wrote:

> I'm really excited to see the news! I can easily imagine the
> difficulty of testing and shipping Hive 4.0.0 with more than 5k
> commits. I'm proud to have witnessed this moment here.
>
> Thank you!
>
> On Wed, Apr 3, 2024 at 3:07 AM Naveen Gangam  wrote:
> >
> > Thank you for the tremendous amount of work put in by many, many folks to
> make this release happen, including projects Hive depends upon, like
> Tez.
> >
> > Thank you to all the PMC members, committers and contributors for all
> the work over the past 5+ years in shaping this release.
> >
> > THANK YOU!!!
> >
> > On Sun, Mar 31, 2024 at 8:54 AM Battula, Brahma Reddy 
> wrote:
> >>
> >> Thank you for your hard work and dedication in releasing Apache Hive
> version 4.0.0.
> >>
> >>
> >>
> >> Congratulations to the entire team on this achievement. Keep up the
> great work!
> >>
> >>
> >>
> >> Does this count as GA?
> >>
> >>
> >>
> >> And it looks like we need to update the following location as well:
> >>
> >> https://hive.apache.org/general/downloads/
> >>
> >>
> >>
> >>
> >>
> >> From: Denys Kuzmenko 
> >> Date: Saturday, March 30, 2024 at 00:07
> >> To: u...@hive.apache.org , dev@hive.apache.org <
> dev@hive.apache.org>
> >> Subject: [ANNOUNCE] Apache Hive 4.0.0 Released
> >>
> >> The Apache Hive team is proud to announce the release of Apache Hive
> >>
> >> version 4.0.0.
> >>
> >>
> >>
> >> The Apache Hive (TM) data warehouse software facilitates querying and
> >>
> >> managing large datasets residing in distributed storage. Built on top
> >>
> >> of Apache Hadoop (TM), it provides, among others:
> >>
> >>
> >>
> >> * Tools to enable easy data extract/transform/load (ETL)
> >>
> >>
> >>
> >> * A mechanism to impose structure on a variety of data formats
> >>
> >>
> >>
> >> * Access to files stored either directly in Apache HDFS (TM) or in other
> >>
> >>   data storage systems such as Apache HBase (TM)
> >>
> >>
> >>
> >> * Query execution via Apache Hadoop MapReduce, Apache Tez and Apache
> Spark frameworks. (MapReduce is deprecated, and Spark has been removed so
> the text needs to be modified depending on the release version)
> >>
> >>
> >>
> >> For Hive release details and downloads, please visit:
> >>
> >> https://hive.apache.org/downloads.html
> >>
> >>
> >>
> >> Hive 4.0.0 Release Notes are available here:
> >>
> >>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12343343&styleName=Text&projectId=12310843
> >>
> >>
> >>
> >> We would like to thank the many contributors who made this release
> >>
> >> possible.
> >>
> >>
> >>
> >> Regards,
> >>
> >>
> >>
> >> The Apache Hive Team
>


Re: Release of Hive 4 and TPC-DS benchmark

2023-11-07 Thread Sungwoo Park
>
> Based on HIVE-26654, it looks like we have 3 PR pending review:
> 1. HIVE-26986 - Query 71
> 2. HIVE-27006 - Query 2
> 3. HIVE-27269 - Query 97 (is that ready to be reviewed?)
>

Yes, Seonggon just submitted a pull request for HIVE-27269. It is not the
simple fix that I originally proposed; it is a complete fix with a test
case.


> We'll prioritize those.
>
> For query 14, as you suggested, we might set
> `hive.optimize.cte.materialize.threshold` to -1 by default for now and fix
> it in the following releases.
>

We will submit this pull request sometime soon.

Thanks,

--- Sungwoo


Release of Hive 4 and TPC-DS benchmark

2023-11-03 Thread Sungwoo Park

Hi everyone,

I would like to resume the discussion on the release of Hive 4 and 
the result of the TPC-DS benchmark.


Currently there are four unresolved JIRAs marked 'hive-4.0.0-must' which must be 
resolved before the release of Hive 4 ([1], [2], [3], [4]). The most urgent one 
is perhaps HIVE-26654 [1] which reports failing queries in the TPC-DS benchmark. 
(All these bugs were introduced after the release of Hive 3.1.2 which passes all 
the TPC-DS tests.)


Originally we reported 7 failing cases in HIVE-26654. Since then, 3 cases have 
been resolved, 2 cases have pull requests, and 2 cases don't have pull requests 
yet.


1. Query 17: Resolved in HIVE-26655 [6]
2. Query 16, 69, 94: Resolved in HIVE-26659 [8]
3. Query 64: Resolved in HIVE-26968 [10]

4. Query 2: Pull request available in HIVE-27006 [5]
5. Query 71: Pull request available in HIVE-26986 [9]

6. Query 14: Reported in HIVE-24167 [7]
7. Query 97: Reported in HIVE-27269 [11]

Seonggon and I (in the MR3 team) have been working on these problems, and so far
we have submitted 4 pull requests. Two of them have been merged, but the other
two (for query 2 and query 71) are not being reviewed. I'd appreciate it very
much if Hive committers could review the remaining pull requests.


The remaining problems are query 14 and query 97.

For query 14, I suggest that we take a simple workaround by setting
hive.optimize.cte.materialize.threshold to -1 by default, because nobody seems
to be working on this JIRA. If necessary, we could try to fix it after the
release of Hive 4.
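
As a concrete reference, the workaround described above would amount to a
one-line configuration change. The property name comes from the discussion
itself; applying it via SET in a HiveServer2 session is just one possible way
(it could equally go into hive-site.xml):

```sql
-- Workaround sketch for query 14: disable CTE materialization entirely.
-- Setting the threshold to -1 turns off materialization of common table
-- expressions, sidestepping the bug until a proper fix lands.
SET hive.optimize.cte.materialize.threshold=-1;
```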


For query 97 (which we think is the most challenging one among all the 
sub-JIRAs), we have a few choices:


1) Use a quick-fix solution by ignoring hive.mapjoin.hashtable.load.threads when 
FullOuterJoin is used

2) Fix HIVE-25583 [12] which introduces this bug
3) Fix it properly

I suggest that we take a quick-fix solution and revisit the problem after the 
release of Hive 4.


(We have also observed a performance regression in Hive, but I guess that is
another topic to discuss after fixing the correctness issues.)


Please let us know what you think.

Thanks,

--- Sungwoo

[1] https://issues.apache.org/jira/browse/HIVE-26654
[2] https://issues.apache.org/jira/browse/HIVE-27226
[3] https://issues.apache.org/jira/browse/HIVE-26505
[4] https://issues.apache.org/jira/browse/HIVE-22636
[5] https://issues.apache.org/jira/browse/HIVE-27006
[6] https://issues.apache.org/jira/browse/HIVE-26655
[7] https://issues.apache.org/jira/browse/HIVE-24167
[8] https://issues.apache.org/jira/browse/HIVE-26659
[9] https://issues.apache.org/jira/browse/HIVE-26986
[10] https://issues.apache.org/jira/browse/HIVE-26968
[11] https://issues.apache.org/jira/browse/HIVE-27269
[12] https://issues.apache.org/jira/browse/HIVE-25583




Re: Introduce Uniffle : A stability solution of Hive's shuffle

2023-09-29 Thread Sungwoo Park
In addition to the two main benefits summarized by Rory, I would like to
add another benefit of using remote shuffle service:

3. If you run large jobs in public clouds, sometimes the amount of local
storage attached to your instances can be a limiting factor. By using
remote shuffle service, you can cut the usage of local storage by half
(because shuffle data is sent to remote shuffle service, rather than
written to local storage).

Although you still need local storage for the remaining half, using remote
shuffle service opens new possibilities of further reducing local storage
(e.g., directly reading from network rather than spilling to local disk).

Thanks,

--- Sungwoo

On Tue, Jul 11, 2023 at 9:48 PM roryqi  wrote:

> Dear Apache Hive community,
>
>
> We are delighted to announce the support of Tez on Uniffle. Uniffle now
> supports Apache Spark, Apache Hadoop MapReduce, and Apache Tez.
>
> Uniffle is a remote shuffle service. In several situations, Uniffle can
> provide great help:
>
>    1. If you use AWS spot instances or mixed resources, tasks may be
>    preempted. By storing shuffle data in Uniffle, which can be deployed on
>    stable resources, task stability improves: if tasks are preempted, their
>    shuffle data does not need to be recomputed.
>    2. For large shuffle jobs, Uniffle can reduce random IO and improve job
>    performance. For a 1TB MapReduce Terasort with 10,000 map tasks and
>    10,000 reduce tasks, job performance increases by about 30%.
>
> We also welcome pull requests and are eager to see how you might use
> Uniffle to make Hive more user-friendly. For more information, you can visit
> https://github.com/apache/incubator-uniffle
>
>
> Best
>
> Rory
>


Re: Move to JDK-11

2023-05-31 Thread Sungwoo Park
Hi, everyone.

I have not tested the master branch with Java 11/17 yet, but I would like
to share my experience with testing a fork of branch-3.1 with Java 11/17
(as part of developing Hive-MR3), in case it is useful for the
discussion. I merged the patches listed in HIVE-22415 [1] and updated the
Maven configuration for Java 11.

1. Building Hive was fine and I was able to run it with Java 11 as well as
Java 17. So, it seems that the work reported in [1] is indeed complete for
upgrading to Java 11 (and Java 17) and getting Hive to work.

2. However, there was a problem with running tests, so this can be
additional work for upgrading to Java 11.

3. For performance, Java 17 gives a free performance improvement of about 8
percent. When tested with 10TB TPC-DS, Java 8 takes 8074 seconds,
whereas Java 17 takes 7415 seconds. Considering the maturity of Hive, I
think this is not a small improvement, because almost every query gets some
speedup.

Thanks,

--- Sungwoo

[1] https://issues.apache.org/jira/browse/HIVE-22415
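
For anyone attempting the same exercise, the Maven configuration change
mentioned above was roughly of the following shape. This is a sketch using
the standard Maven compiler properties; the exact properties and plugin
configuration in Hive's pom.xml may differ:

```xml
<!-- Sketch of a pom.xml change to target Java 11. The property names are
     standard Maven conventions, not necessarily what Hive's build uses. -->
<properties>
  <maven.compiler.source>11</maven.compiler.source>
  <maven.compiler.target>11</maven.compiler.target>
</properties>
```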


On Thu, Jun 1, 2023 at 3:53 AM Sai Hemanth Gantasala
 wrote:

> Hi All,
>
> I would strongly advocate keeping support for JDK8.
> Between JDK11 and JDK17, depending on the amount of effort for the upgrade,
> I'm inclined towards JDK17 (JDK21 LTS will be released in Sep 2023).
>
> Thanks,
> Sai.
>
> On Wed, May 31, 2023 at 5:39 AM László Bodor 
> wrote:
>
> > *Hi!*
> >
> >
> > *Should we support both JDK-11 & JDK-8?*
> > IMO absolutely yes, let's not break up with JDK-8: according to its
> > lifecycle, it's going to stay with us for a long time.
> >
> > I believe
> > a) we should be able to compile on JDK8, JDK11, and JDK17 (github actions
> > can cover this conveniently in precommit time, like tez
> > )
> > b) the release artifacts should be compatible with JDK8 as long as it is
> > with us.
> >
> > Regards,
> > Laszlo Bodor
> >
> >
> > Butao Zhang  ezt írta (időpont: 2023. máj. 31.,
> Sze,
> > 14:33):
> >
> > > Thanks Ayush for driving this! Good to know that Hive is getting ready
> > for
> > > newer JDK.
> > > In my opinion, if we have more community energy to put into it, we can
> > > support both JDK-11 and JDK-17 like Spark [1]. If we have to make a
> > > choice between JDK-11 and JDK-17, I would like to choose the relatively
> > > new version JDK-17; meanwhile, we should maintain compatibility with
> > > JDK-8, as JDK-8 is still widely used in most big data platforms.
> > >
> > >
> > > Thanks,
> > > Butao Zhang
> > >
> > >
> > > [1]https://issues.apache.org/jira/browse/SPARK-33772
> > >  Replied Message 
> > > | From | Ayush Saxena |
> > > | Date | 5/31/2023 18:39 |
> > > | To | dev |
> > > | Subject | Move to JDK-11 |
> > > Hi Everyone,
> > > Want to pull the attention of folks towards moving to JDK-11 compile
> > > time support in Hive. There was a ticket in the past [1] which talks
> > > about it, and if I decode it right, it was blocked because the Hadoop
> > > version used by Hive didn't have JDK-11 runtime support. But with [2] we
> > > have upgraded the Hadoop version, so that problem is sorted out. I
> > > couldn't even see any unresolved tickets in the blocked state either.
> > >
> > > I quickly tried* a  mvn clean install -DskipTests -Piceberg -Pitests
> > > -Dmaven.javadoc.skip=true
> > >
> > > And no surprises it failed with some weird exceptions towards the end.
> > But
> > > I think that should be solvable.
> > >
> > > So, Questions?
> > >
> > > - What do folks think about this? Should we put in some effort towards
> > > JDK-11
> > > - Should we support both JDK-11 & JDK-8?
> > > - Ditch JDK-11 and directly shoot for JDK-17?
> > >
> > > Let me know your thoughts. In case anyone has some experience in this
> > > area and has tried something in this context, feel free to share, or
> > > maybe if someone has a potential action plan or so.
> > >
> > > -Ayush
> > >
> > > [1] https://issues.apache.org/jira/browse/HIVE-22415
> > > [2] https://issues.apache.org/jira/browse/HIVE-24484
> > >
> > > * changed the maven.compiler.source & maven.compiler.target to 11
> > >
> >
>


Re: Question on hive.merge.nway.joins,

2023-05-26 Thread Sungwoo Park
Hi Okumin,

I agree that setting hive.merge.nway.joins to true (or false) can have both
positive and negative effects on the performance, depending on the query
and other related configuration parameters. I thought the default value was
set to false for some correctness issue (for
example, mapjoin_filter_on_outerjoin.q fails).

While investigating this issue, I found a bug and created a new JIRA (with
pull-request created by Seonggon). This bug is not easy to reproduce, but
it does occur (causing NPE) when hive.merge.nway.joins is set to true.

https://issues.apache.org/jira/browse/HIVE-27375

Thanks,

--- Sungwoo


On Sat, May 27, 2023 at 12:40 AM おくみん  wrote:

> Hi Sungwoo,
>
> I have totally no idea why we changed the default value. I'm just sharing
> my knowledge and experience.
>
> First, I know there is a known issue when we use it with Tez. We can see
> the following statement on the official website
> <https://cwiki.apache.org/confluence/display/hive/configuration+properties
> >.
>
> > For multiple joins on the same condition, merge joins together into a
> single join operator. This is useful in the case of large shuffle joins to
> avoid a reshuffle phase. Disabling this in Tez will often provide a faster
> join algorithm in case of left outer joins or a general Snowflake schema.
>
> Honestly, I don't know the details. But I have had one negative experience
> so far. While I was using Hive 2 with `hive.merge.nway.joins=true`, Merge
> Join was applied even though one or two of the tables were small enough. The
> performance degraded because the largest table had a skew on the join key.
> If I remember correctly, `hive.merge.nway.joins` merges multiple joins in
> an early stage, and some optimization can miss a chance. Of course, I know
> it can also positively work in some cases.
>
> Note that the version I used is a bit old, my memory could be wrong, and
> again I am not sure about the concrete background of HIVE-21189.
>
> Thanks,
> Okumin
>
>
> On Thu, May 25, 2023 at 7:48 PM Sungwoo Park  wrote:
>
> > Hello,
> >
> > In HIVE-21189 [1], the default value for hive.merge.nway.joins is set to
> > false. There is no record of why it was set to false, and I would like to
> > understand the background for the decision. Specifically I wonder if the
> > following situation is relevant to the decision.
> >
> > Example)
> > MapJoinOp_1 joins: table G, table A, table B, table C
> > MapJoinOp_2 joins: table G, table A, table B  , table D
> >
> > Here, table G is a big table to be read via shuffling.
> > MapJoinOp_1 needs table C, while MapJoinOp_2 needs table D.
> > SharedWorkOptimizer assigns the same cache key to MapJoinOp_1 and
> > MapJoinOp_2 (because of table G and table A), so that both operators can
> > share in-memory tables.
> >
> > Assume that MapJoinOp_1 is executed first and fills the cache first.
> Then,
> > MapJoinOp_2 does not load the cache which is already filled. As a result,
> > it ends up with something like NullPointerException.
> >
> > After setting hive.merge.nway.joins to true, I encountered a problem
> (which
> > is not easy to reproduce), and I wonder if the above scenario is feasible
> > in the current implementation.
> >
> > Many thanks,
> >
> > --- Sungwoo
> >
> >
> >
> >
> >
> >
> > [1] https://issues.apache.org/jira/browse/HIVE-21189
> >
>


Question on hive.merge.nway.joins,

2023-05-25 Thread Sungwoo Park
Hello,

In HIVE-21189 [1], the default value for hive.merge.nway.joins is set to
false. There is no record of why it was set to false, and I would like to
understand the background for the decision. Specifically I wonder if the
following situation is relevant to the decision.

Example)
MapJoinOp_1 joins: table G, table A, table B, table C
MapJoinOp_2 joins: table G, table A, table B  , table D

Here, table G is a big table to be read via shuffling.
MapJoinOp_1 needs table C, while MapJoinOp_2 needs table D.
SharedWorkOptimizer assigns the same cache key to MapJoinOp_1 and
MapJoinOp_2 (because of table G and table A), so that both operators can
share in-memory tables.

Assume that MapJoinOp_1 is executed first and fills the cache first. Then,
MapJoinOp_2 does not load the cache which is already filled. As a result,
it ends up with something like NullPointerException.
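
A hedged sketch of a query shape that could exhibit the scenario above. The
table and column names are illustrative, not taken from an actual
reproduction:

```sql
-- Illustrative only: two branches join the same big table g and small
-- tables a and b, but differ in a third small table (c vs d). With
-- hive.merge.nway.joins=true, SharedWorkOptimizer could assign both map
-- join operators the same cache key, so whichever operator runs second
-- would find the cache already filled for the other branch.
SET hive.merge.nway.joins=true;

SELECT g.k FROM g
  JOIN a ON g.k = a.k
  JOIN b ON g.k = b.k
  JOIN c ON g.k = c.k    -- MapJoinOp_1 needs table c
UNION ALL
SELECT g.k FROM g
  JOIN a ON g.k = a.k
  JOIN b ON g.k = b.k
  JOIN d ON g.k = d.k;   -- MapJoinOp_2 needs table d
```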

After setting hive.merge.nway.joins to true, I encountered a problem (which
is not easy to reproduce), and I wonder if the above scenario is feasible
in the current implementation.

Many thanks,

--- Sungwoo






[1] https://issues.apache.org/jira/browse/HIVE-21189


Re: [DISCUSS] Nightly snaphot builds

2023-05-22 Thread Sungwoo Park
I think such nightly builds will be useful for testing and debugging in the
future.

I also wonder if we can somehow create builds even from previous commits
(e.g., for the past few years). Such builds from previous commits don't
have to be daily builds, and I think weekly builds (or even monthly builds)
would also be very useful.

The reason I wish such builds were available is to facilitate debugging and
testing. When tested against the TPC-DS benchmark, the current master
branch has several correctness problems that were introduced after the
release of Hive 3.1.2. We have reported all problems known to us in [1] and
also submitted several patches. If such nightly builds had been available,
we would have saved quite a bit of time for implementing the patches by
quickly finding offending commits that introduced new correctness bugs.

In addition, you can find quite a few commits in the master branch that
report bugs which are not reproduced in Hive 3.1.2. Examples: HIVE-19990,
HIVE-14557, HIVE-21132, HIVE-21188, HIVE-21544, HIVE-22114,
HIVE-7, HIVE-22236, HIVE-23911, HIVE-24198, HIVE-22777,
HIVE-25170, HIVE-25864, HIVE-26671.
(There may be some errors in this list because we compared against Hive
3.1.2 with many patches backported.) Such nightly builds can be useful for
finding root causes of such bugs.

Ideally, I wish there were an automated procedure to create nightly builds,
run the TPC-DS benchmark, and report correctness/performance results, although
this would be quite hard to implement. (I remember Spark implemented such a
procedure in the era of Spark 2, but my memory could be wrong.)

[1] https://issues.apache.org/jira/browse/HIVE-26654
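
A minimal sketch of what such a nightly job could look like with GitHub
Actions. The file name, trigger, and build command are assumptions for
illustration, not an existing Hive workflow:

```yaml
# Hypothetical .github/workflows/nightly.yml -- not an existing Hive workflow.
name: nightly-snapshot
on:
  schedule:
    - cron: '0 2 * * *'   # once a day at 02:00 UTC
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-java@v3
        with:
          distribution: temurin
          java-version: '8'
      # Deploy snapshot jars to a package repository (e.g. GitHub Packages).
      - name: Build and deploy snapshot jars
        run: mvn clean deploy -DskipTests
```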


On Tue, May 23, 2023 at 10:44 AM Ayush Saxena  wrote:

> Hi Vihang,
> +1, We were even exploring publishing the docker images of the snapshot
> version as well per commit or maybe weekly, so just shoot 2 docker commands
> and you get a Hive cluster running with master code.
>
> Sai, I think to spin up an env via Docker with all these things should be
> doable for sure, but would require someone with real good expertise with
> docker as well as setting up these services with Hive. Obviously, I am not
> that guy :-)
>
> @Simhadri has a PR which publishes docker images once a release tag is
> pushed, you can explore to have similar stuff for the Snapshot version,
> maybe if that sounds cool
>
> -Ayush
>
> On Tue, 23 May 2023 at 04:26, Sai Hemanth Gantasala
>  wrote:
>
> > Hi Vihang,
> >
> > +1 on the idea.
> >
> > This is a great idea to quickly test if a certain feature is working as
> > expected on a certain branch.
> > This way we can test for data loss, correctness, or any other unexpected
> > scenarios that are Hive-specific. However, I'm wondering if it is possible
> > to deploy/test in a kerberized environment, or to test issues involving
> > authorization services like Sentry/Ranger.
> >
> > Thanks,
> > Sai.
> >
> > On Mon, May 22, 2023 at 11:15 AM vihang karajgaonkar <
> vihan...@apache.org>
> > wrote:
> >
> > > Hello Team,
> > >
> > > I have observed that it is a common use case for users to want to try
> > > out unreleased features/bug fixes, either to unblock themselves or to
> > > verify that the bug fixes really work as intended in their environments.
> > > Today, in the case of Apache Hive, this is not very user friendly
> > > because it requires the end user to build the binaries directly from the
> > > Hive source code.
> > >
> > > I found that Apache Spark has a very useful infrastructure [1] which
> > > deploys nightly snapshots [2] [3] from the branch using github actions.
> > > This is super useful for any user who wants to try out the latest and
> > > greatest using the nightly builds.
> > >
> > > I was wondering if we should also adopt this. We can use github actions
> > to
> > > upload the snapshot jars to the public repository (e.g github packages)
> > and
> > > schedule it as a nightly job.
> > >
> > > [1] https://issues.apache.org/jira/browse/INFRA-21167
> > > [2]
> https://github.com/apache/spark/pkgs/container/apache-spark-ci-image
> > > [3] https://github.com/apache/spark/pull/30623
> > >
> > > I can take a stab at this if the community thinks that this is a nice
> > thing
> > > to have.
> > >
> > > Thanks,
> > > Vihang
> > >
> >
>


Re: Request to join Hive slack channel

2023-05-18 Thread Sungwoo Park
I am sorry for spamming -- My email address is: glap...@gmail.com

Thanks,
--- Sungwoo Park

On Fri, May 19, 2023 at 3:11 PM Sungwoo Park  wrote:

> If non-committers can join the slack channel, I would like to join, too.
> An invitation will be appreciated very much (glapa...@gmail.com).
>
> Thanks,
>
> --- Sungwoo Park
>
>
> On Fri, May 19, 2023 at 2:49 PM Butao Zhang  wrote:
>
>> Hi, Hive dev
>>
>>
>> I just saw this updated page:
>> https://cwiki.apache.org/confluence/display/Hive/HowToCommit. It seems
>> individuals can request to join the Slack channel.
>> If that is possible, I want to join the Slack; please send me an
>> invitation. Thanks.
>>
>>
>> My Gmail address:  butaozha...@gmail.com
>>
>>
>>
>> Thanks,
>>
>> Butao Zhang
>
>


Re: Request to join Hive slack channel

2023-05-18 Thread Sungwoo Park
If non-committers can join the slack channel, I would like to join, too. An
invitation will be appreciated very much (glapa...@gmail.com).

Thanks,

--- Sungwoo Park


On Fri, May 19, 2023 at 2:49 PM Butao Zhang  wrote:

> Hi, Hive dev
>
>
> I just saw this updated page:
> https://cwiki.apache.org/confluence/display/Hive/HowToCommit. It seems
> individuals can request to join the Slack channel.
> If that is possible, I want to join the Slack; please send me an
> invitation. Thanks.
>
>
> My Gmail address:  butaozha...@gmail.com
>
>
>
> Thanks,
>
> Butao Zhang


Re: Can we get someone to review the PR for HIVE-24915?

2023-05-12 Thread Sungwoo Park
Hi,

HIVE-25170 fixes the same bug as in your pull request.

Thanks,

--- Sungwoo

On Fri, May 12, 2023 at 4:04 PM Suprith Chandrashekharachar <
suprith.chandrashekharac...@treasure-data.com> wrote:

> Hi,
>
> I opened this ticket about 2 years ago hoping to get a review. I didn't
> hear any feedback for some time, and later it fell off my radar as I started
> working on other projects. Recently, I wanted to open another issue and I
> bumped into the old ticket that I had created
> https://issues.apache.org/jira/browse/HIVE-24915. It would be great if I
> could get someone to review it.
>
> Thanks,
> Suprith
>


Re: Introducing a DI framework in Hive?

2023-04-13 Thread Sungwoo Park
I would like to add another question to the list of Laszlo.

4) When a specific DI framework is chosen, what kinds of new dependencies
will be introduced? (Are they conflicting with existing dependencies of
Hive?)

Regards,

--- Sungwoo Park


On Thu, Apr 13, 2023 at 4:43 PM László Bodor 
wrote:

> Thanks, guys for putting DI into scope, sounds very interesting, just a
> couple of questions to help me understand and move this forward (and maybe
> involve more folks with DI experience):
>
> 1) Can we have some examples, even at dummy code-snippet level, about
> what we want to achieve? I mean, "utility classes with static methods are
> bad" is not an example, even if I agree to a certain extent.
> 2) Yes, DI helps with testing, but the question is, whether injecting will
> happen only in tests or in production parts as well.
> 3) What's the primary thing/object in your mind when it comes to injecting
> something in the scope of Hive?
>
> TLDR: I remember an earlier experience with Spring when
> it @InjectedWhateverIWantedWithAwesomeAnnotations, that's what I need to
> see examples for in case of hive.
>
> Regards,
> Laszlo Bodor
>
>
>
> Stamatis Zampetakis  ezt írta (időpont: 2023. ápr. 13.,
> Cs, 9:33):
>
> > Just to be clear, I am in favor of introducing DI frameworks in Hive
> > where it makes sense. As Attila said, we don't want to get stuck with
> > legacy code forever. When a concrete proposal comes up we can discuss
> > benefits vs drawbacks.
> >
> > Regarding stability I agree it is a pressing issue but Hive is an open
> > source project and we certainly don't want to force volunteers to work
> > on specific things or forbid them to work on others. Contributing to
> > open source is supposed to be a fun and rewarding experience. I am
> > sure many of the people in this list have stability as a primary goal
> > so eventually we will get there.
> >
> > Best,
> > Stamatis
> >
>


Re: Introducing a DI framework in Hive?

2023-04-13 Thread Sungwoo Park
Hi  Stamatis,

For the correctness issue, we wanted to solve the problem ourselves and
have made a few pull requests in [1] so far. (We would like to kindly
request Hive committers to review the pull requests.) For HIVE-27226, we
are working on a solution and will create a pull request when a solution is
ready. For the stability issue, we have not made much progress, but when
initial results become available, I will report to this mailing list.

Regards,

--- Sungwoo


On Thu, Apr 13, 2023 at 4:33 PM Stamatis Zampetakis 
wrote:

> Just to be clear, I am in favor of introducing DI frameworks in Hive
> where it makes sense. As Attila said, we don't want to get stuck with
> legacy code forever. When a concrete proposal comes up we can discuss
> benefits vs drawbacks.
>
> Regarding stability I agree it is a pressing issue but Hive is an open
> source project and we certainly don't want to force volunteers to work
> on specific things or forbid them to work on others. Contributing to
> open source is supposed to be a fun and rewarding experience. I am
> sure many of the people in this list have stability as a primary goal
> so eventually we will get there.
>
> Best,
> Stamatis
>


Re: Introducing a DI framework in Hive?

2023-04-12 Thread Sungwoo Park
Hello,

I am not a committer, but I would like to add my opinion. At this stage of
development, I think it is quite risky to switch to a DI framework for a
couple of reasons.

1. A DI framework would have been a powerful tool if it had been
incorporated into the project from the early stage. Now, however, Hive has
way over 1 million lines of code and tens of thousands of test cases, and my
guess is that the overhead associated with introducing DI into Hive
(whether gradually or globally at once) is very likely to outweigh the
additional benefit, if any, of introducing DI, especially if we consider
the stability of its development infrastructure.

2. Implementing new features, such as DI, in Hive can be an exciting
sub-project and fun, but I think more pressing issues are to stabilize the
current Hive code, although this is certainly less motivating and more
boring. I hope that no new major features, such as DI, will be introduced
until Hive becomes, say, as stable as Hive 3.1.

For 2, I can give a few examples to substantiate my claim.

1) For the past few years, several new techniques for query compilation
have been introduced. Unfortunately they were buggy and Hive started to
return wrong results, on the assumption that Hive 3.1.2 was working
correctly. (Yes, Hive 3.1.2 also has correctness bugs, but when tested
against TPC-DS, Hive 3.1.2 returned the same results as other frameworks,
so it can be used as a basis for comparison.) From our own testing, Hive
4.0.0-SNAPSHOT returns wrong results on several queries in TPC-DS, and this
should be a major setback for Hive. If interested, please see [1] and [2].

2) Perhaps due to the same reason as in 1), Hive 4.0.0-SNAPSHOT is
noticeably slower than Hive 3.1.2 on the TPC-DS benchmark. However, this is
only from my own testing (using 10TB TPC-DS), and I hope that someone in
the Hive team will try similar experiments to confirm/refute my claim.

3) Currently many q tests are run against MapReduce (which is not
officially supported as far as I remember). However, some of these q tests
fail when run against Tez. If Tez and LLAP are the new execution engines,
these tests should be migrated as well.

Sungwoo Park

[1] https://issues.apache.org/jira/browse/HIVE-26654
[2] https://issues.apache.org/jira/browse/HIVE-27226

On Wed, Apr 12, 2023 at 10:12 PM Stamatis Zampetakis 
wrote:

> Hey Laszlo,
>
> Dependency injection is a very powerful and useful tool/design pattern.
>
> I don't think there is a particular reason for which Hive does not use
> DI framework apart maybe from the fact that we have lots of legacy
> code that existed before DI became that popular.
>
> I am open to ideas and suggestions about parts of the code that we
> could improve via DI. I would probably avoid big refactorings to core
> components of Hive for the sake of introducing a DI framework but I
> see no big issue using such frameworks in new code. As usual when we
> are about to introduce a new dependency to the project we should be
> mindful of all the implications that this might have.
>
> It's hard to make a generally applicable claim that we should use this
> or that framework since I guess it has to do a lot with personal
> preferences; we tend to prefer things that we have already used. I
> haven't used DI frameworks that much so don't have a strong opinion on
> which framework is the best so I am willing to follow the majority.
>
> Best,
> Stamatis
>
> On Tue, Apr 4, 2023 at 1:19 PM Laszlo Vegh 
> wrote:
> >
> >
> > Hi all,
> >
> > I would like to start a conversation about introducing some Dependency
> Injection framework (like Spring, Guice, Weld, etc.) in Hive.
> >
> > IMHO the lack of such a framework makes the codebase way less organised
> and harder to maintain. Moreover, I think it has also led to introducing a
> huge number of static/utility methods and classes (which is highly
> discouraged when using DI frameworks). When there is no DI framework,
> utility classes with static methods often seem to be the simplest and best
> way to share code across different Hive components/classes, but these
> constructs are really killing testability. For example, it is much harder to
> mock static method calls than to mock service/component instances. Poor
> testability is a major issue on its own, but having a DI framework could
> have many more benefits, like greater flexibility (modularity), better
> organised services, etc.
> >
> >
> > I'm interested if there's any reason why there is no DI in Hive so far.
> I know there's no way to introduce it everywhere in a single step, but we
> could start using it where it is easy to start, and continuously expand its
> usage from class to class. If there is no strong reason why not to do it, I
> would like to start an open conversation around this topic.
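
The testability argument in this thread can be illustrated with a small Java
sketch. The class names are hypothetical, purely for illustration, and are
not actual Hive code:

```java
// Hypothetical illustration of the testability point, not Hive code.

// A static utility is hard to replace in tests: any caller is pinned to
// the real system clock.
final class ClockUtil {
    static long now() { return System.currentTimeMillis(); }
}

// With constructor injection, the collaborator can be swapped for a stub.
interface Clock {
    long now();
}

class SessionTimeout {
    private final Clock clock;
    private final long startMillis;
    private final long limitMillis;

    SessionTimeout(Clock clock, long limitMillis) {
        this.clock = clock;
        this.startMillis = clock.now();
        this.limitMillis = limitMillis;
    }

    // Expired once more than limitMillis has elapsed since construction.
    boolean expired() {
        return clock.now() - startMillis > limitMillis;
    }
}

public class Demo {
    public static void main(String[] args) {
        // In a test, inject a fake clock instead of the real one.
        long[] fakeTime = {0L};
        Clock fake = () -> fakeTime[0];
        SessionTimeout t = new SessionTimeout(fake, 1000L);
        fakeTime[0] = 500L;
        System.out.println(t.expired());  // false
        fakeTime[0] = 2000L;
        System.out.println(t.expired());  // true
    }
}
```

A DI framework automates the wiring of such constructor-injected
collaborators; the testability benefit comes from the injection itself.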

Re: [DISCUSS] Move Jira notification emails out of dev@hive

2023-03-26 Thread Sungwoo Park
I like the proposal very much. (Then, hopefully this mailing list will 
be useful to outside contributors as well.)


--- Sungwoo Park

On Sat, 25 Mar 2023, Stamatis Zampetakis wrote:


Hi everyone,

In the last Hive board report someone mentioned that the volume of Jira
notification emails to the dev list is huge especially when compared to
emails sent by actual humans, making it hard for someone to follow what's
happening in the project.

I personally share their viewpoint. For a long time I have been relying on
client side (Gmail) filters to separate Jira notifications from other
emails to the dev list.

I think it would be better to direct the traffic from jira to a separate
list namely jira@hive to keep the dev@hive list clean and dedicated to
human interaction.

What do you think?

Best,
Stamatis



Re: [DISCUSS] HIVE 4.0 GA Release Proposal

2023-03-22 Thread Sungwoo Park
For correctness, we can merge a few pull requests and change the default values 
of a few configuration parameters, so that we can get the correct results for 
the TPC-DS benchmark.


Another issue is a performance regression when compared with Hive 3.1. I ran the 
TPC-DS benchmark using a scale factor of 10TB. Our internal testing shows that 
the current snapshot of Hive 4 is 1.5 times slower than Hive 3.1. Here is a 
summary of our internal testing on a cluster with 13 nodes, each with 256GB 
memory and 6 SSDs.


Systems compared:

1. Trino 417 (using Java 11)
2. Hive 3.1 (a fork maintained by us)
3. Hive 4.0.0-SNAPSHOT (as of February 2023)

Results:

1. Trino 417
total execution time = 9633 seconds, geometric mean = 28.19 seconds
query 21 returns wrong results.
query 23 returns wrong results.
query 72 fails (with query.max-memory = 1440GB)

2. Hive 3.1
total execution time = 9900 seconds, geometric mean = 31.67 seconds 
All the 99 queries return correct results.


3. Hive 4.0.0-SNAPSHOT
total execution time = 10584 seconds, geometric mean = 43.72 seconds
All the 99 queries return correct results.

Around the summer of 2020, Hive 4.0.0-SNAPSHOT was noticeably faster than Hive 
3.1, although a few queries returned wrong results.
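As a side note on the metrics reported above, the geometric mean weights every query equally regardless of its absolute runtime, which is why it can move differently from the total execution time. A minimal Java sketch (with illustrative numbers, not the benchmark data):

```java
import java.util.Arrays;

public class GeoMean {
    // Geometric mean via the mean of logarithms, which avoids overflow
    // when multiplying many query times together.
    static double geometricMean(double[] secs) {
        double logSum = Arrays.stream(secs).map(Math::log).sum();
        return Math.exp(logSum / secs.length);
    }

    public static void main(String[] args) {
        // Illustrative per-query times in seconds (not the TPC-DS results).
        double[] times = {2.0, 8.0};
        double g = geometricMean(times);
        // geometric mean of 2 and 8 is sqrt(2 * 8) = 4
        if (Math.abs(g - 4.0) > 1e-9) {
            throw new AssertionError("expected 4.0, got " + g);
        }
        System.out.println(g);
    }
}
```

Because short queries count as much as long ones, a regression concentrated in many short queries shows up more strongly in the geometric mean than in the total.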


I am not sure how to fix the performance regression. Git bisecting is not a 
practical option because 1) until last year, building 4.0.0-SNAPSHOT was not 
smooth because of the Tez dependency; 2) loading 10TB of TPC-DS data for each 
commit is too much of an overhead.


I am thinking about comparing DAG plans from Hive 3.1 and 4.0.0-SNAPSHOT for 
those queries that exhibit performance regression. If you have any suggestion, 
please let me know.


--- Sungwoo

On Tue, 21 Mar 2023, Stamatis Zampetakis wrote:


Many thanks for running tests with 4.0.0 Sungwoo; it is invaluable
help for getting out a stable Hive 4.

I will review https://issues.apache.org/jira/browse/HIVE-26968 in the
coming weeks; I have assigned myself as reviewer in the PR.

Can some other people (committers or not) help in reviewing the
remaining TPC-DS blockers for which we have a PR?

Reminder: Good non-binding reviews are important and much appreciated
by the community. They are also among the important metrics for
becoming a Hive committer/PMC [1].

Best,
Stamatis

[1] https://cwiki.apache.org/confluence/display/Hive/BecomingACommitter

On Tue, Mar 14, 2023 at 12:07 PM Sungwoo Park  wrote:


Hello,

I would like to expand the list of blockers with HIVE-27138 [1] which fixes NPE
on mapjoin_filter_on_outerjoin.q.

Currently mapjoin_filter_on_outerjoin.q is tested with MapReduce execution
engine and shows no problem. However, it shows a few problems when tested with
Tez execution engine. HIVE-27138 is the first fix found after analyzing
mapjoin_filter_on_outerjoin.q, and Seonggon will create a couple more tickets
later.

In the meantime, it would be great if someone could review pull requests for
subtasks in HIVE-26654. (I moved to HIVE-26654 three tickets that I previously
requested code review for.)

Best,

--- Sungwoo
  [1] https://issues.apache.org/jira/browse/HIVE-27138

On Fri, 10 Mar 2023, Stamatis Zampetakis wrote:


Hi Kirti,

Thanks for bringing up this topic.

The master branch already has many new features; we don't need to wait for
more to cut a GA.

The main criterion for going GA is stability thus I would consider
regressions as the only blockers for the release.

If I recall well the only regressions discovered so far are some problems
with TPC-DS queries so basically HIVE-26654 [1].

I will let others chime in to include more tickets if necessary.

Best,
Stamatis

[1] https://issues.apache.org/jira/browse/HIVE-26654


On Wed, Mar 8, 2023 at 10:02 AM Kirti Ruge  wrote:


Hello Hive Dev,

It has been about 6 months since Hive-4.0-alpha-2 was released in Nov 2022.
Would it be a good time to discuss a Hive 4.0 GA release with the
community? Can we have a discussion on the new features and JDK support
versions we want to publish as part of 4.0 GA, and the timeframe of the release?


Thanks,
Kirti






Re: [DISCUSS] HIVE 4.0 GA Release Proposal

2023-03-14 Thread Sungwoo Park

Hello,

I would like to expand the list of blockers with HIVE-27138 [1] which fixes NPE 
on mapjoin_filter_on_outerjoin.q.


Currently mapjoin_filter_on_outerjoin.q is tested with MapReduce execution 
engine and shows no problem. However, it shows a few problems when tested with 
Tez execution engine. HIVE-27138 is the first fix found after analyzing 
mapjoin_filter_on_outerjoin.q, and Seonggon will create a couple more tickets 
later.


In the meantime, it would be great if someone could review pull requests for 
subtasks in HIVE-26654. (I moved to HIVE-26654 three tickets that I previously 
requested code review for.)


Best,

--- Sungwoo
 [1] https://issues.apache.org/jira/browse/HIVE-27138

On Fri, 10 Mar 2023, Stamatis Zampetakis wrote:


Hi Kirti,

Thanks for bringing up this topic.

The master branch already has many new features; we don't need to wait for
more to cut a GA.

The main criterion for going GA is stability thus I would consider
regressions as the only blockers for the release.

If I recall well the only regressions discovered so far are some problems
with TPC-DS queries so basically HIVE-26654 [1].

I will let others chime in to include more tickets if necessary.

Best,
Stamatis

[1] https://issues.apache.org/jira/browse/HIVE-26654


On Wed, Mar 8, 2023 at 10:02 AM Kirti Ruge  wrote:


Hello Hive Dev,

It has been about 6 months since Hive-4.0-alpha-2 was released in Nov 2022.
Would it be a good time to discuss a Hive 4.0 GA release with the
community? Can we have a discussion on the new features and JDK support
versions we want to publish as part of 4.0 GA, and the timeframe of the release?


Thanks,
Kirti




[jira] [Created] (HIVE-27134) SharedWorkOptimizer merges TableScan operators that have different DPP parents

2023-03-11 Thread Sungwoo Park (Jira)
Sungwoo Park created HIVE-27134:
---

 Summary: SharedWorkOptimizer merges TableScan operators that have 
different DPP parents
 Key: HIVE-27134
 URL: https://issues.apache.org/jira/browse/HIVE-27134
 Project: Hive
  Issue Type: Sub-task
Reporter: Sungwoo Park






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Asking for code review: HIVE-26968, HIVE-26986, HIVE-27006

2023-02-14 Thread Sungwoo Park

Hello Alessandro,

Thank you for the comment. HIVE-27006 makes sense only after HIVE-26986 is 
merged, so the failing tests can be taken care of later. HIVE-27006 does not 
affect the results of TPC-DS queries, so Seonggon (who created the JIRAs) can 
focus on HIVE-26968 first.


To reproduce the bugs, one should build TPC-DS datasets (using Iceberg 
for HIVE-26968) and execute queries. Checking the output plan using a Metastore 
image is not enough.


Best regards,

--- Sungwoo

On Tue, 14 Feb 2023, Alessandro Solimando wrote:


Hi Sungwoo,
thanks for bringing this up, IMO correctness issues should be set to
"Blocker" level in Jira, therefore no 4.0.0 should be released before
fixing the aforementioned tickets.

The patches seem well thought out and solid from a cursory look, but they
fall outside my area of expertise. I don't have time right now to review
them because I would first need to understand the Shared Work Optimizer,
which is non-trivial.

I have nonetheless approved the blocked workflows (for first time
contributors some need a committer to run them), I have also noticed
that HIVE-27006 has failing tests, so in the meantime those failures could
be addressed.

Another action that will probably get you closer to having the PRs in is to
address (some of) the code smells/issues that Sonar has identified (from a
cursory look there were some unused imports etc.): the neater the PR, the
less time a reviewer will need, and the higher the chances it gets
reviewed.

Best regards,
Alessandro

On Tue, 14 Feb 2023 at 15:06, Sungwoo Park  wrote:


Seonggon created three JIRAs a while ago which affect the result of TPC-DS
queries,
and I wonder if anyone would have time for reviewing the pull requests.

HIVE-26968: SharedWorkOptimizer merges TableScan operators that have
different DPP parents
HIVE-26986: A DAG created by OperatorGraph is not equal to the Tez DAG.
HIVE-27006: ParallelEdgeFixer inserts misconfigured operator and does not
connect it in Tez DAG

In the current build, TPC-DS query 64 returns wrong results (no rows) on
Iceberg tables.
This is fixed in HIVE-26968.

TPC-DS query 71 fails with an error ("cannot find _col0 from []").
This is fixed in HIVE-26986.

HIVE-27006 fixes a bug which we found while testing with TPC-DS queries.
(It depends on HIVE-26986.)

I hope these JIRAs are merged to the master branch before the release of
Hive 4.0.0.
Considering the maturity of Hive and the impending release of Hive 4.0.0,
it does not seem like a good plan to release Hive 4.0.0 that fails on some
TPC-DS queries.

Thanks!

Sungwoo Park





Asking for code review: HIVE-26968, HIVE-26986, HIVE-27006

2023-02-14 Thread Sungwoo Park
Seonggon created three JIRAs a while ago which affect the result of TPC-DS queries, 
and I wonder if anyone would have time for reviewing the pull requests.


HIVE-26968: SharedWorkOptimizer merges TableScan operators that have different 
DPP parents
HIVE-26986: A DAG created by OperatorGraph is not equal to the Tez DAG.
HIVE-27006: ParallelEdgeFixer inserts misconfigured operator and does not 
connect it in Tez DAG

In the current build, TPC-DS query 64 returns wrong results (no rows) on 
Iceberg tables.
This is fixed in HIVE-26968.

TPC-DS query 71 fails with an error ("cannot find _col0 from []").
This is fixed in HIVE-26986.

HIVE-27006 fixes a bug which we found while testing with TPC-DS queries.
(It depends on HIVE-26986.)

I hope these JIRAs are merged to the master branch before the release of Hive 
4.0.0.
Considering the maturity of Hive and the impending release of Hive 4.0.0,
it does not seem like a good plan to release Hive 4.0.0 that fails on some 
TPC-DS queries.

Thanks!

Sungwoo Park


[jira] [Created] (HIVE-27082) AggregateStatsCache.findBestMatch() in Metastore should test the inclusion of default partition name

2023-02-14 Thread Sungwoo Park (Jira)
Sungwoo Park created HIVE-27082:
---

 Summary: AggregateStatsCache.findBestMatch() in Metastore should 
test the inclusion of default partition name
 Key: HIVE-27082
 URL: https://issues.apache.org/jira/browse/HIVE-27082
 Project: Hive
  Issue Type: Improvement
  Components: Standalone Metastore
Affects Versions: 4.0.0-alpha-2, 3.1.3
Reporter: Sungwoo Park
Assignee: Sungwoo Park


This pull request deals with the non-deterministic behavior of Hive in 
generating DAGs. From the discussion thread:

The non-deterministic behavior of Hive in generating DAGs is due to the logic 
in AggregateStatsCache.findBestMatch() called from AggregateStatsCache.get(), 
as well as the disproportionate distribution of nulls in HIVE_DEFAULT_PARTITION.

Here is what is happening in the case of the TPC-DS dataset. Let us use 
web_sales table and ws_web_site_sk column in the 10TB TPC-DS dataset as a 
running example.

In the course of running TPC-DS queries, Hive asks MetaStore about the column 
statistics of 1823 partNames in the web_sales/ws_web_site_sk combination, 
either without HIVE_DEFAULT_PARTITION or with HIVE_DEFAULT_PARTITION.

--- Without HIVE_DEFAULT_PARTITION, it reports a total of 901180 nulls.

--- With HIVE_DEFAULT_PARTITION, however, it reports a total of 1800087 nulls, 
almost twice as many.

The first call to MetaStore returns the correct result, but all subsequent 
requests are likely to return the same result from the cache, irrespective of 
the inclusion of HIVE_DEFAULT_PARTITION. This is because 
AggregateStatsCache.findBestMatch() treats HIVE_DEFAULT_PARTITION in the same 
way as other partNames, and the difference in the size of partNames[] is just 
1. The outcome depends on the duration of intervening queries, so everything is 
now non-deterministic.

If a wrong value of numNulls is returned, Hive generates a different DAG which 
may take much longer than the correct one. The problem is particularly 
pronounced here because of the huge number of nulls in HIVE_DEFAULT_PARTITION. 
It is ironic that the query optimizer is so efficient that a single wrong 
guess of numNulls creates a very inefficient DAG.

Note that this behavior cannot be avoided by setting 
hive.metastore.aggregate.stats.cache.max.variance to zero because the 
difference in the number of partNames[] between the argument and the entry in 
the cache is just 1.

So, AggregateStatsCache.findBestMatch() should treat HIVE_DEFAULT_PARTITION in 
a special way, by not returning the result in the cache if there is a 
difference in the inclusion of the partName HIVE_DEFAULT_PARTITION (or should 
provide the user with an option to activate this feature).
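The proposed guard can be sketched in a few lines of Java. This is an illustrative, self-contained example with hypothetical names; the real AggregateStatsCache matching logic is more involved. The idea is simply that a cached entry is only a candidate match when the request and the entry agree on whether the default partition is included, regardless of how small the difference in partition-name counts is.

```java
import java.util.Arrays;
import java.util.List;

public class DefaultPartitionGuard {
    static final String DEFAULT_PARTITION = "__HIVE_DEFAULT_PARTITION__";

    // Hypothetical check: the requested and cached partName lists must agree
    // on the inclusion of the default partition before the cached statistics
    // may be reused.
    static boolean sameDefaultPartitionInclusion(List<String> requested,
                                                 List<String> cached) {
        return requested.contains(DEFAULT_PARTITION)
            == cached.contains(DEFAULT_PARTITION);
    }

    public static void main(String[] args) {
        List<String> withDefault = Arrays.asList("p=1", "p=2", DEFAULT_PARTITION);
        List<String> withoutDefault = Arrays.asList("p=1", "p=2");

        // A one-partition difference that flips the inclusion must not match,
        // even though the size difference is just 1.
        if (sameDefaultPartitionInclusion(withDefault, withoutDefault)) {
            throw new AssertionError("entries differing on the default partition matched");
        }
        if (!sameDefaultPartitionInclusion(withDefault, withDefault)) {
            throw new AssertionError("identical inclusion should match");
        }
        System.out.println("ok");
    }
}
```

Such a check would sidestep the variance-threshold problem described above, since it does not depend on the relative difference in the number of partNames.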




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Result of the TPC-DS benchmark using Iceberg,

2022-11-28 Thread Sungwoo Park
For query 22, iceberg.mr.split.size affects the number of mappers. With the 
default value of 128MB, Hive creates far fewer mappers than it does on ORC 
tables.


For query 64, it is due to a bug in shared work optimization. Setting 
hive.optimize.shared.work.extended to false produces correct results for query 
64.


Because of several bugs in shared work optimization (and parallel edge fixer), 
it might make sense to set the default value of 
hive.optimize.shared.work to false in HiveConf.java.


--- Sungwoo

On Fri, 18 Nov 2022, Sungwoo Park wrote:


Hello Stamatis,

We use a recent or the latest commit in the master branch and run Hive on Tez 
0.10.2.


For query 22, the slow execution seems to be related to the split size used 
in IcebergInputFormat.getSplits(). We will try to create a JIRA when we make 
more progress.


For query 64, the result is wrong (returning 0 rows) on 1TB TPC-DS, but there 
is a separate report that the result is correct on 100GB TPC-DS. Not sure why 
this happens, so we are going to run more experiments.


Best,

Sungwoo

On Thu, 17 Nov 2022, Stamatis Zampetakis wrote:


Hi Sungwoo,

Many thanks for sharing your findings; interesting observations.

If you can please also share the project versions that you used for running
the experiments.

Best,
Stamatis

On Tue, Nov 15, 2022 at 12:46 PM Sungwoo Park  wrote:


Hello,

I ran the TPC-DS benchmark using Metastore (in the traditional way) and
Iceberg,
and would like to share the result for those interested in Hive using
Iceberg.
The experiment used 1TB TPC-DS dataset stored as ORC.

Here are a few findings.

1. Overall, Hive-Iceberg runs slightly faster than Hive-Metastore.

2. Some queries run much faster with Hive-Iceberg. Examples)
query 14-1) Hive-Metastore: 61 seconds, Hive-Iceberg: 28 seconds
query 78) Hive-Metastore: 141 seconds, Hive-Iceberg: 58 seconds

3. Some queries run much slower with Hive-Iceberg. Example)
query 22: Hive-Metastore: 32 seconds, Hive-Iceberg: 356 seconds
(The slow execution is due to InputInitializer generating only 4 tasks for
the
first Map vertex.)

4. Out of 99 queries, 98 queries return correct results, but query 64
returns
wrong results (returning 0 rows) due to an exception:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:

hdfs://blue0:8020/tmp/hive/user/35d3bdd7-4fda-4f3d-818d-048ad6242072/hive_2022-11-14_15-26-21_045_8992557056967167667-1/-mr-10001/.hive-staging_hive_2022-11-14_15-26-21_045_8992557056967167667-1/-ext-10002

--- Sungwoo










Re: Result of the TPC-DS benchmark using Iceberg,

2022-11-18 Thread Sungwoo Park

Hello Stamatis,

We use a recent or the latest commit in the master branch and run Hive on Tez 
0.10.2.


For query 22, the slow execution seems to be related to the split size used in 
IcebergInputFormat.getSplits(). We will try to create a JIRA when we make more 
progress.


For query 64, the result is wrong (returning 0 rows) on 1TB TPC-DS, but there 
is a separate report that the result is correct on 100GB TPC-DS. Not sure why 
this happens, so we are going to run more experiments.


Best,

Sungwoo

On Thu, 17 Nov 2022, Stamatis Zampetakis wrote:


Hi Sungwoo,

Many thanks for sharing your findings; interesting observations.

If you can please also share the project versions that you used for running
the experiments.

Best,
Stamatis

On Tue, Nov 15, 2022 at 12:46 PM Sungwoo Park  wrote:


Hello,

I ran the TPC-DS benchmark using Metastore (in the traditional way) and
Iceberg,
and would like to share the result for those interested in Hive using
Iceberg.
The experiment used 1TB TPC-DS dataset stored as ORC.

Here are a few findings.

1. Overall, Hive-Iceberg runs slightly faster than Hive-Metastore.

2. Some queries run much faster with Hive-Iceberg. Examples)
query 14-1) Hive-Metastore: 61 seconds, Hive-Iceberg: 28 seconds
query 78) Hive-Metastore: 141 seconds, Hive-Iceberg: 58 seconds

3. Some queries run much slower with Hive-Iceberg. Example)
query 22: Hive-Metastore: 32 seconds, Hive-Iceberg: 356 seconds
(The slow execution is due to InputInitializer generating only 4 tasks for
the
first Map vertex.)

4. Out of 99 queries, 98 queries return correct results, but query 64
returns
wrong results (returning 0 rows) due to an exception:

org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:

hdfs://blue0:8020/tmp/hive/user/35d3bdd7-4fda-4f3d-818d-048ad6242072/hive_2022-11-14_15-26-21_045_8992557056967167667-1/-mr-10001/.hive-staging_hive_2022-11-14_15-26-21_045_8992557056967167667-1/-ext-10002

--- Sungwoo








Result of the TPC-DS benchmark using Iceberg,

2022-11-15 Thread Sungwoo Park

Hello,

I ran the TPC-DS benchmark using Metastore (in the traditional way) and Iceberg, 
and would like to share the result for those interested in Hive using Iceberg. 
The experiment used 1TB TPC-DS dataset stored as ORC.


Here are a few findings.

1. Overall, Hive-Iceberg runs slightly faster than Hive-Metastore.

2. Some queries run much faster with Hive-Iceberg. Examples)
query 14-1) Hive-Metastore: 61 seconds, Hive-Iceberg: 28 seconds
query 78) Hive-Metastore: 141 seconds, Hive-Iceberg: 58 seconds

3. Some queries run much slower with Hive-Iceberg. Example)
query 22: Hive-Metastore: 32 seconds, Hive-Iceberg: 356 seconds
(The slow execution is due to InputInitializer generating only 4 tasks for the 
first Map vertex.)


4. Out of 99 queries, 98 queries return correct results, but query 64 returns 
wrong results (returning 0 rows) due to an exception:


org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: 
hdfs://blue0:8020/tmp/hive/user/35d3bdd7-4fda-4f3d-818d-048ad6242072/hive_2022-11-14_15-26-21_045_8992557056967167667-1/-mr-10001/.hive-staging_hive_2022-11-14_15-26-21_045_8992557056967167667-1/-ext-10002


--- Sungwoo





[jira] [Created] (HIVE-26732) Iceberg uses "null" and does not use the configuration key "hive.exec.default.partition.name" for default partitions.

2022-11-13 Thread Sungwoo Park (Jira)
Sungwoo Park created HIVE-26732:
---

 Summary: Iceberg uses "null" and does not use the configuration 
key "hive.exec.default.partition.name" for default partitions.
 Key: HIVE-26732
 URL: https://issues.apache.org/jira/browse/HIVE-26732
 Project: Hive
  Issue Type: Bug
  Components: Iceberg integration
Affects Versions: 4.0.0-alpha-1
        Reporter: Sungwoo Park


When creating an Iceberg table from an existing ORC table with "insert 
overwrite", the directory corresponding to the default partition uses "null" 
instead of the value for the configuration key 
"hive.exec.default.partition.name".

For example, we create an Iceberg table from an existing ORC table 
tpcds_bin_partitioned_orc_1000.catalog_sales:
{code:java}
create table catalog_sales (
  cs_sold_time_sk bigint, cs_ship_date_sk bigint, cs_bill_customer_sk bigint,
  cs_bill_cdemo_sk bigint, cs_bill_hdemo_sk bigint, cs_bill_addr_sk bigint,
  cs_ship_customer_sk bigint, cs_ship_cdemo_sk bigint, cs_ship_hdemo_sk bigint,
  cs_ship_addr_sk bigint, cs_call_center_sk bigint, cs_catalog_page_sk bigint,
  cs_ship_mode_sk bigint, cs_warehouse_sk bigint, cs_item_sk bigint,
  cs_promo_sk bigint, cs_order_number bigint, cs_quantity int,
  cs_wholesale_cost double, cs_list_price double, cs_sales_price double,
  cs_ext_discount_amt double, cs_ext_sales_price double,
  cs_ext_wholesale_cost double, cs_ext_list_price double, cs_ext_tax double,
  cs_coupon_amt double, cs_ext_ship_cost double, cs_net_paid double,
  cs_net_paid_inc_tax double, cs_net_paid_inc_ship double,
  cs_net_paid_inc_ship_tax double, cs_net_profit double)
partitioned by (cs_sold_date_sk bigint) STORED BY ICEBERG stored as orc;
insert overwrite table catalog_sales select * from
tpcds_bin_partitioned_orc_1000.catalog_sales;
{code}
Iceberg creates a directory for the default partition like:

/hive/warehouse/tpcds_bin_partitioned_orc_1000_iceberg.db/catalog_sales/data/cs_sold_date_sk=null

which should be:

/hive/warehouse/tpcds_bin_partitioned_orc_1000_iceberg.db/catalog_sales/data/cs_sold_date_sk=__HIVE_DEFAULT_PARTITION__

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-26668) Upgrade ORC version to 1.6.11

2022-10-25 Thread Sungwoo Park (Jira)
Sungwoo Park created HIVE-26668:
---

 Summary: Upgrade ORC version to 1.6.11
 Key: HIVE-26668
 URL: https://issues.apache.org/jira/browse/HIVE-26668
 Project: Hive
  Issue Type: Bug
Reporter: Sungwoo Park


With ORC 1.6.9, setting hive.exec.orc.default.compress to ZSTD can generate 
IllegalStateException (e.g., when loading ORC tables). This is fixed in ORC-965.
{code:java}
Caused by: java.lang.IllegalStateException: Overflow detected
  at io.airlift.compress.zstd.Util.checkState(Util.java:59)
  at 
io.airlift.compress.zstd.BitOutputStream.close(BitOutputStream.java:85){code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-26660) TPC-DS query 71 returns wrong results

2022-10-22 Thread Sungwoo Park (Jira)
Sungwoo Park created HIVE-26660:
---

 Summary: TPC-DS query 71 returns wrong results
 Key: HIVE-26660
 URL: https://issues.apache.org/jira/browse/HIVE-26660
 Project: Hive
  Issue Type: Bug
Reporter: Sungwoo Park


TPC-DS query 71 returns wrong results when tested with 100GB dataset.

 

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-26659) TPC-DS query 16, 69, 94 return wrong results.

2022-10-22 Thread Sungwoo Park (Jira)
Sungwoo Park created HIVE-26659:
---

 Summary: TPC-DS query 16, 69, 94 return wrong results.
 Key: HIVE-26659
 URL: https://issues.apache.org/jira/browse/HIVE-26659
 Project: Hive
  Issue Type: Bug
Affects Versions: 4.0.0-alpha-2
Reporter: Sungwoo Park


TPC-DS query 16, 69, 94 return wrong results when hive.auto.convert.anti.join 
is set to true.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-26655) TPC-DS query 17 returns wrong results

2022-10-20 Thread Sungwoo Park (Jira)
Sungwoo Park created HIVE-26655:
---

 Summary: TPC-DS query 17 returns wrong results
 Key: HIVE-26655
 URL: https://issues.apache.org/jira/browse/HIVE-26655
 Project: Hive
  Issue Type: Bug
Reporter: Sungwoo Park
 Fix For: 4.0.0-alpha-2


When tested with 100GB ORC tables, the number of rows returned by query 17 is 
not stable. It returns fewer rows than the correct result (55 rows).

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HIVE-26654) Test with the TPC-DS benchmark

2022-10-20 Thread Sungwoo Park (Jira)
Sungwoo Park created HIVE-26654:
---

 Summary: Test with the TPC-DS benchmark 
 Key: HIVE-26654
 URL: https://issues.apache.org/jira/browse/HIVE-26654
 Project: Hive
  Issue Type: Bug
Affects Versions: 4.0.0-alpha-2
Reporter: Sungwoo Park


This Jira reports the result of running system tests using the TPC-DS 
benchmark. The test scenario is:

1) create a database consisting of external tables from a 100GB or 1TB TPC-DS 
text dataset
2) load a database consisting of ORC tables
3) compute column statistics
4) run TPC-DS queries
5) check the results for correctness

For step 5), we will compare the results against Hive 3 (which has been tested 
against SparkSQL and Presto). We use Hive on Tez.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Re: Start releasing the master branch

2022-03-01 Thread Sungwoo Park

Hello Alessandro,

For the latest commit, loading ORC tables fails (with the log message shown 
below). Let me try to find a commit that introduces this bug and create a JIRA 
ticket.


--- Sungwoo

2022-03-02 05:41:56,578 ERROR [Thread-73] exec.StatsTask: Failed to run stats 
task
java.io.IOException: org.apache.hadoop.mapred.InvalidInputException: Input path 
does not exist: 
hdfs://blue0:8020/tmp/hive/gitlab-runner/a236e1b4-b354-4343-b900-3d71b1bc7504/hive_2022-03-02_05-40-50_966_446574755576325031-1/-mr-1/.hive-staging_hive_2022-03-02_05-40-50_966_446574755576325031-1/-ext-10001
  at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:622)
  at 
org.apache.hadoop.hive.ql.stats.ColStatsProcessor.constructColumnStatsFromPackedRows(ColStatsProcessor.java:105)
  at 
org.apache.hadoop.hive.ql.stats.ColStatsProcessor.persistColumnStats(ColStatsProcessor.java:200)
  at 
org.apache.hadoop.hive.ql.stats.ColStatsProcessor.process(ColStatsProcessor.java:93)

  at org.apache.hadoop.hive.ql.exec.StatsTask.execute(StatsTask.java:107)
  at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:212)
  at 
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)

  at org.apache.hadoop.hive.ql.exec.TaskRunner.run(TaskRunner.java:83)
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not 
exist: 
hdfs://blue0:8020/tmp/hive/gitlab-runner/a236e1b4-b354-4343-b900-3d71b1bc7504/hive_2022-03-02_05-40-50_966_446574755576325031-1/-mr-1/.hive-staging_hive_2022-03-02_05-40-50_966_446574755576325031-1/-ext-10001
  at 
org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:294)
  at 
org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:236)
  at 
org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:45)
  at 
org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:322)
  at 
org.apache.hadoop.hive.ql.exec.FetchOperator.generateWrappedSplits(FetchOperator.java:435)
  at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextSplits(FetchOperator.java:402)
  at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:306)
  at 
org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:560)

  ... 7 more

On Tue, 1 Mar 2022, Alessandro Solimando wrote:


Hi Sungwoo,
last time I tried to run a TPC-DS-based benchmark I stumbled upon a similar
situation. I finally found that statistics were not computed, so CBO was
not kicking in, and the automatic retry goes with CBO off, which was failing
for something like 10 queries (subqueries cannot be decorrelated, but also
some runtime errors).

Making sure that (column) statistics were correctly computed fixed the
problem.

Can you check if this is the case for you?

HTH,
Alessandro

On Tue, 1 Mar 2022 at 15:28, POSTECH CT  wrote:


Hello Hive team,

I wonder if anyone in the Hive team has tried the TPC-DS benchmark on
the master branch recently.  We occasionally run TPC-DS system tests
using the master branch, and the tests don't succeed completely. Here
is how our TPC-DS tests proceed.

1. Compile and run Hive on Tez (not Hive-LLAP)
2. Load ORC tables from 1TB TPC-DS raw text data, and compute statistics
3. Run 99 TPC-DS queries which were slightly modified to return a
varying number of rows (rather than 100 rows)
4. Compare the results against the previous results

The previous results were obtained and cross-checked by running Hive
3.1.2 and SparkSQL 2.3/3.2, so we are fairly confident about their
correctness.

For the latest commit in the master branch, step 2 fails. For earlier
commits (for example, commits in February 2021), step 3 fails where
several queries either fail or return wrong results.

We can compile and report the test results in this mailing list, but
would like to know if similar results have been reproduced by the Hive
team, in order to make sure that we did not make errors in our tests.

If it is okay to open a JIRA ticket that only reports failures in the
TPC-DS test, we could also perform a git bisect to locate the commit
that began to generate wrong results.

--- Sungwoo Park

On Tue, 1 Mar 2022, Zoltan Haindrich wrote:


Hey,

Great to hear that we are on the same side regarding these things :)

For around a week now - we have nightly builds for the master branch:
http://ci.hive.apache.org/job/hive-nightly/12/

I think we have 1 blocker issue:
https://issues.apache.org/jira/browse/HIVE-25665

I know about one more thing I would rather get fixed before we release

it:

https://issues.apache.org/jira/browse/HIVE-25994
The best would be to introduce smoke tests (HIVE-22302) to ensure that
something like this will not happen in the future - but we should

probably

start moving forward.

I think we could call the first iteration of this as "4.0.0-alpha-1" :)

I've added 4.0.0-alpha-1 as a version - and added the above two tickets

Re: Result of the TPC-DS benchmark on Hive master branch

2020-11-17 Thread Sungwoo Park
>
>  > 1. With hive.optimize.shared.work.dppunion=true, query 2 and 59 fail.
> Please see the attachment for stack traces.
>
> Even though the exception seems to be a recurrence of the previous issue -
> existing checks + HIVE-24360 should have restricted all incorrect cases.
> I built in some debug stuff while I made these patches - and it would help
> a lot to get a peek into those; but they need to be enabled by
> hand/etc...while I polish that a
> bit more - could you please share an EXPLAIN FORMATTED about one of the
> queries failing because of that patch?
>

Please see the attachment for the result of EXPLAIN on query 12. (EXPLAIN
FORMATTED seems to have some problem.)  Hive tries to create two broadcast
edges from Reducer 8 to Map 6, thus raising an exception.

 > 2. Query 14 fails in both cases, and it seems like another bug. Note
> that hive.cbo.enable is set to true when running query 14.
>
> I think you will find some cbo exception in the hive logs - explaining why
> it resorts to the non-cbo path.
>

Indeed it raises RuntimeException:

20/11/17 13:04:22 ERROR parse.CalcitePlanner: CBO failed, skipping CBO.
java.lang.RuntimeException: equivalence mapping violation
  at
org.apache.hadoop.hive.ql.plan.mapper.PlanMapper.link(PlanMapper.java:220)

 Please see the attachment for the full stack trace.


>  > 3. For some queries, the number of rows is different between the two
> experiments. In most cases, it seems to be rounding errors, but the
> difference is rather large for
> some queries (e.g., query 29 and 58). Please see the attachment for the
> result.
>
> that's very odd - I've recently fixed a bug in SWO which may have caused
> issues like this (HIVE-24365); I would recommend comparing the results with
> the whole thing off (hive.optimize.shared.work=false).
> If you could isolate and reproduce this in a qtest I could also dig into it.


For now, let me report the result of testing HIVE-24366. Please see the
attachment for the result.

HIVE-24366 (e9f72e654750de208227d46a22e983413b080c6c, Thu Nov 12)

TEZ-4238 (22fec6c0ecc7ebe6f6f28800935cc6f69794dad5, Thu Oct 8)
guava.version=19.0 in pom.xml
hadoop.version=3.1.0 in pom.xml

TPC-DS 100GB ORC

hive.execution.engine=tez
hive.execution.mode=container, Tez containers are not reused across queries.
hive.cbo.enable=true
hive.query.reexecution.stats.persist.scope=metastore (default value)

1) hive.optimize.shared.work=false
2) hive.optimize.shared.work=true, hive.optimize.shared.work.dppunion=true
3) hive.optimize.shared.work=true, hive.optimize.shared.work.dppunion=false

For each case, the first column reports the execution time and the second
column reports the number of rows. If the number of rows is 1, it also
reports the sum of all values in the result.
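
Expressed as session settings, the three cases are (a sketch; the other
settings are as listed above):

```sql
-- Case 1: shared-work optimization fully off
set hive.optimize.shared.work=false;

-- Case 2: shared-work on, dppunion rewrite on
set hive.optimize.shared.work=true;
set hive.optimize.shared.work.dppunion=true;

-- Case 3: shared-work on, dppunion rewrite off
set hive.optimize.shared.work=true;
set hive.optimize.shared.work.dppunion=false;
```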

Cheers,

--- Sungwoo
1.

=== Query 2

EXPLAIN with wscs as
 (select sold_date_sk
        ,sales_price
  from (select ws_sold_date_sk sold_date_sk
              ,ws_ext_sales_price sales_price
        from web_sales) x
        union all
       (select cs_sold_date_sk sold_date_sk
              ,cs_ext_sales_price sales_price
        from catalog_sales)),
 wswscs as
 (select d_week_seq,
        sum(case when (d_day_name='Sunday') then sales_price else null end) sun_sales,
        sum(case when (d_day_name='Monday') then sales_price else null end) mon_sales,
        sum(case when (d_day_name='Tuesday') then sales_price else null end) tue_sales,
        sum(case when (d_day_name='Wednesday') then sales_price else null end) wed_sales,
        sum(case when (d_day_name='Thursday') then sales_price else null end) thu_sales,
        sum(case when (d_day_name='Friday') then sales_price else null end) fri_sales,
        sum(case when (d_day_name='Saturday') then sales_price else null end) sat_sales
  from wscs
      ,date_dim
  where d_date_sk = sold_date_sk
  group by d_week_seq)
 select d_week_seq1
       ,round(sun_sales1/sun_sales2,2)
       ,round(mon_sales1/mon_sales2,2)
       ,round(tue_sales1/tue_sales2,2)
       ,round(wed_sales1/wed_sales2,2)
       ,round(thu_sales1/thu_sales2,2)
       ,round(fri_sales1/fri_sales2,2)
       ,round(sat_sales1/sat_sales2,2)
 from
 (select wswscs.d_week_seq d_week_seq1
        ,sun_sales sun_sales1
        ,mon_sales mon_sales1
        ,tue_sales tue_sales1
        ,wed_sales wed_sales1
        ,thu_sales thu_sales1
        ,fri_sales fri_sales1
        ,sat_sales sat_sales1
  from wswscs,date_dim
  where date_dim.d_week_seq = wswscs.d_week_seq and
        d_year = 2001) y,
 (select wswscs.d_week_seq d_week_seq2
        ,sun_sales sun_sales2
        ,mon_sales mon_sales2
        ,tue_sales tue_sales2
        ,wed_sales wed_sales2
        ,thu_sales thu_sales2
        ,fri_sales fri_sales2
        ,sat_sales sat_sales2
  from wswscs
      ,date_dim
  where date_dim.d_week_seq = wswscs.d_week_seq and
        d_year = 2001+1) z
 where d_week_seq1=d_week_seq2-53
 order by d_week_seq1;

=== Output


Re: Result of the TPC-DS benchmark on Hive master branch

2020-11-13 Thread Sungwoo Park
Hi Zoltan,

I have run another fresh TPC-DS test using the latest commit. Here is the
summary:

Commits used:

1) Hive, master, e9f72e654750de208227d46a22e983413b080c6c (HIVE-24366, Thu
Nov 12)
2) Tez, 0.10.0, 22fec6c0ecc7ebe6f6f28800935cc6f69794dad5 (CHANGES.txt
updated with TEZ-4238, Thu Oct 8)

Scenario:

1) create a database consisting of external tables from a 100GB TPC-DS text
dataset
2) create a database consisting of ORC tables
3) compute column statistics, set tez.runtime.compress=false
4) run TPC-DS queries and check the results
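
Steps 1-3 look roughly like the following for a single table; the column
list and paths are illustrative stand-ins, not the exact benchmark scripts:

```sql
-- 1) External table over the generated text data (columns abbreviated).
CREATE DATABASE IF NOT EXISTS tpcds_text;
CREATE EXTERNAL TABLE tpcds_text.store_sales (
  ss_sold_date_sk    INT,
  ss_item_sk         INT,
  ss_ext_sales_price DECIMAL(7,2))
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION '/tpcds/text/100gb/store_sales';

-- 2) ORC copy of the same table.
CREATE DATABASE IF NOT EXISTS tpcds_orc;
CREATE TABLE tpcds_orc.store_sales STORED AS ORC
AS SELECT * FROM tpcds_text.store_sales;

-- 3) Column statistics for the cost-based optimizer.
ANALYZE TABLE tpcds_orc.store_sales COMPUTE STATISTICS FOR COLUMNS;
```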

Configuration:

1) set hive.execution.engine=tez, hive.execution.mode=container
2) set hive.cbo.enable=true

Experiment #1: hive.optimize.shared.work.dppunion=true

Query 2 fails:

java.lang.IllegalArgumentException: Edge [Reducer 9 :
org.apache.hadoop.hive.ql.exec.tez.ReduceTezProcessor] -> [Map 6 :
org.apache.hadoop.hive.ql.exec.tez.MapTezProcessor] ({ BROADCAST :
org.apache.tez.runtime.library.input.UnorderedKVInput >> PERSISTED >>
org.apache.tez.runtime.library.output.UnorderedKVOutput >> NullEdgeManager
}) already defined!

Query 14 fails:

org.apache.hadoop.hive.ql.parse.SemanticException: EXCEPT and INTERSECT
operations are only supported with Cost Based Optimizations enabled. Please
set 'hive.cbo.enable' to true!

Query 59 fails:

java.lang.IllegalArgumentException: Edge [Reducer 6 :
org.apache.hadoop.hive.ql.exec.tez.ReduceTezProcessor] -> [Map 4 :
org.apache.hadoop.hive.ql.exec.tez.MapTezProcessor] ({ BROADCAST :
org.apache.tez.runtime.library.input.UnorderedKVInput >> PERSISTED >>
org.apache.tez.runtime.library.output.UnorderedKVOutput >> NullEdgeManager
}) already defined!

Experiment #2: hive.optimize.shared.work.dppunion=false

Query 14 fails:

org.apache.hive.service.cli.HiveSQLException: Error while compiling
statement: FAILED: SemanticException EXCEPT and INTERSECT operations are
only supported with Cost Based Optimizations enabled. Please set
'hive.cbo.enable' to true!

Summary:

1. With hive.optimize.shared.work.dppunion=true, query 2 and 59 fail.
Please see the attachment for stack traces.

2. Query 14 fails in both cases, and it seems like another bug. Note that
hive.cbo.enable is set to true when running query 14.

3. For some queries, the number of rows is different between the two
experiments. In most cases, it seems to be rounding errors, but the
difference is rather large for some queries (e.g., query 29 and 58). Please
see the attachment for the result.

I could open a new Jira for this issue, or create a sub-task of HIVE-24384.
Or perhaps HIVE-24384 is already enough. So please let me know which would
be good for you.

(I have automated the entire experiment, so if you would like to see the
result of testing a new commit, I would be happy to rerun the experiment
and get back to you.)

Cheers,

--- Sungwoo

On Thu, Nov 12, 2020 at 10:49 PM Zoltan Haindrich  wrote:

> Hey Sungwoo!
>
> On 11/12/20 10:23 AM, Sungwoo Park wrote:
> > Hi Zoltan,
> >
> > I used the same hive-site.xml for the previous test (which was okay) and
> > the new test (which failed), so my guess is that it is perhaps due to a
> > commit since the previous test. Let me try later to identify the commit
> > that fails query 14, with the hope that identifying such a commit might
> be
> > useful in debugging.
>
> That would definitely help - if you could share the two commit hashes; it
> might be possible that we could guess it from the commit message or
> something.
>
>
> > Another question: is HIVE-24360 part of a solution to the problem of
> > hive.optimize.shared.work.dppunion?
> > I have tried the latest commit (which includes HIVE-24360) using the
> TPC-DS
> > benchmark, and it seems like the problem still exists.
>
> Yes, HIVE-24360 should have fixed that - do you still see an exception
> coming from tez-api reporting edge errors?
> I will also pick these changes for a smaller benchmark run soon...but I'm
> not running any right now. Could also note for which query you've seen the
> exception - so that I
> could also check it.
> Could you please open a jira about this - and add the actual exception
> trace/etc if available?
>
> cheers,
> Zoltan
>
> >
> > Cheers,
> >
> > --- Sungwoo
> >
> > On Mon, Nov 9, 2020 at 6:18 PM Zoltan Haindrich  wrote:
> >
> >> Hey Sungwoo!
> >>
> >> Regarding Q14 / "java.lang.RuntimeException: equivalence mapping
> violation"
> >>
> >>   From the stack trace you shared it seems like the mapper has already
> >> seen both the filter and the ast node earlier - and they are in separate
> >> mapping groups. (Which is
> >> unfortunate) I think it won't be simple to track that down - it will
> >> defin

Re: Result of the TPC-DS benchmark on Hive master branch

2020-11-12 Thread Sungwoo Park
Hi Zoltan,

I used the same hive-site.xml for the previous test (which was okay) and
the new test (which failed), so my guess is that it is perhaps due to a
commit since the previous test. Let me try later to identify the commit
that fails query 14, with the hope that identifying such a commit might be
useful in debugging.

Another question: is HIVE-24360 part of a solution to the problem of
hive.optimize.shared.work.dppunion?
I have tried the latest commit (which includes HIVE-24360) using the TPC-DS
benchmark, and it seems like the problem still exists.

Cheers,

--- Sungwoo

On Mon, Nov 9, 2020 at 6:18 PM Zoltan Haindrich  wrote:

> Hey Sungwoo!
>
> Regarding Q14 / "java.lang.RuntimeException: equivalence mapping violation"
>
>  From the stack trace you shared it seems like the mapper has already
> seen both the filter and the AST node earlier - and they are in separate
> mapping groups. (Which is
> unfortunate.) I think it won't be simple to track that down - it will
> definitely need some debugging.
> The best would be to have a repro query for it...
>
> note: we already run q14 in TestTezPerf*Driver - could it be that we've
> disabled some features in the hive-site.xml for these tests, and that's
> why we haven't seen it before?
>
> cheers,
> Zoltan
>
>


Re: Result of the TPC-DS benchmark on Hive master branch

2020-11-05 Thread Sungwoo Park
Hi Stamatis, Mustafa, Zoltán,

This is the result of a new experiment. These are the changes that I made:

1. Reverted HIVE-24139. (It turns out that HIVE-24139 does not affect the
result of the TPC-DS benchmark.)
2. Set hive.optimize.shared.work.dppunion to false in hive-site.xml.
3. Set tez.runtime.compress to false in tez-site.xml.

Here is the result.

1. Loading ORC tables succeeds. However, if tez.runtime.compress is set to
true, it fails with the following error at runtime:

Caused by: java.lang.InternalError: Could not decompress data. Buffer
length is too small.
  at
org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompressBytesDirect(Native
Method)
  at
org.apache.hadoop.io.compress.snappy.SnappyDecompressor.decompress(SnappyDecompressor.java:235)
...

It may be that this error comes from Tez, not Hive.

2. All queries pass okay, except query 14 which fails during compilation.
HiveServer2 throws two errors during the compilation of query 14.

1)
20/11/05 15:30:00 ERROR parse.CalcitePlanner: CBO failed, skipping CBO.
java.lang.RuntimeException: equivalence mapping violation
  at
org.apache.hadoop.hive.ql.plan.mapper.PlanMapper.link(PlanMapper.java:220)
  at
org.apache.hadoop.hive.ql.plan.mapper.PlanMapper.link(PlanMapper.java:192)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:3575)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:3538)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10830)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11776)
...

2)
20/11/05 15:30:00 ERROR ql.Driver: FAILED: NullPointerException null
java.lang.NullPointerException
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:4491)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:4474)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:10940)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10882)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11776)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11633)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11660)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11633)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11660)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11646)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlanForSubQueryPredicate(SemanticAnalyzer.java:3386)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFilterPlan(SemanticAnalyzer.java:3484)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10830)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11776)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11633)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11636)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11636)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11660)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11633)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11660)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11646)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genOPTree(SemanticAnalyzer.java:12428)
  at
org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:718)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:12539)
  at
org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:443)
  at
org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:301)
  at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:223)
  at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:104)
...

So, it seems that the TPC-DS benchmark reveals two bugs.

Let me try to find the commits that introduce these bugs. If anybody has a
guess on what commits could potentially introduce these bugs (since
HIVE-23114, Fri Apr 10), please let me know. Another option is to analyze
query 14 to find a simpler query that reproduces the same bug, but
unfortunately, it is a more challenging path for me.

Cheers,

--- Sungwoo


Result of the TPC-DS benchmark on Hive master branch

2020-11-04 Thread Sungwoo Park
Hello,

I have tested a recent commit of the master branch using the TPC-DS
benchmark. I used Hive on Tez (not Hive-LLAP). The way I tested is:

1) create a database consisting of external tables from a 100GB TPC-DS text
dataset
2) create a database consisting of ORC tables from the previous database
3) compute column statistics
4) run TPC-DS queries and check the results

Previously we tested the commit 5f47808c02816edcd4c323dfa25194536f3f20fd
(HIVE-23114: Insert overwrite with dynamic partitioning is not working
correctly with direct insert, Fri Apr 10), and all queries ran okay.

This time I used the following commits. I made a few changes to pom.xml of
both Hive and Tez, but these changes should not affect the result of
running queries.

1) Hive, master, 96aacdc50043fa442c2277b7629812e69241a507 (Tue Nov
3), HIVE-24314: compactor.Cleaner should not set state mark cleaned if it
didn't remove any files
2) Tez, 0.10.0, 22fec6c0ecc7ebe6f6f28800935cc6f69794dad5 (Thu Oct
8), CHANGES.txt updated with TEZ-4238

The result is that 14 queries (out of 99 queries) fail, and a query fails
during compilation for one of the following two reasons.

1)
org.apache.hive.service.cli.HiveSQLException: Error while compiling
statement: FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.tez.TezTask. Edge [Map 12 :
org.apache.hadoop.hive.ql.exec.tez.MapTezProcessor] -> [Map 7 :
org.apache.hadoop.hive.ql.exec.tez.MapTezProcessor] ({ BROADCAST :
org.apache.tez.runtime.library.input.UnorderedKVInput >> PERSISTED >>
org.apache.tez.runtime.library.output.UnorderedKVOutput >> NullEdgeManager
}) already defined!
  at
org.apache.hive.service.cli.operation.Operation.toSQLException(Operation.java:365)
  at
org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:241)
  at
org.apache.hive.service.cli.operation.SQLOperation.access$500(SQLOperation.java:88)
  at
org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork$1.run(SQLOperation.java:325)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:422)
  at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
  at
org.apache.hive.service.cli.operation.SQLOperation$BackgroundWork.run(SQLOperation.java:343)
  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
  at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException: Edge [Map 12 :
org.apache.hadoop.hive.ql.exec.tez.MapTezProcessor] -> [Map 7 :
org.apache.hadoop.hive.ql.exec.tez.MapTezProcessor] ({ BROADCAST :
org.apache.tez.runtime.library.input.UnorderedKVInput >> PERSISTED >>
org.apache.tez.runtime.library.output.UnorderedKVOutput >> NullEdgeManager
}) already defined!
  at org.apache.tez.dag.api.DAG.addEdge(DAG.java:297)
  at org.apache.hadoop.hive.ql.exec.tez.TezTask.build(TezTask.java:519)
  at org.apache.hadoop.hive.ql.exec.tez.TezTask.execute(TezTask.java:213)
  at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:213)
  at
org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:105)
  at org.apache.hadoop.hive.ql.Executor.launchTask(Executor.java:361)
  at org.apache.hadoop.hive.ql.Executor.launchTasks(Executor.java:334)
  at org.apache.hadoop.hive.ql.Executor.runTasks(Executor.java:245)
  at org.apache.hadoop.hive.ql.Executor.execute(Executor.java:108)
  at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:326)
  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:149)
  at org.apache.hadoop.hive.ql.Driver.run(Driver.java:144)
  at
org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:164)
  at
org.apache.hive.service.cli.operation.SQLOperation.runQuery(SQLOperation.java:228)
  ... 11 more

2)
Caused by: java.lang.NullPointerException
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:4491)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genSelectPlan(SemanticAnalyzer.java:4474)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:10940)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:10882)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11776)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11633)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11660)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11633)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11660)
  at
org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:11646)
  at