date:20220128

Re: Time to Remove Hive-on-Spark

2022-01-28 Thread Stamatis Zampetakis

Hi team,

Almost one year has passed since the last exchange in this discussion and,
if I am not wrong, there has been no effort to revive Hive-on-Spark. To be
more precise, I don't think I have seen any Spark-related JIRA for quite
some time now and, although I don't want to jump to conclusions, there
does not seem to be any community member involved in maintaining or adding
new features to this part of the code.

Keeping dead code in the repository does the project no good and puts a
non-negligible burden on future maintainers.

Clearly, we cannot make a new Hive release where a major feature is
completely untested, so either someone commits to re-enabling/fixing the
respective tests soon or we move forward with the work started by David and
drop support for Hive-on-Spark.

I would like to ask the community if there is anyone who can take up this
maintenance task and enable/fix Spark-related tests in the next month or so?

Best,
Stamatis

On Sat, Feb 27, 2021 at 4:17 AM Edward Capriolo 
wrote:

> I do not know how it works for most of the world. But at Cloudera, where the
> TEZ options were never popular, hive-on-spark represents a solid way to get
> things done for small datasets at lower latency.
>
> As for Spark adoption: a while ago I came up with some ways to make hive
> more spark-like. One of them was that I found a way to make "compile"
> a hive keyword so folks could build UDFs on the fly. It was such an
> uphill climb. Folks found a way to make it disabled by default for security.
> Then later, when things moved from CLI to beeline, it was like the ONLY thing
> that I found not ported. It was extremely frustrating.
>
> On Mon, Jul 27, 2020 at 3:19 PM David  wrote:
>
> > Hello Xuefu,
> >
> > I am not part of the Cloudera Hive product team, though I volunteer to
> > work on small projects from time to time. Perhaps someone from that team
> > can chime in with some of their thoughts, but personally, I think that in
> > the long run, there will be more of a merge between Hive-on-Spark and
> > other Spark-native offerings. I'm not sure what the differentiation will
> > be going forward. With that said, are there any developers on this
> > mailing list who are willing to take on the maintenance effort of keeping
> > HoS moving forward?
> >
> > http://www.russellspitzer.com/2017/05/19/Spark-Sql-Thriftserver/
> >
> >
> > https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.6.4/bk_spark-component-guide/content/config-sts.html
> >
> >
> > Thanks.
> >
> > On Thu, Jul 23, 2020 at 12:35 PM Xuefu Zhang  wrote:
> >
> > > Previous reasoning seemed to suggest a lack of user adoption. Now we are
> > > concerned about ongoing maintenance effort. Both are valid considerations.
> > > However, I think we should have ways to find out the answers. Therefore, I
> > > suggest the following be carried out:
> > >
> > > 1. Send out the proposal (removing Hive on Spark) to users including
> > > u...@hive.apache.org and get their feedback.
> > > 2. Ask if any developers on this mailing list are willing to take on the
> > > maintenance effort.
> > >
> > > I'm concerned about user impact because I can still see issues being
> > > reported on HoS from time to time. I'm more concerned about the future of
> > > Hive if we narrow Hive neutrality on execution engines, which will possibly
> > > force more Hive users to migrate to other alternatives such as Spark SQL,
> > > which is already eroding Hive's user base.
> > >
> > > Being open and neutral used to be Hive's most admired strengths.
> > >
> > > Thanks,
> > > Xuefu
> > >
> > >
> > > On Wed, Jul 22, 2020 at 8:46 AM Alan Gates wrote:
> > >
> > > > An important point here is I don't believe David is proposing to remove
> > > > Hive on Spark from the 2 or 3 lines, but only from trunk.  Continuing to
> > > > support it in existing 2 and 3 lines makes sense, but since no one has
> > > > maintained it on trunk for some time and it does not work with many of the
> > > > newer features it should be removed from trunk.
> > > >
> > > > Alan.
> > > >
> > > > On Tue, Jul 21, 2020 at 4:10 PM Chao Sun  wrote:
> > > >
> > > > > Thanks David. FWIW Uber is still running Hive on Spark (2.3.4) on a very
> > > > > large scale in production right now and I don't think we have any plan to
> > > > > change it soon.
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Jul 21, 2020 at 11:28 AM David  wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > Thanks for the feedback.
> > > > > >
> > > > > > Just a quick recap: I did propose this @dev and I received unanimous +1's
> > > > > > from the community.  After a couple months, I created the PR.
> > > > > >
> > > > > > Certainly open to discussion, but there hasn't been any discussion thus far
> > > > > > because there have been no objections until this point.
> > > > > >
> > > > > > HoS has low adoption, heavy technical debt, and the manner in
>

[jira] [Created] (HIVE-25909) Add test for 'hive.default.nulls.last' property for windows with ordering

2022-01-28 Thread Alessandro Solimando (Jira)

Alessandro Solimando created HIVE-25909:
---

 Summary: Add test for 'hive.default.nulls.last' property for windows with ordering
 Key: HIVE-25909
 URL: https://issues.apache.org/jira/browse/HIVE-25909
 Project: Hive
  Issue Type: Test
  Components: CBO
Affects Versions: 4.0.0
Reporter: Alessandro Solimando
Assignee: Alessandro Solimando


Add a test for the "hive.default.nulls.last" configuration property and its 
interaction with ORDER BY clauses within windows.

The property is known to determine the default NULL ordering as follows:

 
||hive.default.nulls.last||ASC||DESC||
|true|NULL LAST|NULL FIRST|
|false|NULL FIRST|NULL LAST|
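
The table above can be read as a small decision rule. The following Python sketch is only an illustration of that rule, not Hive's implementation: the function names `nulls_first` and `order_values` are hypothetical, and timestamps are modeled as ISO-formatted strings so that plain string comparison matches chronological order.

```python
from typing import List, Optional


def nulls_first(descending: bool,
                default_nulls_last: bool,
                explicit: Optional[str] = None) -> bool:
    """Return True when NULLs should be placed before non-NULL values.

    `explicit` models an explicit NULLS FIRST / NULLS LAST clause and,
    when given ("FIRST" or "LAST"), overrides the configured default.
    """
    if explicit is not None:
        return explicit == "FIRST"
    # Per the table: with the property set to true, ASC puts NULLs last
    # and DESC puts them first; with false, it is the other way around.
    return descending == default_nulls_last


def order_values(values: List[Optional[str]],
                 descending: bool,
                 default_nulls_last: bool,
                 explicit: Optional[str] = None) -> List[Optional[str]]:
    """Sort values and place NULLs (None) according to the rule above."""
    non_null = sorted((v for v in values if v is not None), reverse=descending)
    nulls = [v for v in values if v is None]
    if nulls_first(descending, default_nulls_last, explicit):
        return nulls + non_null
    return non_null + nulls


if __name__ == "__main__":
    # Hypothetical values for column c within one window partition.
    c = ["2021-10-10", None, "2022-01-10"]
    print(order_values(c, descending=True, default_nulls_last=True))
    print(order_values(c, descending=True, default_nulls_last=False))
    print(order_values(c, descending=True, default_nulls_last=False,
                       explicit="FIRST"))
```

Row numbers in a window would then follow the position of each value in the returned list, which is what the expected outputs in the examples check.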

The test can be modeled along the lines of the following examples:
{noformat}
-- hive.default.nulls.last is true by default, it sets NULLS_FIRST for DESC
set hive.default.nulls.last;

OUT:
hive.default.nulls.last=true

SELECT a, b, c, row_number() OVER (PARTITION BY a, b ORDER BY b DESC, c DESC)
FROM test1;

OUT:
John Doe        1990-05-10 00:00:00     2022-01-10 00:00:00     1
John Doe        1990-05-10 00:00:00     2021-12-10 00:00:00     2
John Doe        1990-05-10 00:00:00     2021-11-10 00:00:00     3
John Doe        1990-05-10 00:00:00     2021-10-10 00:00:00     4
John Doe        1990-05-10 00:00:00     2021-09-10 00:00:00     5
John Doe        1987-05-10 00:00:00     NULL    1
John Doe        1987-05-10 00:00:00     2022-01-10 00:00:00     2
John Doe        1987-05-10 00:00:00     2021-12-10 00:00:00     3
John Doe        1987-05-10 00:00:00     2021-11-10 00:00:00     4
John Doe        1987-05-10 00:00:00     2021-10-10 00:00:00     5

-- we set hive.default.nulls.last=false, it sets NULLS_LAST for DESC
set hive.default.nulls.last=false;

SELECT a, b, c, row_number() OVER (PARTITION BY a, b ORDER BY b DESC, c DESC)
FROM test1;

OUT:
John Doe        1990-05-10 00:00:00     2022-01-10 00:00:00     1
John Doe        1990-05-10 00:00:00     2021-12-10 00:00:00     2
John Doe        1990-05-10 00:00:00     2021-11-10 00:00:00     3
John Doe        1990-05-10 00:00:00     2021-10-10 00:00:00     4
John Doe        1990-05-10 00:00:00     2021-09-10 00:00:00     5
John Doe        1987-05-10 00:00:00     2022-01-10 00:00:00     1
John Doe        1987-05-10 00:00:00     2021-12-10 00:00:00     2
John Doe        1987-05-10 00:00:00     2021-11-10 00:00:00     3
John Doe        1987-05-10 00:00:00     2021-10-10 00:00:00     4
John Doe        1987-05-10 00:00:00     NULL    5

-- we set hive.default.nulls.last=false but we have explicit NULLS_LAST,
-- we expect NULLS_LAST
set hive.default.nulls.last=false;

SELECT a, b, c, row_number() OVER (PARTITION BY a, b ORDER BY b DESC, c DESC 
NULLS LAST)
FROM test1;

OUT:
John Doe        1990-05-10 00:00:00     2022-01-10 00:00:00     1
John Doe        1990-05-10 00:00:00     2021-12-10 00:00:00     2
John Doe        1990-05-10 00:00:00     2021-11-10 00:00:00     3
John Doe        1990-05-10 00:00:00     2021-10-10 00:00:00     4
John Doe        1990-05-10 00:00:00     2021-09-10 00:00:00     5
John Doe        1987-05-10 00:00:00     2022-01-10 00:00:00     1
John Doe        1987-05-10 00:00:00     2021-12-10 00:00:00     2
John Doe        1987-05-10 00:00:00     2021-11-10 00:00:00     3
John Doe        1987-05-10 00:00:00     2021-10-10 00:00:00     4
John Doe        1987-05-10 00:00:00     NULL    5

-- we have explicit NULLS_FIRST, we expect NULLS_FIRST
SELECT a, b, c, row_number() OVER (PARTITION BY a, b ORDER BY b DESC, c DESC 
NULLS FIRST)
FROM test1;

OUT:
John Doe        1990-05-10 00:00:00     2022-01-10 00:00:00     1
John Doe        1990-05-10 00:00:00     2021-12-10 00:00:00     2
John Doe        1990-05-10 00:00:00     2021-11-10 00:00:00     3
John Doe        1990-05-10 00:00:00     2021-10-10 00:00:00     4
John Doe        1990-05-10 00:00:00     2021-09-10 00:00:00     5
John Doe        1987-05-10 00:00:00     NULL    1
John Doe        1987-05-10 00:00:00     2022-01-10 00:00:00     2
John Doe        1987-05-10 00:00:00     2021-12-10 00:00:00     3
John Doe        1987-05-10 00:00:00     2021-11-10 00:00:00     4
John Doe        1987-05-10 00:00:00     2021-10-10 00:00:00     5
{noformat}
 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
