Re: Role of Hadoop code in Cassandra 5.0

Miklosovic, Stefan Fri, 17 Mar 2023 00:38:54 -0700

You can initiate that vote if you want, I do not see any problem with that.


Please be aware that there is an ongoing vote on the 4.1.1 release, so we have
a couple of options:

a) someone posts a binding -1 there, which will render that vote failed.
b) the vote on 4.1.1 just passes and we release it as it is.

The 4.1.1-tentative has this in NEWS.txt (1). If we fail the 4.1.1 vote, we 
need to get back to this, remove the line which says that it will be removed 
in the next major (5.0), redeploy all artifacts and vote on it again.

You can also initiate a formal vote on the Hadoop removal in 5.0 if you think 
that the community consensus in this thread is not enough already, and then see 
if it fails or passes. If it passes, it will unblock the voting thread on 4.1.1 
so we will not need to redeploy anything.

From my perspective, the voting thread for 4.1.1 takes precedence here. If 
you are not satisfied with what we are going to ship, you have every right 
to object to that and stop it.

Regards

(1) https://github.com/apache/cassandra/blob/4.1.1-tentative/NEWS.txt#L76

________________________________________
From: David Capwell <[email protected]>
Sent: Thursday, March 16, 2023 23:17
To: [email protected]
Subject: Re: Role of Hadoop code in Cassandra 5.0

The actual removal in 6.0 means that this will not go away sooner than ... 2025?

This is more of a discuss thread and not a vote thread, so if we wish to change 
our previously agreed rules for this I think we do need to vote on it.  Given 
that 4.1.0 is already out, the argument would be to deprecate in 4.1.x (patch 
release) and we are currently talking about when we can freeze 5.0 (even though 
it's a few months away), so the removal is very fast...

My vote would be +0, but I am trying to call out that a change in our behavior 
should be more visible.

On Mar 16, 2023, at 2:46 PM, Miklosovic, Stefan <[email protected]> 
wrote:

I reached out to the Hadoop Slack channel and asked if there would be somebody 
to help us with the update. The first response was something like "why do you 
ask? we are not going to spend time on updating it for you" (fair enough); the 
next responses were like "if this is so old and not maintained, it does not 
seem like people even care", which is totally spot on. Who are we really 
addressing here? This integration became practically irrelevant the day the 
Spark connector was mature enough; it will be even less relevant once 5.0 is 
out.

The actual removal in 6.0 means that this will not go away sooner than ... 
2025? Do we really want to be removing 12-year-old code two years from now? 
That also means that we need to make sure it at least compiles etc. I would say 
that nobody cares already.

I do not have a problem with dropping an email to the user list. I'll get to it 
tomorrow.

________________________________________
From: Jeremy Hanna <[email protected]>
Sent: Thursday, March 16, 2023 22:27
To: [email protected]
Subject: Re: Role of Hadoop code in Cassandra 5.0

Regarding deprecation, while I support the deprecation and removal from the 
Cassandra codebase, I do think we should communicate that with the wider 
community (user thread?) so people aren't surprised - especially since it's 
already four months after the 4.1.0 release.  That would hopefully also 
encourage those interested in continuing support to extract it out into a 
separate library.

On Mar 16, 2023, at 4:19 PM, Miklosovic, Stefan <[email protected]> 
wrote:

I think we already decided it in this thread.

I was specifically asking this question:

Deprecation would mean that the code has to be there for the whole of 5.0 so we 
can remove it for real in 6.0?

To which the response was:

I think if we reach consensus here that decides it. I too vote to
deprecate in 4.1.x.  This means we would remove it in 5.0.

Then a bunch of +1s followed and agreed with that explicitly.

I do not plan to maintain or extract that, personally.

________________________________________
From: David Capwell <[email protected]>
Sent: Thursday, March 16, 2023 22:13
To: [email protected]
Subject: Re: Role of Hadoop code in Cassandra 5.0

Aren't our deprecation rules that if we deprecate in 4.0.0 we can remove in 
5.x, but 4.x needs to wait for 6.x?  I am cool deprecating this and willing to 
pull it into another repo if people (not me) are willing to maintain it (else 
just delete).

On Mar 10, 2023, at 1:13 AM, Jacek Lewandowski <[email protected]> wrote:

I've experimentally added https://issues.apache.org/jira/browse/CASSANDRA-16984 
to https://issues.apache.org/jira/browse/CASSANDRA-18306 (post 4.0 cleanup)

- - -- --- ----- -------- -------------
Jacek Lewandowski


On Fri, 10 Mar 2023 at 09:56, Berenguer Blasi <[email protected]> wrote:

+1 deprecate + removal

On 10/3/23 1:41, Jeremy Hanna wrote:
It was mainly to integrate with Hadoop - I used it from 0.6 to 1.2 in 
production prior to starting at DataStax and at that time I was stitching 
together Cloudera's distribution of Hadoop with Cassandra.  Back then there 
were others that used it as well.  As far as I know, usage dropped off when the 
Spark Cassandra Connector got pretty mature.  It enabled people to take an off 
the shelf Hadoop distribution and run the Hadoop processes on the same nodes or 
external to the Cassandra cluster and get topology information to do things 
like Hadoop splits and things like that through the Hadoop interfaces.  I think 
the version lag is an indication that it hasn't been used recently.  Also, like 
others have said, the Spark Cassandra Connector is really what people should be 
using at this point imo.  That or depending on the use case, Apple's bulk 
reader: https://github.com/jberragan/spark-cassandra-bulkreader that is 
mentioned on https://issues.apache.org/jira/browse/CASSANDRA-16222.

On Mar 9, 2023, at 12:00 PM, Rahul Xavier Singh <[email protected]> wrote:

What is the hadoop code for? For interacting from Hadoop via CQL, or Thrift if 
it's that old, or directly looking at SSTables? Been using C* since 2 and have 
never used it.

Agree to deprecate in next possible 4.1.x version and remove in 5.0

Rahul Singh
Chief Executive Officer | Business Platform Architect m: 202.905.2818 e: 
[email protected] li: http://linkedin.com/in/xingh ca: http://calendly.com/xingh

We create, support, and manage real-time global data & analytics platforms for 
the modern enterprise.

Anant | https://anant.us
3 Washington Circle, Suite 301
Washington, D.C. 20037

http://Cassandra.Link : The best resources for Apache Cassandra


On Thu, Mar 9, 2023 at 12:53 PM Brandon Williams <[email protected]> wrote:
I think if we reach consensus here that decides it. I too vote to
deprecate in 4.1.x.  This means we would remove it in 5.0.

Kind Regards,
Brandon

On Thu, Mar 9, 2023 at 11:32 AM Ekaterina Dimitrova <[email protected]> wrote:

Deprecation sounds good to me, but I am not completely sure in which version we 
can do it. If it is possible to add a deprecation warning in the 4.x series or 
at least 4.1.x - I vote for that.

On Thu, 9 Mar 2023 at 12:14, Jacek Lewandowski <[email protected]> wrote:

Is it possible to deprecate it in the 4.1.x patch release? :)


- - -- --- ----- -------- -------------
Jacek Lewandowski


On Thu, 9 Mar 2023 at 18:11, Brandon Williams <[email protected]> wrote:

This is my feeling too, but I think we should accomplish this by
deprecating it first.  I don't expect anything will change after the
deprecation period.

Kind Regards,
Brandon

On Thu, Mar 9, 2023 at 11:09 AM Jacek Lewandowski <[email protected]> wrote:

I vote for removing it entirely.

thanks
- - -- --- ----- -------- -------------
Jacek Lewandowski


On Thu, 9 Mar 2023 at 18:07, Miklosovic, Stefan <[email protected]> wrote:

Derek,

I have a couple more points ... I do not think that extracting it to a separate 
repository is a "win". That code is on Hadoop 1.0.3. We would be spending a lot 
of work on extracting it just to end up with 10-year-old code with occasional 
updates (in my humble opinion, only to make it compilable again when the code 
around it changes). What good is there in that? We would have one more place to 
take care of ... Now we at least have it all in one place.

I believe we have four options:

1) leave it there as it is for the coming years, with questionable and 
diminishing usage
2) update it to Hadoop 3.3 (I wonder who is going to do that)
3) do 2) and extract it to a separate repository, but if we do 2) we can just 
leave it there
4) remove it

________________________________________
From: Derek Chen-Becker <[email protected]>
Sent: Thursday, March 9, 2023 15:55
To: [email protected]
Subject: Re: Role of Hadoop code in Cassandra 5.0

I think the question isn't "Who ... is still using that?" but more "are we 
actually going to support it?" If we're on a version that old it would appear 
that we've basically abandoned it, although there do appear to have been 
refactoring commits (for other things) in the last couple of years. I would be 
in favor of removal from 5.0, but at the very least, could it be moved into a 
separate repo/package so that it's not pulling a relatively large dependency 
subtree from Hadoop into our main codebase?

Cheers,

Derek

On Thu, Mar 9, 2023 at 6:44 AM Miklosovic, Stefan <[email protected]> wrote:
Hi list,

I stumbled upon the Hadoop package again. I think there was some discussion 
about the relevance of the Hadoop code some time ago, but I would like to ask 
this again.

Do you think Hadoop code (1) is still relevant in 5.0? Who in the industry is 
still using that?

We might drop a lot of code and some Hadoop dependencies too (3) (even though 
their scope is "provided"). The version of Hadoop we build upon is 1.0.3, which 
was released 10 years ago. This code does not have any tests or documentation 
on the website.

There seem to be issues like this (2), and it seems the solution is, basically, 
to use the Spark Cassandra connector instead, which I would say is quite 
reasonable.

Regards

(1) 
https://github.com/apache/cassandra/tree/trunk/src/java/org/apache/cassandra/hadoop
(2) https://lists.apache.org/thread/jdy5hdc2l7l29h04dqol5ylroqos1y2p
(3) 
https://github.com/apache/cassandra/blob/trunk/.build/parent-pom-template.xml#L507-L589


--
+---------------------------------------------------------------+
| Derek Chen-Becker                                             |
| GPG Key available at https://keybase.io/dchenbecker and       |
| https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
| Fngrprnt: EB8A 6480 F0A3 C8EB C1E7  7F42 AFC5 AFEE 96E4 6ACC  |
+---------------------------------------------------------------+
