This is an interesting approach, Michael. I took it a bit further by
excluding all authors with only a single commit[1], since I think GitHub
PRs tend to highlight that kind of contribution more. Since 2012 I found 24
Lucene-only authors, 31 Solr-only authors, and 77 (about 58%) contributing
to both. Since 2018, with the same exclusion, the share of authors with
commits to both projects drops to 51%. Even so, I think that speaks to a
very high degree of collaboration.
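
For anyone who wants to reproduce this kind of count, here is a minimal
Python sketch of the bucketing. It is only an illustration, not the exact
procedure behind the numbers above: the names `author_overlap`,
`min_commits`, and `sample` are hypothetical, and it assumes the git log
output has already been parsed into (author, top-level folders touched)
pairs, one per commit. As a check on the arithmetic, 77 / (24 + 31 + 77)
is about 58%.

```python
from collections import Counter

PROJECTS = {"lucene", "solr"}


def author_overlap(commits, min_commits=2):
    """Bucket authors into Lucene-only, Solr-only, and both.

    `commits` is an iterable of (author, folders) pairs, one per commit,
    where `folders` is the set of top-level folders the commit touched.
    Authors with fewer than `min_commits` commits overall are dropped.
    """
    totals = Counter()
    touched = {}
    for author, folders in commits:
        totals[author] += 1
        # Only the two project folders matter for the bucketing.
        touched.setdefault(author, set()).update(set(folders) & PROJECTS)

    lucene_only = solr_only = both = 0
    for author, projects in touched.items():
        if totals[author] < min_commits:
            continue  # skip one-off contributors
        if projects == PROJECTS:
            both += 1
        elif projects == {"lucene"}:
            lucene_only += 1
        elif projects == {"solr"}:
            solr_only += 1

    counted = lucene_only + solr_only + both
    share_both = both / counted if counted else 0.0
    return lucene_only, solr_only, both, share_both


# Hypothetical sample data, not real commit history.
sample = [
    ("alice", {"lucene"}), ("alice", {"solr"}),
    ("bob", {"solr"}), ("bob", {"solr"}),
    ("carol", {"lucene"}),  # single commit, dropped by default
    ("dave", {"lucene"}), ("dave", {"lucene"}),
]
print(author_overlap(sample))
```

Counting commits per author across the whole repository, rather than per
folder, mirrors the intent of dropping anyone with a single commit overall;
a per-folder threshold would be an equally defensible choice and would give
somewhat different numbers.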


Dawid, thank you for putting this together. It has obviously been carefully
thought over, and there's a lot of content, so I'm not going to try to
comment on everything, but will highlight a few things that caught my
attention.


> This is a DISCUSS thread and it will be followed next week by a VOTE
> thread.
This sounds like a decision has already been made. Additionally, all of the
counterarguments presented come with rebuttals attached, so I'm not sure if
this is supposed to be a persuasive case or an expositional one.
My initial reaction is that I'm opposed to a split, but I can't yet say
concretely why.

> Precommit/ test times. These are crazy high.
This seems like an argument for fixing the tests and making them faster;
I'm not sure how we get from there to splitting the projects. If you're
making Solr-only changes, it's pretty easy to run "./gradlew -p solr test"
and skip the Lucene tests, and similarly for Lucene-only development.

> Mailing lists, build servers
This is probably a good idea and I think this is easy enough to do without
splitting the project as well.

> Solr should have its own cadence of releases driven by features, not
> sub-component changes
Yea, I think this is very likely to happen, where new Lucene versions may
not immediately get integrated into the next Solr version, or perhaps not
at all, unless somebody is specifically interested in a feature that it
offers. I think developers are busy, and incrementing a dependency version
is not something that happens unless there is a tangible reason. Which
leads directly into the next point...

> Solr tests are the first “battlefield” test zone for Lucene changes
I think https://issues.apache.org/jira/browse/SOLR-14428 is a great example
of the kind of collaboration that we can see, and a good hint of what to
expect if the projects are split. To summarize, there was a Lucene change
which caused some issues in Solr. The fix is likely going to end up being
another Lucene change, but just as easily could have been a kind of ugly
workaround on the Solr side.

I think the points and counterpoints are essentially correct, but the
opening statement appears to undersell the counterarguments as a matter of
degree, in my view. I'll continue to think on this, and post more as ideas
solidify in my head.

[1]: git shortlog -s -n --since=2018 | grep -v '\s1\s' | cut -c7-

On Mon, May 4, 2020 at 9:49 AM Michael Sokolov <msoko...@gmail.com> wrote:

> I always like to look at data when making a big decision, so I
> gathered some statistics about authors and commits to git over the
> history of the project. I wanted to see what these statistics could
> tell us about the degree of overlap between the two projects and
> whether it has changed over time. Using commands like
>
>      git log --pretty=%an --since=2012 -- lucene
>      git log --pretty=%an --since=2012 -- solr
>
> I looked at the authors of commits in the lucene and solr top-level
> folders of the project. I think this makes a reasonable proxy for
> contributors to the two projects. From there I found that since 2012,
> there are 60 Lucene-only authors, 71 Solr-only authors, and 101
> authors (or 43%) contributing at least one commit to each project.
> Since 2018, the percentage of both-project authors is somewhat lower:
> 36%.
>
> I also looked at commits spanning both projects. I'm not sure this
> captures all the work that touches both projects, but it's a window
> into that, at least. I found that since 2012, 1387/19063 (6.8%) of
> commits spanned both project folders. Since 2018, 7.4% did.
>
> I don't think you can really draw very many meaningful conclusions
> from this, but a few things jump out: First, it is clear that these
> projects are not completely separate today. A substantial number of
> people commit to both, over time, although most people do not. Also,
> relatively few commits span both projects. Some do though, and it's
> certainly worth considering what the workflow for such changes would
> be like in the split world. Maybe a majority of these are
> build-related; it's hard to tell from this coarse analysis.
>
>
> On Mon, May 4, 2020 at 5:11 AM Dawid Weiss <dawid.we...@gmail.com> wrote:
> >
> > Dear Lucene and Solr developers!
> >
> > A few days ago, I initiated a discussion among PMC members about
> > potential pros and cons of splitting the project into separate Lucene
> > and Solr entities by promoting Solr to its own top-level Apache
> > project (TLP). Let me share with you the motivation for such an action
> > and some follow-up thoughts I heard from other PMC members so far.
> >
> > Please read this e-mail carefully. Both the PMC and I look forward to
> > hearing your opinion. This is a DISCUSS thread and it will be followed
> > next week by a VOTE thread. This is our shared project and we should
> > all shape its future responsibly.
> >
> > The big question is this: “Is this the right time to split Solr and
> > Lucene into two independent projects?”.
> >
> > Here are several technical considerations that drove me to ask the
> > question above (in no order of priorities):
> >
> > 1) Precommit/ test times. These are crazy high. If we split into two
> > projects we can pretty much cut all of Lucene testing out of Solr (and
> > likewise), making development a bit more fun again.
> >
> > 2) Build system itself and source release packaging. The current
> > combined codebase is a *beast* to maintain. Working with gradle on
> > both projects at once made me realise how little the two have in
> > common. The code layout, the dependencies, even the workflow of people
> > working on these projects... The build (both ant and gradle) is full
> > of Solr and Lucene-specific exceptions and hooks that could be more
> > elegantly solved if moved to each project independently.
> >
> > 3) Packaging. There is no single source distribution package for
> > Solr+Lucene. They are already "independent" there. Why should Lucene
> > and Solr always be released at the same pace? Does it always make
> > sense?
> >
> > 4) Solr is essentially taking in Lucene and its dependencies as a
> > whole (so is Elasticsearch and many other projects). In my opinion
> > this makes Lucene eligible for refactoring and
> > maintenance as a separate component. The learning curve for people
> > coming to each project separately is going to be gentler than trying
> > to dive into the combined codebase.
> >
> > 5) Mailing lists, build servers. Mailing lists for users are already
> > separated. I think this is yet another indication that Solr is
> > something more than a component within Lucene. It is perceived as an
> > independent entity and used as an independent product. I would really
> > like to have separate mailing lists for these two projects (this
> > includes build and test results) as it would make life easier: if your
> > focus is more on Lucene (or Solr), you would only need to track half
> > of the current traffic.
> >
> >
> > As I already mentioned, the discussion among PMC members highlighted
> > some initial concerns and reasons why the project should perhaps
> > remain glued together. These are outlined below with some of the
> > counter-arguments presented under each concern to avoid repetition of
> > the same content from the PMC mailing list (they’re copied from the
> > private discussion list).
> >
> > 1) Both projects may gradually drift apart after the separation and
> > even develop “against” each other, like it used to be before the
> > merge.
> >
> > Whether this is a legitimate concern is hard to tell. If Solr goes TLP
> > then all existing Lucene committers will automatically become Solr
> > committers (unless they opt not to) so there will be both procedural
> > ways to prevent this from happening (vetoes) as well as common-sense
> > reasons to just cooperate.
> >
> > 2) Some people like parallel version numbering (concurrent Solr and
> > Lucene releases) as it gives instant clarity which Solr version uses
> > which version of Lucene.
> >
> > This can still be done on the Solr side (it is Solr’s decision to adopt
> > any versioning scheme the project feels comfortable with). I
> > personally (DW) think this kind of versioning is actually more
> > confusing than helpful; Solr should have its own cadence of releases
> > driven by features, not sub-component changes. If the “backwards
> > compatibility” is a factor then a solution might be to sync on major
> > version releases only (e.g., this is how Elasticsearch is handling
> > this).
> >
> > 3) Solr tests are the first “battlefield” test zone for Lucene changes
> > - if it becomes TLP this part will be gone.
> >
> > Yes, true. But realistically Solr will have to adopt some kind of
> > snapshot-based dependency on Lucene anyway (whether as a git submodule
> > or a maven snapshot dependency). So if there are bugs in Lucene they
> > will still be detected by Solr tests (and fairly early).
> >
> > 4) Why split now if we merged in the first place?
> >
> > Some of you may wonder why split the project that was initially
> > *merged* from two independent codebases (around 10 years ago). In
> > short, there was a lot of code duplication and interaction between
> > Solr and Lucene back then, with patches flying back and forth.
> > Integration into a single codebase seemed like a great idea to clean
> > things up and make things easier. In many ways this is exactly what
> > did happen: we have cleaned up code dependencies and reusable
> > components (on Lucene side) consumed by not just Solr but also other
> > projects (downstream from Lucene).
> >
> > The situation we find ourselves in now is different from what it was
> > before: recent and ongoing development for the most part falls within
> > Solr or Lucene exclusively.
> >
> >
> > This e-mail is for discussing the idea and presenting arguments/
> > counter-arguments for or against the split. It will be followed by a
> > separate VOTE thread e-mail next Monday. If the vote passes then there
> > are many questions about how this process should be arranged and
> > orchestrated. There are past examples even within Lucene [1] that we
> > can learn from, and there are people who know how to do it - the
> > actual process is of lesser concern at the moment, what we mostly want
> > to do is to reach out to you, signal the idea and ask about your
> > opinion. Let us know what you think.
> >
> > [1] https://lists.apache.org/thread.html/15bf2dc6d6ccd25459f8a43f0122751eedd3834caa31705f790844d7%401270142638%40%3Cuser.nutch.apache.org%3E
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> > For additional commands, e-mail: dev-h...@lucene.apache.org
> >