There are two issues and one main benefit that I see with versioned
programming guides:

   - *Issue 1*: We often retroactively realize that code snippets have bugs
   and explanations are confusing (see examples: dropDuplicates
   <https://github.com/apache/spark/pull/46797>,
   dropDuplicatesWithinWatermark
   <https://stackoverflow.com/questions/77512507/how-exactly-does-dropduplicateswithinwatermark-work>).
   Without backporting to older guides, I don't think that users can have, as
   Mridul says, "reasonable confidence that features, functionality and
   examples mentioned will work with that released Spark version". In this
   sense, I definitely disagree with Nimrod's position of "working on updated
   versions and not working with old versions anyway." To have confidence in
   versioned programming guides, we *must* have a system for backporting
   and re-releasing.
   - *Issue 2*: If programming guides live in the Spark website, you now
   need maintenance releases in Spark to get those changes to production (i.e.
   spark-website). Historically, Spark does *not* create maintenance
   releases frequently, especially not just for a docs change. So, we'd need
   to break precedent (this would create potentially dozens of minor releases,
   far more than what we do today), and the person making docs changes needs
   to rebuild the docs site and create one PR in spark-website for *every*
   version they change. Fixing a code typo in 4 versions? You need 4
   maintenance releases, and 4 more PRs.
   - *Benefit 1*: versioned docs don't have to caveat, in prose, which
   features are available.


Personally, I think it's fine to caveat what features are available in
prose. For the case where we have *completely* incompatible Spark code
(which should be exceedingly rare), we can provide different code snippets.
As Wenchen points out, if we *do* have 100 mutually incompatible versions,
we have an issue, but the ANSI SQL default might be one of these rare
examples.

(Note: version-specific commentary is already present in the Structured
Streaming Programming Guide, our most popular
<https://analytics.apache.org/index.php?module=CoreHome&action=index&date=yesterday&period=day&idSite=40#?idSite=40&period=day&date=yesterday&category=General_Actions&subcategory=General_Pages>
guide. It flows nicely: for example, we talk about state, and then we say,
"hey, if you have Spark 4.0, state is more easily debuggable because of the
state reader." The prose focuses on the stable concept of state—which has
been unchanged since 2.0.0—and then mentions a feature that can
encourage upgrade.)

However, I do see one path forward with versioned guides: 1) guide changes
do not constitute a maintenance release; 2) we create automation that lets
us backport docs changes to old branches; 3) once merged in Spark, the
automation rebuilds all the static sites and creates PRs in spark-website.
The downside is that backport merge conflicts *will* force developers to
backport changes themselves. While I do not want to sign up for that work,
is this something people are more comfortable with?

Neil


On Tue, Jun 11, 2024 at 8:47 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> Just FYI, the Hive language manual is also version-less:
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual
>
> It's not a strong data point as this doc is not actively updated, but my
> personal feeling is that it's nice to see the history of a feature: when it
> was introduced, when it got changed, with JIRA ticket linked.
>
> One potential issue is that if a feature has been changed 100 times in
> history, it's too verbose to document all 100 different behaviors for
> different versions. If that happens, I think we can make each major version
> have its own programming guide, assuming we won't change a feature 100
> times in Spark 4 :)
>
> On Mon, Jun 10, 2024 at 1:08 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>
>> My personal opinion is that having the documents per version (current and
>> previous), without fixing previous versions - just keeping them as a
>> snapshot in time of the current documentation once the new version was
>> released, should be good enough.
>>
>> Because now Neil would like to change the documentation (personally I
>> think it's very needed and it's a great thing to do) - there will be a big
>> gap between the old documents and the new ones...
>> If after rewriting and rearranging the documents someone would feel it can
>> be beneficial to port back the documentation for some of the older versions
>> as well as a one time thing, that's possible as well of course...
>>
>> I find this solution to be best of all worlds - versioned, so you can
>> read documents which are relevant to the version you use (though I am in
>> favour of working on updated versions and not working with old versions
>> anyway), while the documentation can be updated many times, after the
>> release and independently from the actual release of Spark.
>>
>> I think that keeping one document to support all versions will soon
>> become hard to read and understand with little benefit of having updated
>> documentation for old versions.
>>
>>
>> Regarding SEO and deranking, afaik updating the documentation more
>> frequently should only improve ranking so the latest documentation should
>> always be ranked high in Google search, but maybe I'm missing something.
>>
>> Nimrod
>>
>>
>>
>> On Mon, Jun 10, 2024 at 9:25 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> I will let Neil and Matt clarify the details because I believe they
>>> understand the overall picture better. However, I would like to emphasize
>>> something that motivated this effort and which may be getting lost in the
>>> concerns about versioned vs. versionless docs.
>>>
>>> The main problem is that some of the guides need major overhauls.
>>>
>>> There are people like Neil who are interested in making significant
>>> contributions to the guides. What is holding them back is that major
>>> changes to the web docs can trigger wholesale deranking of our site by
>>> Google. Since versioned docs are tied to Spark releases, which are
>>> infrequent, that means potentially being nuked in the search rankings for
>>> months.
>>>
>>> Versionless docs allow for rapid iteration on the guides, which can be
>>> driven in part by search rankings.
>>>
>>> In other words, there is a problem chain here that leads to versionless
>>> docs:
>>>
>>> 1. Several guides need major improvements.
>>> 2. We cannot make such improvements because a) that would risk site
>>> deranking, and b) we are constrained by Spark's release schedule.
>>> 3. Versionless guides allow for incremental improvements, which
>>> addresses problems 2a and 2b.
>>>
>>> This is my understanding of the big picture as described to me by Neil
>>> and Matt. I defer to them to elaborate on the details, especially in
>>> relation to Google site rankings. If this concern is not valid or not that
>>> serious, then we can just iterate slowly on the docs with Spark’s existing
>>> release schedule and there is less need for versionless docs.
>>>
>>> Nick
>>>
>>>
>>> On Jun 10, 2024, at 1:53 PM, Mridul Muralidharan <mri...@gmail.com>
>>> wrote:
>>>
>>>
>>> Hi,
>>>
>>>   Versioned documentation has the benefit that users can have reasonable
>>> confidence that features, functionality and examples mentioned will work
>>> with that released Spark version.
>>> A versionless guide runs into potential issues with deprecation,
>>> behavioral changes and new features.
>>>
>>> My concern is not just around features highlighting their supported
>>> versions, but examples which reference other features in Spark.
>>>
>>> For example, SQL differences between HiveQL and ANSI SQL when we flip
>>> the default in 4.0: we would have 4.x example snippets for some feature
>>> (say UDAF) which would not work for 3.x and vice versa.
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>> On Mon, Jun 10, 2024 at 12:03 PM Hyukjin Kwon <gurwls...@apache.org>
>>> wrote:
>>>
>>>> I am +1 on this but as you guys mentioned, we should really be clear on
>>>> how to address different versions.
>>>>
>>>> On Wed, 5 Jun 2024 at 18:27, Matthew Powers <
>>>> matthewkevinpow...@gmail.com> wrote:
>>>>
>>>>> I am a huge fan of the Apache Spark docs and I regularly look at the
>>>>> analytics on this page
>>>>> <https://analytics.apache.org/index.php?module=CoreHome&action=index&date=yesterday&period=day&idSite=40#?period=day&date=yesterday&category=Dashboard_Dashboard&subcategory=1>
>>>>> to see how well they are doing.  Great work to everyone that's contributed
>>>>> to the docs over the years.
>>>>>
>>>>> We've been chipping away with some improvements over the past year and
>>>>> have made good progress.  For example, lots of the pages were missing
>>>>> canonical links.  Canonical links are a special type of link that is
>>>>> extremely important for any site that has duplicate content.  Versioned
>>>>> documentation sites have lots of duplicate pages, so getting these
>>>>> canonical links added was important.  It wasn't really easy to make this
>>>>> change though.
>>>>>
>>>>> The current site is confusing Google a bit.  If you do a "spark
>>>>> rocksdb" Google search for example, you get the Spark 3.2 Structured
>>>>> Streaming Programming Guide as the first result (because Google isn't
>>>>> properly indexing the docs).  You need to Control+F and search for
>>>>> "rocksdb" to navigate to the relevant section which says: "As of
>>>>> Spark 3.2, we add a new built-in state store implementation...",
>>>>> which is what you'd expect in a versionless docs site in any case.
>>>>>
>>>>> There are two different user experiences:
>>>>>
>>>>> * Option A: push Spark 3.1 Structured Streaming users to the Spark 3.1
>>>>> Structured Streaming Programming guide that doesn't mention RocksDB
>>>>> * Option B: push Spark Structured Streaming users to the latest
>>>>> Structured Streaming Programming Guide, which mentions RocksDB, but caveat
>>>>> that this feature was added in Spark 3.2
>>>>>
>>>>> I think Option B provides Spark 3.1 users a better experience
>>>>> overall.  It's better to let users know they can access RocksDB by
>>>>> upgrading than hiding this info from them IMO.
>>>>>
>>>>> Now if we want Option A, then we'd need to give users a reasonable way
>>>>> to actually navigate to the Spark 3.1 docs.  From what I can tell, the 
>>>>> only
>>>>> way to navigate from the latest Structured Streaming Programming Guide
>>>>> <https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html>
>>>>> to a different version is by manually updating the URL.
>>>>>
>>>>> I was just skimming over the Structured Streaming Programming guide
>>>>> and noticing again how lots of the Python code snippets aren't PEP 8
>>>>> compliant.  It seems like our current docs publishing process would 
>>>>> prevent
>>>>> us from improving the old docs pages.
>>>>>
>>>>> In this conversation, let's make sure we distinguish between
>>>>> "programming guides" and "API documentation".  API docs should be 
>>>>> versioned
>>>>> and there is no question there.  Programming guides are higher level
>>>>> conceptual overviews, like the Polars user guide
>>>>> <https://docs.pola.rs/>, and should be relevant across many versions.
>>>>>
>>>>> I would also like to point out that the current programming guides are
>>>>> not consistent:
>>>>>
>>>>> * The Structured Streaming programming guide
>>>>> <https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html>
>>>>> is one giant page
>>>>> * The SQL programming guide
>>>>> <https://spark.apache.org/docs/latest/sql-programming-guide.html> is
>>>>> split on many pages
>>>>> * The PySpark programming guide
>>>>> <https://spark.apache.org/docs/latest/api/python/getting_started/index.html>
>>>>> takes you to a whole different URL structure and makes it so you can't 
>>>>> even
>>>>> navigate to the other programming guides anymore
>>>>>
>>>>> I am looking forward to collaborating with the community and improving
>>>>> the docs to 1. delight existing users and 2. attract new users.  Docs are 
>>>>> a
>>>>> "website problem" and we're big data people, but I'm confident we'll be
>>>>> able to work together and find a good path forward here.
>>>>>
>>>>>
>>>>> On Wed, Jun 5, 2024 at 3:22 PM Neil Ramaswamy <n...@ramaswamy.org>
>>>>> wrote:
>>>>>
>>>>>> Thanks all for the responses. Let me try to address everything.
>>>>>>
>>>>>> > the programming guides are also different between versions since
>>>>>> features are being added, configs are being added/ removed/ changed,
>>>>>> defaults are being changed etc.
>>>>>>
>>>>>> I agree that this is the case. But I think it's fine to mention what
>>>>>> version a feature is available in. In fact, I would argue that mentioning
>>>>>> an improvement that a version brings motivates users to upgrade more than
>>>>>> keeping docs improvement to "new releases to keep the community 
>>>>>> updating".
>>>>>> Users should upgrade to get a better Spark, not better Spark 
>>>>>> documentation.
>>>>>>
>>>>>> > having a programming guide that refers to features or API methods
>>>>>> that does not exist in that version is confusing and detrimental
>>>>>>
>>>>>> I don't think that we'd do this. Again, programming guides should
>>>>>> teach fundamentals that do not change version-to-version. TypeScript
>>>>>> <https://www.typescriptlang.org/docs/handbook/typescript-from-scratch.html>
>>>>>>  (which
>>>>>> has one of the best DX's and docs) does this exceptionally well.
>>>>>> Their guides are refined, versionless pages; new features are elaborated
>>>>>> upon in release notes (analogous to our version-specific docs); and the
>>>>>> occasional version-specific caveat is called out in the guides.
>>>>>>
>>>>>>  I agree with Wenchen's 3 points. I don't think we need to say that
>>>>>> they *have* to go to the old page, but that if they want to, they
>>>>>> can.
>>>>>>
>>>>>> Neil
>>>>>>
>>>>>> On Wed, Jun 5, 2024 at 12:04 PM Wenchen Fan <cloud0...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I agree with the idea of a versionless programming guide. But one
>>>>>>> thing we need to make sure of is we give clear messages for things that 
>>>>>>> are
>>>>>>> only available in a new version. My proposal is:
>>>>>>>
>>>>>>>    1. keep the old versions' programming guide unchanged. For
>>>>>>>    example, people can still access
>>>>>>>    https://spark.apache.org/docs/3.3.4/quick-start.html
>>>>>>>    2. In the new versionless programming guide, we mention at the
>>>>>>>    beginning that for Spark versions before 4.0, go to the versioned 
>>>>>>> doc site
>>>>>>>    to read the programming guide.
>>>>>>>    3. Revisit the programming guide of Spark 4.0 (compare it with
>>>>>>>    the one of 3.5), and adjust the content to mention version-specific 
>>>>>>> changes
>>>>>>>    (API change, new features, etc.)
>>>>>>>
>>>>>>> Then we can have a versionless programming guide starting from Spark
>>>>>>> 4.0. We can also revisit programming guides of all versions and combine
>>>>>>> them into one with version-specific notes, but that's probably too much
>>>>>>> work.
>>>>>>>
>>>>>>> Any thoughts?
>>>>>>>
>>>>>>> Wenchen
>>>>>>>
>>>>>>> On Wed, Jun 5, 2024 at 1:39 AM Martin Andersson <
>>>>>>> martin.anders...@kambi.com> wrote:
>>>>>>>
>>>>>>>> While I have no practical knowledge of how documentation is
>>>>>>>> maintained in the spark project, I must agree with Nimrod. For users on
>>>>>>>> older versions, having a programming guide that refers to features or 
>>>>>>>> API
>>>>>>>> methods that does not exist in that version is confusing and 
>>>>>>>> detrimental.
>>>>>>>>
>>>>>>>> Surely there must be a better way to allow updating documentation
>>>>>>>> more often?
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Martin
>>>>>>>>
>>>>>>>> ------------------------------
>>>>>>>> *From:* Nimrod Ofek <ofek.nim...@gmail.com>
>>>>>>>> *Sent:* Wednesday, June 5, 2024 08:26
>>>>>>>> *To:* Neil Ramaswamy <n...@ramaswamy.org>
>>>>>>>> *Cc:* Praveen Gattu <praveen.ga...@databricks.com.invalid>; dev <
>>>>>>>> dev@spark.apache.org>
>>>>>>>> *Subject:* Re: [DISCUSS] Versionless Spark Programming Guide
>>>>>>>> Proposal
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Neil,
>>>>>>>>
>>>>>>>>
>>>>>>>> While you wrote you don't mean the api docs (of course), the
>>>>>>>> programming guides are also different between versions since features 
>>>>>>>> are
>>>>>>>> being added, configs are being added/ removed/ changed, defaults are 
>>>>>>>> being
>>>>>>>> changed etc.
>>>>>>>>
>>>>>>>> I know of "backport hell" - which is why I wrote that once a
>>>>>>>> version is released it's frozen and the documentation will be updated for
>>>>>>>> for
>>>>>>>> the new version only.
>>>>>>>>
>>>>>>>> I think of it as facing forward and keeping older versions but
>>>>>>>> focusing on the new releases to keep the community updating.
>>>>>>>> While Spark has a support window of 18 months until EOL, we can have
>>>>>>>> only a 6-month support cycle until EOL for documentation - there are no
>>>>>>>> major security concerns for documentation...
>>>>>>>>
>>>>>>>> Nimrod
>>>>>>>>
>>>>>>>> On Wed, Jun 5, 2024 at 8:28 AM Neil Ramaswamy <
>>>>>>>> n...@ramaswamy.org> wrote:
>>>>>>>>
>>>>>>>> Hi Nimrod,
>>>>>>>>
>>>>>>>> Quick clarification—my proposal will not touch API-specific
>>>>>>>> documentation for the specific reasons you mentioned (signatures, 
>>>>>>>> behavior,
>>>>>>>> etc.). It just aims to make the *programming guides* versionless.
>>>>>>>> Programming guides should teach fundamentals of Spark, and the 
>>>>>>>> fundamentals
>>>>>>>> of Spark should not change between releases.
>>>>>>>>
>>>>>>>> There are a few issues with updating documentation multiple times
>>>>>>>> after Spark releases. First, fixes that apply to all existing versions'
>>>>>>>> programming guides need backport PRs. For example, this change
>>>>>>>> <https://github.com/apache/spark/pull/46797/files> applies to all
>>>>>>>> the versions of the SS programming guide, but is likely to be fixed 
>>>>>>>> only in
>>>>>>>> Spark 4.0. Additionally, any such update within a Spark release will 
>>>>>>>> require
>>>>>>>> re-building the static sites in the spark repo, and copying those 
>>>>>>>> files to
>>>>>>>> spark-website via a commit in spark-website. Making a typo fix like 
>>>>>>>> the one
>>>>>>>> I linked would then require <number of versions we want to update> + 1 
>>>>>>>> PRs,
>>>>>>>> as opposed to 1 PR in the versionless programming guide world.
>>>>>>>>
>>>>>>>> Neil
>>>>>>>>
>>>>>>>> On Tue, Jun 4, 2024 at 1:32 PM Nimrod Ofek <ofek.nim...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> While I think that the documentation needs a lot of improvement and
>>>>>>>> important details are missing - and detaching the documentation from 
>>>>>>>> the
>>>>>>>> main project can help iterating faster on documentation specific 
>>>>>>>> tasks, I
>>>>>>>> don't think we can nor should move to versionless documentation.
>>>>>>>>
>>>>>>>> Documentation is version specific: parameters are added and
>>>>>>>> removed, new features are added, behaviours sometimes change etc.
>>>>>>>>
>>>>>>>> I think the documentation should be version specific- but separate
>>>>>>>> from spark release cadence - and can be updated multiple times after 
>>>>>>>> spark
>>>>>>>> release.
>>>>>>>> The way I see it is that the documentation should be updated only
>>>>>>>> for the latest version and some time before a new release should be
>>>>>>>> archived and the updated documentation should reflect the new version.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Nimrod
>>>>>>>>
>>>>>>>> On Tue, Jun 4, 2024 at 6:34 PM Praveen Gattu
>>>>>>>> <praveen.ga...@databricks.com.invalid> wrote:
>>>>>>>>
>>>>>>>> +1. This helps for greater velocity in improving docs. However, we
>>>>>>>> might still need a way to provide version-specific information, i.e.
>>>>>>>> which features are available in which version, etc.
>>>>>>>>
>>>>>>>> On Mon, Jun 3, 2024 at 3:08 PM Neil Ramaswamy <n...@ramaswamy.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I've written up a proposal to migrate all the Apache Spark
>>>>>>>> programming guides to be versionless. You can find the proposal
>>>>>>>> here
>>>>>>>> <https://docs.google.com/document/d/1OqeQ71zZleUa1XRZrtaPDFnJ-gVJdGM80o42yJVg9zg/>.
>>>>>>>> Please leave comments, or reply in this DISCUSS thread.
>>>>>>>>
>>>>>>>> TLDR: by making the programming guides versionless, we can make
>>>>>>>> updates to them whenever we'd like, instead of at the Spark release
>>>>>>>> cadence. This increased update velocity will enable us to make gradual
>>>>>>>> improvements, including breaking up the Structured Streaming 
>>>>>>>> programming
>>>>>>>> guide into smaller sub-guides. The proposal does not break *any*
>>>>>>>> existing URLs, and it does not affect our versioned API docs in any way.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>> Neil
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>
