Just FYI, the Hive languages manual is also version-less: https://cwiki.apache.org/confluence/display/Hive/LanguageManual
It's not a strong data point as this doc is not actively updated, but my personal feeling is that it's nice to see the history of a feature: when it was introduced, when it got changed, with the JIRA ticket linked. One potential issue is that if a feature has been changed 100 times in history, it's too verbose to document all 100 different behaviors for different versions. If that happens, I think we can make each major version have its own programming guide, assuming we won't change a feature 100 times in Spark 4 :)

On Mon, Jun 10, 2024 at 1:08 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:

> My personal opinion is that having the documents per version (current and
> previous), without fixing previous versions - just keeping them as a
> snapshot in time of the current documentation once the new version was
> released - should be good enough.
>
> Because Neil would now like to change the documentation (personally I
> think it's very much needed and a great thing to do), there will be a big
> gap between the old documents and the new ones...
> If, after rewriting and rearranging the documents, someone feels it would
> be beneficial to port the documentation back to some of the older versions
> as a one-time thing, that's possible as well, of course...
>
> I find this solution to be the best of all worlds - versioned, so you can
> read the documents relevant to the version you use (though I am in favour
> of working with updated versions rather than old ones anyway), while the
> documentation can be updated many times after the release, independently
> from the actual release of Spark.
>
> I think that keeping one document to support all versions will soon become
> hard to read and understand, with little benefit from having updated
> documentation for old versions.
> Regarding SEO and deranking, afaik updating the documentation more
> frequently should only improve ranking, so the latest documentation should
> always rank high in Google search, but maybe I'm missing something.
>
> Nimrod
>
> On Mon, Jun 10, 2024 at 21:25 Nicholas Chammas <nicholas.cham...@gmail.com> wrote:
>
>> I will let Neil and Matt clarify the details because I believe they
>> understand the overall picture better. However, I would like to emphasize
>> something that motivated this effort and which may be getting lost in the
>> concerns about versioned vs. versionless docs.
>>
>> The main problem is that some of the guides need major overhauls.
>>
>> There are people like Neil who are interested in making significant
>> contributions to the guides. What is holding them back is that major
>> changes to the web docs can trigger wholesale deranking of our site by
>> Google. Since versioned docs are tied to Spark releases, which are
>> infrequent, that means potentially being nuked in the search rankings for
>> months.
>>
>> Versionless docs allow for rapid iteration on the guides, which can be
>> driven in part by search rankings.
>>
>> In other words, there is a problem chain here that leads to versionless
>> docs:
>>
>> 1. Several guides need major improvements.
>> 2. We cannot make such improvements because a) that would risk site
>> deranking, and b) we are constrained by Spark's release schedule.
>> 3. Versionless guides allow for incremental improvements, which addresses
>> problems 2a and 2b.
>>
>> This is my understanding of the big picture as described to me by Neil
>> and Matt. I defer to them to elaborate on the details, especially in
>> relation to Google site rankings. If this concern is not valid or not that
>> serious, then we can just iterate slowly on the docs with Spark's existing
>> release schedule and there is less need for versionless docs.
>> Nick
>>
>> On Jun 10, 2024, at 1:53 PM, Mridul Muralidharan <mri...@gmail.com> wrote:
>>
>> Hi,
>>
>> Versioned documentation has the benefit that users can have reasonable
>> confidence that the features, functionality and examples mentioned will
>> work with that released Spark version.
>> A versionless guide runs into potential issues with deprecation,
>> behavioral changes and new features.
>>
>> My concern is not just around features highlighting their supported
>> versions, but examples which reference other features in Spark.
>>
>> For example, SQL differences between Hive QL and ANSI SQL when we flip
>> the default in 4.0: we would have 4.x example snippets for some feature
>> (say UDAF) which would not work for 3.x, and vice versa.
>>
>> Regards,
>> Mridul
>>
>> On Mon, Jun 10, 2024 at 12:03 PM Hyukjin Kwon <gurwls...@apache.org> wrote:
>>
>>> I am +1 on this but, as you guys mentioned, we should really be clear on
>>> how to address different versions.
>>>
>>> On Wed, 5 Jun 2024 at 18:27, Matthew Powers <matthewkevinpow...@gmail.com> wrote:
>>>
>>>> I am a huge fan of the Apache Spark docs and I regularly look at the
>>>> analytics on this page
>>>> <https://analytics.apache.org/index.php?module=CoreHome&action=index&date=yesterday&period=day&idSite=40#?period=day&date=yesterday&category=Dashboard_Dashboard&subcategory=1>
>>>> to see how well they are doing. Great work to everyone who has contributed
>>>> to the docs over the years.
>>>>
>>>> We've been chipping away at some improvements over the past year and
>>>> have made good progress. For example, many of the pages were missing
>>>> canonical links. Canonical links are a special type of link that is
>>>> extremely important for any site that has duplicate content. Versioned
>>>> documentation sites have lots of duplicate pages, so getting these
>>>> canonical links added was important. It wasn't really easy to make this
>>>> change, though.
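>>>> For reference, a canonical link is just a tag in a page's <head> that
>>>> points search engines at the authoritative copy of a duplicated page.
>>>> Here is a minimal sketch of the idea (a hypothetical helper, not the
>>>> actual Spark docs build step):

```python
# Sketch: inject a rel=canonical tag into a versioned docs page so search
# engines consolidate ranking signals onto the /docs/latest/ copy.
# The helper name and the build flow are hypothetical illustrations.
def add_canonical_link(html: str, page: str) -> str:
    """Insert a rel=canonical tag into <head>, unless one is already present."""
    if 'rel="canonical"' in html:
        return html  # respect an existing canonical link
    tag = f'<link rel="canonical" href="https://spark.apache.org/docs/latest/{page}"/>'
    return html.replace("</head>", tag + "</head>", 1)

# A toy versioned page, e.g. /docs/3.3.4/quick-start.html
versioned_page = "<html><head><title>Quick Start</title></head><body>...</body></html>"
print(add_canonical_link(versioned_page, "quick-start.html"))
```

>>>> Every versioned copy of a page then points at its /docs/latest/ twin,
>>>> which is why adding these tags mattered for duplicate-heavy doc sites.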
>>>>
>>>> The current site is confusing Google a bit. If you do a "spark
>>>> rocksdb" Google search, for example, you get the Spark 3.2 Structured
>>>> Streaming Programming Guide as the first result (because Google isn't
>>>> properly indexing the docs). You need to Control+F and search for
>>>> "rocksdb" to navigate to the relevant section, which says: "As of Spark
>>>> 3.2, we add a new built-in state store implementation...", which is
>>>> what you'd expect in a versionless docs site in any case.
>>>>
>>>> There are two different user experiences:
>>>>
>>>> * Option A: push Spark 3.1 Structured Streaming users to the Spark 3.1
>>>> Structured Streaming Programming Guide, which doesn't mention RocksDB
>>>> * Option B: push Spark Structured Streaming users to the latest
>>>> Structured Streaming Programming Guide, which mentions RocksDB, but caveat
>>>> that this feature was added in Spark 3.2
>>>>
>>>> I think Option B provides Spark 3.1 users a better experience overall.
>>>> It's better to let users know they can access RocksDB by upgrading than
>>>> to hide this info from them, IMO.
>>>>
>>>> Now if we want Option A, then we'd need to give users a reasonable way
>>>> to actually navigate to the Spark 3.1 docs. From what I can tell, the only
>>>> way to navigate from the latest Structured Streaming Programming Guide
>>>> <https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html>
>>>> to a different version is by manually updating the URL.
>>>>
>>>> I was just skimming over the Structured Streaming Programming Guide and
>>>> noticed again how many of the Python code snippets aren't PEP 8
>>>> compliant. It seems like our current docs publishing process prevents
>>>> us from improving the old docs pages.
>>>>
>>>> In this conversation, let's make sure we distinguish between
>>>> "programming guides" and "API documentation". API docs should be versioned
>>>> and there is no question there.
>>>> Programming guides are higher-level
>>>> conceptual overviews, like the Polars user guide
>>>> <https://docs.pola.rs/>, and should be relevant across many versions.
>>>>
>>>> I would also like to point out that the current programming guides are
>>>> not consistent:
>>>>
>>>> * The Structured Streaming programming guide
>>>> <https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html>
>>>> is one giant page
>>>> * The SQL programming guide
>>>> <https://spark.apache.org/docs/latest/sql-programming-guide.html> is
>>>> split across many pages
>>>> * The PySpark programming guide
>>>> <https://spark.apache.org/docs/latest/api/python/getting_started/index.html>
>>>> takes you to a whole different URL structure and makes it so you can't even
>>>> navigate to the other programming guides anymore
>>>>
>>>> I am looking forward to collaborating with the community and improving
>>>> the docs to 1. delight existing users and 2. attract new users. Docs are a
>>>> "website problem" and we're big data people, but I'm confident we'll be
>>>> able to work together and find a good path forward here.
>>>>
>>>> On Wed, Jun 5, 2024 at 3:22 PM Neil Ramaswamy <n...@ramaswamy.org> wrote:
>>>>
>>>>> Thanks all for the responses. Let me try to address everything.
>>>>>
>>>>> > the programming guides are also different between versions since
>>>>> > features are being added, configs are being added/removed/changed,
>>>>> > defaults are being changed etc.
>>>>>
>>>>> I agree that this is the case. But I think it's fine to mention what
>>>>> version a feature is available in. In fact, I would argue that mentioning
>>>>> an improvement that a version brings motivates users to upgrade more than
>>>>> reserving docs improvements for "new releases to keep the community
>>>>> updating". Users should upgrade to get a better Spark, not better Spark
>>>>> documentation.
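>>>>> For example, the RocksDB caveat discussed above could become an
>>>>> inline, version-flagged snippet in a versionless guide. A rough
>>>>> sketch (the config key and provider class are the ones Spark 3.2
>>>>> introduced; the exact wording and helper shape here are illustrative,
>>>>> not the guide's actual text):

```python
# Hypothetical inline version note for a versionless guide: the RocksDB
# state store provider is available in Spark 3.2+ only; on earlier
# versions this provider class does not exist, so the guide would flag
# the requirement next to the snippet instead of hiding the feature.
ROCKSDB_STATE_STORE = {
    "spark.sql.streaming.stateStore.providerClass":
        "org.apache.spark.sql.execution.streaming.state."
        "RocksDBStateStoreProvider",
}

# One way to apply it (e.g. via spark-submit --conf flags on 3.2+):
for key, value in ROCKSDB_STATE_STORE.items():
    print(f"--conf {key}={value}")
```

>>>>> The point is that the version requirement travels with the snippet,
>>>>> so a 3.1 reader immediately sees what upgrading would unlock.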
>>>>>
>>>>> > having a programming guide that refers to features or API methods
>>>>> > that do not exist in that version is confusing and detrimental
>>>>>
>>>>> I don't think that we'd do this. Again, programming guides should
>>>>> teach fundamentals that do not change version-to-version. TypeScript
>>>>> <https://www.typescriptlang.org/docs/handbook/typescript-from-scratch.html>
>>>>> (which has one of the best DXs and docs) does this exceptionally well.
>>>>> Their guides are refined, versionless pages; new features are elaborated
>>>>> upon in release notes (analogous to our version-specific docs); and the
>>>>> occasional version-specific caveat is called out in the guides.
>>>>>
>>>>> I agree with Wenchen's 3 points. I don't think we need to say that
>>>>> they *have* to go to the old page, but that if they want to, they can.
>>>>>
>>>>> Neil
>>>>>
>>>>> On Wed, Jun 5, 2024 at 12:04 PM Wenchen Fan <cloud0...@gmail.com> wrote:
>>>>>
>>>>>> I agree with the idea of a versionless programming guide. But one
>>>>>> thing we need to make sure of is that we give clear messages about things
>>>>>> that are only available in a new version. My proposal is:
>>>>>>
>>>>>> 1. Keep the old versions' programming guides unchanged. For
>>>>>> example, people can still access
>>>>>> https://spark.apache.org/docs/3.3.4/quick-start.html
>>>>>> 2. In the new versionless programming guide, mention at the
>>>>>> beginning that for Spark versions before 4.0, readers should go to the
>>>>>> versioned doc site to read the programming guide.
>>>>>> 3. Revisit the programming guide of Spark 4.0 (compare it with
>>>>>> the one for 3.5), and adjust the content to mention version-specific
>>>>>> changes (API changes, new features, etc.)
>>>>>>
>>>>>> Then we can have a versionless programming guide starting from Spark
>>>>>> 4.0.
>>>>>> We can also revisit the programming guides of all versions and combine
>>>>>> them into one with version-specific notes, but that's probably too much
>>>>>> work.
>>>>>>
>>>>>> Any thoughts?
>>>>>>
>>>>>> Wenchen
>>>>>>
>>>>>> On Wed, Jun 5, 2024 at 1:39 AM Martin Andersson <martin.anders...@kambi.com> wrote:
>>>>>>
>>>>>>> While I have no practical knowledge of how documentation is
>>>>>>> maintained in the Spark project, I must agree with Nimrod. For users on
>>>>>>> older versions, having a programming guide that refers to features or API
>>>>>>> methods that do not exist in that version is confusing and detrimental.
>>>>>>>
>>>>>>> Surely there must be a better way to allow updating documentation
>>>>>>> more often?
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Martin
>>>>>>>
>>>>>>> ------------------------------
>>>>>>> *From:* Nimrod Ofek <ofek.nim...@gmail.com>
>>>>>>> *Sent:* Wednesday, June 5, 2024 08:26
>>>>>>> *To:* Neil Ramaswamy <n...@ramaswamy.org>
>>>>>>> *Cc:* Praveen Gattu <praveen.ga...@databricks.com.invalid>; dev <dev@spark.apache.org>
>>>>>>> *Subject:* Re: [DISCUSS] Versionless Spark Programming Guide Proposal
>>>>>>>
>>>>>>> Hi Neil,
>>>>>>>
>>>>>>> While you wrote that you don't mean the API docs (of course), the
>>>>>>> programming guides are also different between versions, since features
>>>>>>> are being added, configs are being added/removed/changed, defaults are
>>>>>>> being changed, etc.
>>>>>>>
>>>>>>> I know of "backport hell" - which is why I wrote that once a version
>>>>>>> is released it's frozen, and the documentation will be updated for the
>>>>>>> new version only.
>>>>>>>
>>>>>>> I think of it as facing forward and keeping older versions, but
>>>>>>> focusing on the new releases to keep the community updating.
>>>>>>> While Spark has a support window of 18 months until EOL, we can have
>>>>>>> only a 6-month support cycle until EOL for documentation - there are no
>>>>>>> major security concerns for documentation...
>>>>>>>
>>>>>>> Nimrod
>>>>>>>
>>>>>>> On Wed, Jun 5, 2024 at 08:28 Neil Ramaswamy <n...@ramaswamy.org> wrote:
>>>>>>>
>>>>>>> Hi Nimrod,
>>>>>>>
>>>>>>> Quick clarification - my proposal will not touch API-specific
>>>>>>> documentation, for the specific reasons you mentioned (signatures,
>>>>>>> behavior, etc.). It just aims to make the *programming guides*
>>>>>>> versionless. Programming guides should teach the fundamentals of Spark,
>>>>>>> and the fundamentals of Spark should not change between releases.
>>>>>>>
>>>>>>> There are a few issues with updating documentation multiple times
>>>>>>> after Spark releases. First, fixes that apply to all existing versions'
>>>>>>> programming guides need backport PRs. For example, this change
>>>>>>> <https://github.com/apache/spark/pull/46797/files> applies to all
>>>>>>> the versions of the SS programming guide, but is likely to be fixed only
>>>>>>> in Spark 4.0. Additionally, any such update within a Spark release
>>>>>>> requires re-building the static sites in the spark repo, and copying
>>>>>>> those files to spark-website via a commit in spark-website. Making a typo
>>>>>>> fix like the one I linked would then require
>>>>>>> <number of versions we want to update> + 1 PRs,
>>>>>>> as opposed to 1 PR in the versionless programming guide world.
>>>>>>>
>>>>>>> Neil
>>>>>>>
>>>>>>> On Tue, Jun 4, 2024 at 1:32 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> While I think that the documentation needs a lot of improvement and
>>>>>>> important details are missing - and detaching the documentation from the
>>>>>>> main project can help us iterate faster on documentation-specific tasks -
>>>>>>> I don't think we can, nor should, move to versionless documentation.
>>>>>>>
>>>>>>> Documentation is version-specific: parameters are added and removed,
>>>>>>> new features are added, behaviours sometimes change, etc.
>>>>>>>
>>>>>>> I think the documentation should be version-specific - but separate
>>>>>>> from the Spark release cadence - and can be updated multiple times after
>>>>>>> a Spark release.
>>>>>>> The way I see it, the documentation should be updated only for the
>>>>>>> latest version; some time before a new release it should be archived,
>>>>>>> and the updated documentation should reflect the new version.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Nimrod
>>>>>>>
>>>>>>> On Tue, Jun 4, 2024 at 18:34 Praveen Gattu <praveen.ga...@databricks.com.invalid> wrote:
>>>>>>>
>>>>>>> +1. This helps with greater velocity in improving docs. However, we
>>>>>>> might still need a way to provide version-specific information,
>>>>>>> i.e. what features are available in which version, etc.
>>>>>>>
>>>>>>> On Mon, Jun 3, 2024 at 3:08 PM Neil Ramaswamy <n...@ramaswamy.org> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I've written up a proposal to migrate all of the Apache Spark
>>>>>>> programming guides to be versionless. You can find the proposal here
>>>>>>> <https://docs.google.com/document/d/1OqeQ71zZleUa1XRZrtaPDFnJ-gVJdGM80o42yJVg9zg/>.
>>>>>>> Please leave comments, or reply in this DISCUSS thread.
>>>>>>>
>>>>>>> TLDR: by making the programming guides versionless, we can make
>>>>>>> updates to them whenever we'd like, instead of at the Spark release
>>>>>>> cadence. This increased update velocity will enable us to make gradual
>>>>>>> improvements, including breaking up the Structured Streaming programming
>>>>>>> guide into smaller sub-guides. The proposal does not break *any* existing
>>>>>>> URLs, and it does not affect our versioned API docs in any way.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Neil