Just FYI, the Hive Language Manual is also version-less:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual

It's not a strong data point as this doc is not actively updated, but my
personal feeling is that it's nice to see the history of a feature: when it
was introduced, when it got changed, with JIRA ticket linked.

One potential issue is that if a feature has been changed 100 times in
history, it's too verbose to document all 100 different behaviors for
different versions. If that happens, I think we can make each major version
have its own programming guide, assuming we won't change a feature 100
times in Spark 4 :)

On Mon, Jun 10, 2024 at 1:08 PM Nimrod Ofek <ofek.nim...@gmail.com> wrote:

> My personal opinion is that having the documents per version (current and
> previous), without fixing previous versions - just keeping them as a
> snapshot in time of the current documentation once the new version was
> released, should be good enough.
>
> Because now Neil would like to change the documentation (personally I
> think it's very needed and it's a great thing to do) - there will be a big
> gap between the old documents and the new ones...
> If after rewriting and rearranging the documents someone would feel it can
> be beneficial to port back the documentation for some of the older versions
> as well as a one time thing, that's possible as well of course...
>
> I find this solution to be best of all worlds - versioned, so you can read
> documents which are relevant to the version you use (though I am in favour
> of working on updated versions and not working with old versions anyway),
> while the documentation can be updated many times, after the release and
> independently from the actual release of Spark.
>
> I think that keeping one document to support all versions will soon become
> hard to read and understand with little benefit of having updated
> documentation for old versions.
>
>
> Regarding SEO and deranking, afaik updating the documentation more
> frequently should only improve ranking so the latest documentation should
> always be ranked high in Google search, but maybe I'm missing something.
>
> Nimrod
>
>
>
> On Mon, Jun 10, 2024 at 21:25, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I will let Neil and Matt clarify the details because I believe they
>> understand the overall picture better. However, I would like to emphasize
>> something that motivated this effort and which may be getting lost in the
>> concerns about versioned vs. versionless docs.
>>
>> The main problem is that some of the guides need major overhauls.
>>
>> There are people like Neil who are interested in making significant
>> contributions to the guides. What is holding them back is that major
>> changes to the web docs can trigger wholesale deranking of our site by
>> Google. Since versioned docs are tied to Spark releases, which are
>> infrequent, that means potentially being nuked in the search rankings for
>> months.
>>
>> Versionless docs allow for rapid iteration on the guides, which can be
>> driven in part by search rankings.
>>
>> In other words, there is a problem chain here that leads to versionless
>> docs:
>>
>> 1. Several guides need major improvements.
>> 2. We cannot make such improvements because a) that would risk site
>> deranking, and b) we are constrained by Spark's release schedule.
>> 3. Versionless guides allow for incremental improvements, which addresses
>> problems 2a and 2b.
>>
>> This is my understanding of the big picture as described to me by Neil
>> and Matt. I defer to them to elaborate on the details, especially in
>> relation to Google site rankings. If this concern is not valid or not that
>> serious, then we can just iterate slowly on the docs with Spark’s existing
>> release schedule and there is less need for versionless docs.
>>
>> Nick
>>
>>
>> On Jun 10, 2024, at 1:53 PM, Mridul Muralidharan <mri...@gmail.com>
>> wrote:
>>
>>
>> Hi,
>>
>>   Versioned documentation has the benefit that users can have reasonable
>> confidence that features, functionality and examples mentioned will work
>> with that released Spark version.
>> A versionless guide runs into potential issues with deprecation,
>> behavioral changes and new features.
>>
>> My concern is not just around features highlighting their supported
>> versions, but examples which reference other features in Spark.
>>
>> For example, SQL differences between HiveQL and ANSI SQL when we flip
>> the default in 4.0: we would have 4.x example snippets for some feature
>> (say UDAF) which would not work for 3.x and vice versa.
>>
>> Regards,
>> Mridul
>>
>>
>> On Mon, Jun 10, 2024 at 12:03 PM Hyukjin Kwon <gurwls...@apache.org>
>> wrote:
>>
>>> I am +1 on this but as you guys mentioned, we should really be clear on
>>> how to address different versions.
>>>
>>> On Wed, 5 Jun 2024 at 18:27, Matthew Powers <
>>> matthewkevinpow...@gmail.com> wrote:
>>>
>>>> I am a huge fan of the Apache Spark docs and I regularly look at the
>>>> analytics on this page
>>>> <https://analytics.apache.org/index.php?module=CoreHome&action=index&date=yesterday&period=day&idSite=40#?period=day&date=yesterday&category=Dashboard_Dashboard&subcategory=1>
>>>> to see how well they are doing.  Great work to everyone that's contributed
>>>> to the docs over the years.
>>>>
>>>> We've been chipping away with some improvements over the past year and
>>>> have made good progress.  For example, lots of the pages were missing
>>>> canonical links.  Canonical links are a special type of link that are
>>>> extremely important for any site that has duplicate content.  Versioned
>>>> documentation sites have lots of duplicate pages, so getting these
>>>> canonical links added was important.  It wasn't really easy to make this
>>>> change though.
>>>>
>>>> The current site is confusing Google a bit.  If you do a "spark
>>>> rocksdb" Google search for example, you get the Spark 3.2 Structured
>>>> Streaming Programming Guide as the first result (because Google isn't
>>>> properly indexing the docs).  You need to Control+F and search for
>>>> "rocksdb" to navigate to the relevant section which says: "As of Spark
>>>> 3.2, we add a new built-in state store implementation...", which is
>>>> what you'd expect in a versionless docs site in any case.
>>>>
>>>> There are two different user experiences:
>>>>
>>>> * Option A: push Spark 3.1 Structured Streaming users to the Spark 3.1
>>>> Structured Streaming Programming guide that doesn't mention RocksDB
>>>> * Option B: push Spark Structured Streaming users to the latest
>>>> Structured Streaming Programming guide, which mentions RocksDB, but caveat
>>>> that this feature was added in Spark 3.2
>>>>
>>>> I think Option B provides Spark 3.1 users a better experience overall.
>>>> It's better to let users know they can access RocksDB by upgrading than
>>>> hiding this info from them IMO.
>>>>
>>>> Now if we want Option A, then we'd need to give users a reasonable way
>>>> to actually navigate to the Spark 3.1 docs.  From what I can tell, the only
>>>> way to navigate from the latest Structured Streaming Programming Guide
>>>> <https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html>
>>>> to a different version is by manually updating the URL.
>>>>
>>>> I was just skimming over the Structured Streaming Programming guide and
>>>> noticing again how lots of the Python code snippets aren't PEP 8
>>>> compliant.  It seems like our current docs publishing process would prevent
>>>> us from improving the old docs pages.
>>>>
>>>> In this conversation, let's make sure we distinguish between
>>>> "programming guides" and "API documentation".  API docs should be versioned
>>>> and there is no question there.  Programming guides are higher level
>>>> conceptual overviews, like the Polars user guide
>>>> <https://docs.pola.rs/>, and should be relevant across many versions.
>>>>
>>>> I would also like to point out that the current programming guides are
>>>> not consistent:
>>>>
>>>> * The Structured Streaming programming guide
>>>> <https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html>
>>>> is one giant page
>>>> * The SQL programming guide
>>>> <https://spark.apache.org/docs/latest/sql-programming-guide.html> is
>>>> split on many pages
>>>> * The PySpark programming guide
>>>> <https://spark.apache.org/docs/latest/api/python/getting_started/index.html>
>>>> takes you to a whole different URL structure and makes it so you can't even
>>>> navigate to the other programming guides anymore
>>>>
>>>> I am looking forward to collaborating with the community and improving
>>>> the docs to 1. delight existing users and 2. attract new users.  Docs are a
>>>> "website problem" and we're big data people, but I'm confident we'll be
>>>> able to work together and find a good path forward here.
>>>>
>>>>
>>>> On Wed, Jun 5, 2024 at 3:22 PM Neil Ramaswamy <n...@ramaswamy.org>
>>>> wrote:
>>>>
>>>>> Thanks all for the responses. Let me try to address everything.
>>>>>
>>>>> > the programming guides are also different between versions since
>>>>> features are being added, configs are being added/ removed/ changed,
>>>>> defaults are being changed etc.
>>>>>
>>>>> I agree that this is the case. But I think it's fine to mention what
>>>>> version a feature is available in. In fact, I would argue that mentioning
>>>>> an improvement that a version brings motivates users to upgrade more than
>>>>> keeping docs improvement to "new releases to keep the community updating".
>>>>> Users should upgrade to get a better Spark, not better Spark 
>>>>> documentation.
>>>>>
>>>>> > having a programming guide that refers to features or API methods
>>>>> that do not exist in that version is confusing and detrimental
>>>>>
>>>>> I don't think that we'd do this. Again, programming guides should
>>>>> teach fundamentals that do not change version-to-version. TypeScript
>>>>> <https://www.typescriptlang.org/docs/handbook/typescript-from-scratch.html>
>>>>> (which has one of the best DX's and docs) does this exceptionally well.
>>>>> Their guides are refined, versionless pages; new features are elaborated
>>>>> upon in release notes (analogous to our version-specific docs), and the
>>>>> occasional version-specific caveat is called out in the guides.
>>>>>
>>>>>  I agree with Wenchen's 3 points. I don't think we need to say that
>>>>> they *have* to go to the old page, but that if they want to, they can.
>>>>>
>>>>> Neil
>>>>>
>>>>> On Wed, Jun 5, 2024 at 12:04 PM Wenchen Fan <cloud0...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I agree with the idea of a versionless programming guide. But one
>>>>>> thing we need to make sure of is we give clear messages for things that
>>>>>> are only available in a new version. My proposal is:
>>>>>>
>>>>>>    1. keep the old versions' programming guide unchanged. For
>>>>>>    example, people can still access
>>>>>>    https://spark.apache.org/docs/3.3.4/quick-start.html
>>>>>>    2. In the new versionless programming guide, we mention at the
>>>>>>    beginning that for Spark versions before 4.0, go to the versioned doc
>>>>>>    site to read the programming guide.
>>>>>>    3. Revisit the programming guide of Spark 4.0 (compare it with
>>>>>>    the one of 3.5), and adjust the content to mention version-specific
>>>>>>    changes (API change, new features, etc.)
>>>>>>
>>>>>> Then we can have a versionless programming guide starting from Spark
>>>>>> 4.0. We can also revisit programming guides of all versions and combine
>>>>>> them into one with version-specific notes, but that's probably too much
>>>>>> work.
>>>>>>
>>>>>> Any thoughts?
>>>>>>
>>>>>> Wenchen
>>>>>>
>>>>>> On Wed, Jun 5, 2024 at 1:39 AM Martin Andersson <
>>>>>> martin.anders...@kambi.com> wrote:
>>>>>>
>>>>>>> While I have no practical knowledge of how documentation is
>>>>>>> maintained in the spark project, I must agree with Nimrod. For users on
>>>>>>> older versions, having a programming guide that refers to features or
>>>>>>> API methods that do not exist in that version is confusing and
>>>>>>> detrimental.
>>>>>>>
>>>>>>> Surely there must be a better way to allow updating documentation
>>>>>>> more often?
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Martin
>>>>>>>
>>>>>>> ------------------------------
>>>>>>> *From:* Nimrod Ofek <ofek.nim...@gmail.com>
>>>>>>> *Sent:* Wednesday, June 5, 2024 08:26
>>>>>>> *To:* Neil Ramaswamy <n...@ramaswamy.org>
>>>>>>> *Cc:* Praveen Gattu <praveen.ga...@databricks.com.invalid>; dev <
>>>>>>> dev@spark.apache.org>
>>>>>>> *Subject:* Re: [DISCUSS] Versionless Spark Programming Guide
>>>>>>> Proposal
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi Neil,
>>>>>>>
>>>>>>>
>>>>>>> While you wrote you don't mean the api docs (of course), the
>>>>>>> programming guides are also different between versions since features
>>>>>>> are being added, configs are being added/ removed/ changed, defaults
>>>>>>> are being changed etc.
>>>>>>>
>>>>>>> I know of "backport hell" - which is why I wrote that once a version
>>>>>>> is released it's frozen and the documentation will be updated for the
>>>>>>> new version only.
>>>>>>>
>>>>>>> I think of it as facing forward and keeping older versions but
>>>>>>> focusing on the new releases to keep the community updating.
>>>>>>> While Spark has a support window of 18 months until EOL, we can have
>>>>>>> only a 6-month support cycle until EOL for documentation - there are no
>>>>>>> major security concerns for documentation...
>>>>>>>
>>>>>>> Nimrod
>>>>>>>
>>>>>>> On Wed, Jun 5, 2024 at 08:28, Neil Ramaswamy <
>>>>>>> n...@ramaswamy.org> wrote:
>>>>>>>
>>>>>>> Hi Nimrod,
>>>>>>>
>>>>>>> Quick clarification: my proposal will not touch API-specific
>>>>>>> documentation for the specific reasons you mentioned (signatures,
>>>>>>> behavior, etc.). It just aims to make the *programming guides*
>>>>>>> versionless. Programming guides should teach fundamentals of Spark, and
>>>>>>> the fundamentals of Spark should not change between releases.
>>>>>>>
>>>>>>> There are a few issues with updating documentation multiple times
>>>>>>> after Spark releases. First, fixes that apply to all existing versions'
>>>>>>> programming guides need backport PRs. For example, this change
>>>>>>> <https://github.com/apache/spark/pull/46797/files> applies to all
>>>>>>> the versions of the SS programming guide, but is likely to be fixed
>>>>>>> only in Spark 4.0. Additionally, any such update within a Spark release
>>>>>>> will require re-building the static sites in the spark repo, and copying
>>>>>>> those files to spark-website via a commit in spark-website. Making a typo
>>>>>>> fix like the one I linked would then require <number of versions we want
>>>>>>> to update> + 1 PRs, as opposed to 1 PR in the versionless programming
>>>>>>> guide world.
>>>>>>>
>>>>>>> Neil
>>>>>>>
>>>>>>> On Tue, Jun 4, 2024 at 1:32 PM Nimrod Ofek <ofek.nim...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> While I think that the documentation needs a lot of improvement and
>>>>>>> important details are missing - and detaching the documentation from the
>>>>>>> main project can help iterating faster on documentation specific tasks,
>>>>>>> I don't think we can nor should move to versionless documentation.
>>>>>>>
>>>>>>> Documentation is version specific: parameters are added and removed,
>>>>>>> new features are added, behaviours sometimes change etc.
>>>>>>>
>>>>>>> I think the documentation should be version specific- but separate
>>>>>>> from spark release cadence - and can be updated multiple times after
>>>>>>> spark release.
>>>>>>> The way I see it is that the documentation should be updated only
>>>>>>> for the latest version and some time before a new release should be
>>>>>>> archived and the updated documentation should reflect the new version.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Nimrod
>>>>>>>
>>>>>>> On Tue, Jun 4, 2024 at 18:34, Praveen Gattu
>>>>>>> <praveen.ga...@databricks.com.invalid> wrote:
>>>>>>>
>>>>>>> +1. This helps for greater velocity in improving docs. However, we
>>>>>>> might still need a way to provide version-specific information,
>>>>>>> i.e. what features are available in which version etc.
>>>>>>>
>>>>>>> On Mon, Jun 3, 2024 at 3:08 PM Neil Ramaswamy <n...@ramaswamy.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>
>>>>>>> I've written up a proposal to migrate all the Apache Spark
>>>>>>> programming guides to be versionless. You can find the proposal here
>>>>>>> <https://docs.google.com/document/d/1OqeQ71zZleUa1XRZrtaPDFnJ-gVJdGM80o42yJVg9zg/>.
>>>>>>> Please leave comments, or reply in this DISCUSS thread.
>>>>>>>
>>>>>>> TLDR: by making the programming guides versionless, we can make
>>>>>>> updates to them whenever we'd like, instead of at the Spark release
>>>>>>> cadence. This increased update velocity will enable us to make gradual
>>>>>>> improvements, including breaking up the Structured Streaming programming
>>>>>>> guide into smaller sub-guides. The proposal does not break *any*
>>>>>>> existing URLs, and it does not affect our versioned API docs in any way.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Neil
>>>>>>>
>>>>>>
>>
