Yes, indeed, one or two llms.txt index manifests wouldn’t hurt, especially if they facilitate LLM searches. It is not a standard yet, but it’s gaining attention: https://directory.llmstxt.cloud/

Cheers 
Jules 

On Sep 10, 2025, at 4:11 PM, Hyukjin Kwon <gurwls...@apache.org> wrote:


I am +1 if we're sure that it's adding one or only a few files.

On Thu, 11 Sept 2025 at 06:53, Denny Lee <denny.g....@gmail.com> wrote:
While it is not a standard per se, it is quickly becoming a common approach.  And as you noted about the MCP site, they have the llms-full.txt; they also have 


On Wed, Sep 10, 2025 at 14:48 Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
The llms.txt protocol is not a standard yet.

"To clarify, llms.txt is not meant to be a duplication of the full documentation."
Some sites, like the Model Context Protocol (MCP) site, put their full web pages in the llms page.



On Wed, Sep 10, 2025 at 22:27, Allison Wang <allison.w...@databricks.com.invalid> wrote:

Thanks Dongjoon for raising these concerns. I agree with your point that it’s worth making the lightweight manifest scope explicit in the SPIP so we have a systematic guarantee it stays small (under 10MB).

To clarify, llms.txt is not meant to be a duplication of the full documentation. Instead, it acts more like an index or table of contents page: a small, curated manifest that points to existing canonical docs. The intent is to help AI-assisted tools and LLMs discover the right entry points, not to repackage the entire documentation set.
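
A minimal sketch of what such an index-style manifest could look like, assuming the commonly used llms.txt layout (an H1 title, a blockquote summary, then H2 sections of markdown links). The `DocLink` and `render_manifest` names are hypothetical, and the two entries are illustrative only; the actual list of links would be curated in the SPIP.

```python
from typing import NamedTuple


class DocLink(NamedTuple):
    """One curated pointer to an existing canonical docs page (hypothetical type)."""
    title: str
    url: str
    note: str


def render_manifest(name: str, summary: str, sections: dict[str, list[DocLink]]) -> str:
    """Render a link-only markdown manifest: H1 title, blockquote summary, H2 link lists."""
    lines = [f"# {name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for link in links:
            # Each entry only points at a page; no documentation text is copied.
            lines.append(f"- [{link.title}]({link.url}): {link.note}")
        lines.append("")
    return "\n".join(lines)


# Illustrative entries only, not the proposed content of Spark's llms.txt.
EXAMPLE = render_manifest(
    "Apache Spark",
    "Entry points into the official Spark documentation.",
    {
        "Docs": [
            DocLink(
                "SQL Programming Guide",
                "https://spark.apache.org/docs/latest/sql-programming-guide.html",
                "DataFrames and Spark SQL",
            ),
            DocLink(
                "PySpark API",
                "https://spark.apache.org/docs/latest/api/python/index.html",
                "Python API reference",
            ),
        ]
    },
)
```

Because every entry is a single link line, the manifest grows with the number of curated entry points rather than with the size of the pages they point to.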

For example, DuckDB's llms.txt file is around 30KB in size. Spark’s manifests will likely be a bit larger given the broader scope of APIs and documentation, but they should still remain lightweight link-only markdown files and well under the 10MB limit, even across multiple versions and language scopes.
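
One way the "link-only" property could be checked mechanically is sketched below. It assumes a manifest made only of blank lines, headings, blockquote lines, and `- [title](url)` bullets with an optional `: note` suffix; the `non_link_lines` helper and the `LINK_ITEM` pattern are hypothetical and not part of the SPIP.

```python
import re

# A bullet of the form "- [title](http(s)://url)" with an optional ": note" suffix.
LINK_ITEM = re.compile(r"^- \[[^\]]+\]\(https?://[^)\s]+\)(: .+)?$")


def non_link_lines(manifest: str) -> list[str]:
    """Return lines that are neither blank, a heading, a blockquote, nor a link bullet."""
    offending = []
    for line in manifest.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(("#", ">")):
            continue  # structural lines are allowed
        if not LINK_ITEM.match(stripped):
            offending.append(line)  # e.g. a pasted paragraph of documentation
    return offending
```

An empty result would mean the file contains no copied documentation prose, only pointers to it.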


On Wed, Sep 10, 2025 at 8:47 AM Wenchen Fan <cloud0...@gmail.com> wrote:
So this should just be an LLM-facing index page of the Spark docs? Given the number of APIs Spark provides today, I think this index page should be useful to humans as well.

On Wed, Sep 10, 2025 at 10:46 PM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
Thank you, Allison and Hyukjin.

IIUC, this proposal is not about a single file. The SPIP already exposes multiple files, which may double our documentation and website size (or more in the worst case) because it's simply a duplication of the content. If we start to use AI tools to generate these llms.txt files, they could be much bigger than the original.

*** From SPIP ***
***

Since the Apache Spark 4.1.0-preview1 documentation is 1.2GB, could you propose a systematic way to always keep the total size of the newly added llms.txt files under 10MB, Allison? If we don't have full control, this duplication will break the ASF Spark website like last year. We already had to archive old Spark documents from the original website location to "https://archive.apache.org/dist/spark/" due to the CI outage.

$ du -h 4.1.0-preview1 | tail -n1
1.2G 4.1.0-preview1

The bottom line is that we need to have a clear hard limit for this newly proposed duplication for machine-friendly metadata. If we have a systematic way to control the upper bound which is less than 10MB per Spark version in total (now and forever), it sounds like a good addition.
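
A rough sketch of how such an upper bound could be enforced per Spark version, for example as a docs-build check. It assumes the manifests are named `llms*.txt` somewhere under the version's docs directory and reads "10MB" as 10 * 1024 * 1024 bytes; `manifest_bytes`, `within_budget`, and the naming pattern are hypothetical, not something the SPIP defines.

```python
from pathlib import Path

# Assumption: "10MB" is interpreted as 10 * 1024 * 1024 bytes.
LIMIT_BYTES = 10 * 1024 * 1024


def manifest_bytes(version_dir: Path) -> int:
    """Total size in bytes of all llms*.txt files under one version's docs directory."""
    return sum(
        p.stat().st_size
        for p in version_dir.rglob("llms*.txt")  # assumed naming pattern
        if p.is_file()
    )


def within_budget(version_dir: Path, limit: int = LIMIT_BYTES) -> bool:
    """True if the manifests for this version stay at or under the size limit."""
    return manifest_bytes(version_dir) <= limit
```

Calling `within_budget(Path("4.1.0-preview1"))` from the build and failing when it returns False would turn the limit into a hard gate rather than a convention.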

Thanks,
Dongjoon.


On Tue, Sep 9, 2025 at 7:19 PM Allison Wang <allisonw...@apache.org> wrote:
Yes, that’s right. It’s essentially just one markdown file to start with, and we can add more later for language- or version-specific files if needed.

On Tue, Sep 9, 2025 at 4:32 PM Hyukjin Kwon <gurwls...@apache.org> wrote:
So it's basically adding one text file for LLMs, right? I think it's a good idea.

On Tue, 9 Sept 2025 at 10:22, Allison Wang <allisonw...@apache.org> wrote:
Hi all, 

I’d like to propose adding llms.txt files to the Spark documentation.

As more users rely on AI-assisted tools and LLMs to learn, write Spark code, and troubleshoot issues, it’s increasingly important that these tools point back to the up-to-date official documentation. This will help improve code generation quality and make new Spark features easier to discover. The emerging llms.txt convention provides a lightweight way to curate LLM-friendly manifests of key documentation links. 

Would love to hear your feedback!


Thanks,
Allison


