> > While I have some sympathy for this argument, I also think it is holding
> > us back a lot. Catering to the least common denominator of no-JS means that
> > our web pages are stuck permanently in web 1.0, with no search
> > functionality, intra-page links, subproof expansion, MathML rendering,
> > go-to-definition for symbols, or a laundry list of other features.
> > Furthermore, doing the rendering client-side means that the cost of
> > transmission is lower, which means shorter load times and no need for a
> > two-hour processing phase.
>
> What I want is *a* way for people to access the basic information
> without *requiring* client-side JavaScript. I see no reason that it should
> be the *only* way we provide the information; we already provide it in
> multiple forms. I just want to retain that capability. So you don't need
> to feel held back :-).
>

In case it wasn't clear (and it probably wasn't): in a future world in
which the default mechanism is some super fancy client-side renderer, it is
still possible to offer a fallback "static site" of HTML pages which is
nevertheless not generated in advance but is served dynamically
server-side. This has a relatively small generation cost even if we don't
stick a CDN in front of it, but if we do, then I think that's a viable
approach to having all the HTML pages with basically zero preprocessing,
just a smarter server. I would really love it if the only thing the site
build needs to do is download and load the latest set.mm.
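
To make this concrete, here is a rough sketch (TypeScript on Node, using
only the built-in http module) of what such a "smarter server" could look
like. loadDatabase and renderTheoremHtml are hypothetical stand-ins for a
real .mm parser and renderer, and the URL scheme is made up; the point is
just that each page is rendered on first request, and the Cache-Control
header lets nginx or a CDN keep it around afterwards.

    // Hypothetical on-demand renderer: load set.mm once at startup, then
    // render each theorem page the first time somebody asks for it.
    import * as http from "http";

    // Placeholder stand-ins for a real .mm parser and renderer.
    type Database = { /* parsed statements, proofs, ... */ };

    async function loadDatabase(path: string): Promise<Database> {
      // A real implementation would read and parse the file at `path`.
      return {};
    }

    function renderTheoremHtml(db: Database, label: string): string | null {
      // A real implementation would look up `label` in `db` and emit the
      // full theorem page, returning null for unknown labels.
      return `<html><body><h1>${label}</h1></body></html>`;
    }

    async function main() {
      const db = await loadDatabase("set.mm"); // the only "site build" step

      http.createServer((req, res) => {
        const m = /^\/thm\/([A-Za-z0-9._-]+)\.html$/.exec(req.url ?? "");
        const html = m ? renderTheoremHtml(db, m[1]) : null;
        if (html === null) {
          res.writeHead(404).end("not found");
          return;
        }
        res.writeHead(200, {
          "Content-Type": "text/html; charset=utf-8",
          // Let nginx / a CDN keep the rendered page around.
          "Cache-Control": "public, max-age=86400",
        });
        res.end(html);
      }).listen(8080);
    }

    main();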


> Indeed, you can implement rendering with client-side JavaScript today,
> just download the .mm files.
>

I don't think this is a very competitive option though, since you would
have to load all of set.mm. The advantage of pre-processed JSON files is
that when you have a single theorem you want to look at, you don't need to
request everything and start up a whole client-side MM verifier first.
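
For example, a client could do something like the following (the URL layout
and the shape of the JSON are invented for illustration; the point is that
it fetches a few kilobytes instead of the whole database):

    // Hypothetical client-side lookup of a single pre-processed theorem.
    interface TheoremJson {
      label: string;
      statement: string;
      proofSteps: { hyps: number[]; ref: string; expr: string }[];
    }

    async function fetchTheorem(label: string): Promise<TheoremJson> {
      const resp = await fetch(`/json/${encodeURIComponent(label)}.json`);
      if (!resp.ok) throw new Error(`no such theorem: ${label}`);
      return resp.json();
    }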

> The current CORS setting means that any client-side JavaScript program can
> do that, it doesn't even need to be hosted on us.metamath.org. So
> technically people today can view the generated HTML on the website, *or*
> use any client-side JavaScript/WebAssembly. We "just" need someone to write
> the client-side renderer.


An in-browser .mm browser and proof assistant is a very cool project, but
somewhat distinct from what I'm talking about, which is a replacement for
the existing proof docs. That's not to say that we shouldn't have such a
thing available as well on metamath.org as a kind of "try it online" mode.
But the heavy startup cost will definitely make it feel like a different
experience.


> Also, almost everything you listed doesn't require client-side JavaScript.
> Intra-page links work (they're just hyperlinks), subproof expansion works
> (use HTML <details>), MathML ships in Chromium & Chrome & Firefox
> (Firefox's support is only partial but it's probably more than enough for
> us) per <https://developer.mozilla.org/en-US/docs/Web/MathML>,
> go-to-definition is just another hyperlink, and so on.
>

While it is true that you can do almost all of that with static HTML, it
would make the pages far heavier than they currently are. Load times for
large proof pages are already pretty bad, and this would make them a lot
worse. Not to mention that more text means more pre-processing time during
the site build.
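
For a sense of why, here is roughly what a static <details>-based subproof
would have to look like (sketched as a TypeScript template; the markup is
illustrative, not the output of any existing tool). Every collapsed
subproof still ships its full rendered text inside the page, whether or not
the reader ever expands it:

    // Illustrative only: static subproof expansion via <details> embeds
    // the complete text of every subproof in the page up front.
    function renderSubproof(label: string, steps: string[]): string {
      const items = steps.map((s) => `<li>${s}</li>`).join("");
      return `<details><summary>Subproof ${label} (${steps.length} steps)</summary>` +
        `<ol>${items}</ol></details>`;
    }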

> Basic searches can be supported via a form that queries a search engine
> (e.g., Google).
>

Oof. I'm sure you know that this is not a very good option for a host of
reasons. It's basically equivalent to not having a search mechanism at all,
since Google already indexes the site.


> As far as speed goes, JavaScript is often slower than HTML where both can
> be used. In practice JavaScript files are often quite large, and JavaScript
> has to be parsed, examined, executed, and then its execution must cause
> document tree changes which finally get rendered. Web browsers are *really*
> good at rapidly & progressively rendering HTML. On systems I've worked
> with, HTML (even with server roundtrips) is often *MUCH* faster when
> measuring paint-to-screen times. JavaScript *has* gotten faster over the
> years, thankfully, but the gap is still large. Of course, when you need
> functionalities only provided by JavaScript then you need it. Like all
> technologies there are tradeoffs.
>

I am imagining that the main purpose of the JS here is to pump up a basic
version of the page into the fully interactive / hyperlinked version of the
page, without having to put loads of redundant text in all the tags. There
is no need for a heavy JS library here; I did essentially this for the MM0
docs (https://digama0.github.io/mm0/peano/thms/expr.html), and the JS
support library for it (https://digama0.github.io/mm0/peano/proof.js) is
very light and does not need to come with every page. (There is still more
that can be done to make the MM0 docs more interactive, but it helps a lot
if there is an actual server on the other end rather than just a static
site.)
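
Concretely, the kind of enhancement script I have in mind is something like
this (the class name, the /defs.json endpoint, and the data shape are all
invented for illustration): the static page carries plain spans for
symbols, and one small shared script upgrades them into go-to-definition
links after it loads.

    // Sketch of a lightweight shared enhancement script, in the spirit of
    // proof.js: the static page ships bare <span class="sym"> tags, and
    // this script turns them into definition links once the page loads.
    async function enhanceSymbols(): Promise<void> {
      // One small shared lookup table instead of repeating the target URL
      // in every tag of every page.
      const defs: Record<string, string> =
        await (await fetch("/defs.json")).json();

      const spans = Array.from(
        document.querySelectorAll<HTMLSpanElement>("span.sym"));
      for (const span of spans) {
        const target = defs[span.textContent ?? ""];
        if (!target) continue;
        const link = document.createElement("a");
        link.href = target;
        link.textContent = span.textContent;
        span.replaceWith(link);
      }
    }

    document.addEventListener("DOMContentLoaded", () => void enhanceSymbols());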


> > Is there much reason not to create JSON in addition to HTML? If disk
> > space is a concern, consider that it can be a win-win to store
> > precompressed pages (i.e. with gzip).
>
> No *big* reason. Indeed, if we want to make it *easy* for people to
> generate their own renderers or analyze Metamath results without
> specialized parsing, the obvious way to do that would be to generate JSON.
> I think that'd make a lot of sense. That still involves a daily pre-run,
> but I don't think there's any serious problem in doing 1/day regenerations.
>
> We'd have to figure out specifics, which means we need to figure out how
> they'd be used. If we generate a separate JSON file for each theorem, then
> it's really easy for a renderer to get just what it needs. However, that's
> a lot of files, and compression might not help our storage needs. The
> filesystem block size is 4KiB, so anything under 4KiB still occupies 4KiB. We
> could generate a single massive JSON file of all theorems; that might be
> useful for ML training if they plan to process the JSON further, but the
> file would be huge so I think a client-side program would be better off
> loading the .mm database directly instead. A massive JSON file might be
> useful for other purposes, and that *would* be best stored compressed.
>
> We currently don't have *lots* of extra storage space. If we add a major
> new set of generated data it's likely we'll need more space. We could
> retire mpegif, add more storage ($ that I currently pay, but maybe others
> could chip in), or look for another provider ($?).
>

Just to say once more: our current setup allows us to run a program on the
server, and this is important. It means that we can generate all files on
demand and use nginx only for caching, so that the cost of generation
(which should be a few milliseconds if everything is in memory already) is
paid only once. That applies to the HTML pages as well as to any JSON API
endpoints we add.
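
As a sketch of what such an endpoint could look like on the server side
(renderTheoremJson and the route are hypothetical, as in the earlier HTML
sketch; the interesting part is the Cache-Control header that lets nginx
remember the response):

    import type { ServerResponse } from "http";

    // Hypothetical stand-ins, as in the HTML sketch above.
    type Database = {};
    function renderTheoremJson(db: Database, label: string): string | null {
      // A real implementation would serialize the statement and proof,
      // returning null for unknown labels.
      return JSON.stringify({ label });
    }

    function handleTheoremJson(db: Database, label: string, res: ServerResponse) {
      const body = renderTheoremJson(db, label);
      if (body === null) {
        res.writeHead(404).end();
        return;
      }
      res.writeHead(200, {
        "Content-Type": "application/json",
        // nginx (or a CDN) caches this, so the few-millisecond generation
        // cost is paid once per theorem.
        "Cache-Control": "public, max-age=86400",
      });
      res.end(body);
    }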


> What started this thread was trying to see if we could get "free" training
> on GPT-3 by presenting something easier for them to consume, *without*
> controlling them. I don't think generating JSON will do that. The JSON
> certainly could make it easier for organizations who want to specially
> train ML systems on the existing databases, and are willing to do "a little
> cleaning" by processing the JSON. If we pregenerate JSON from the .mm
> databases, the most likely use cases would be to make it easier for
> rendering tools & ML labs that want to do a little (but not a lot) of work.
>

Has anyone asked for this? Will we be able to see the results of the work
and know that it was useful? It seems very strange to me to spend such a
significant fraction of our time, processing power, and space on something
that is effectively intended just for web crawlers, with only loose
constraints on the actual formats involved.

If we really want to use plain text for ML, why not feed the robots .mmp
proofs instead? At least that way the stuff they regurgitate might
potentially be useful as a poor version of an ATP. (I'm pretty skeptical
even of that; I think it is most likely only to be good for a laugh. An ML
model trained for actual logical correctness needs a much better data
source and architecture than just a transformer trained on text from the
internet. That's not to say that you can't build such ML models; GPT-f is
one of them, but these models work in a much more controlled environment,
like "guess the next theorem to apply" rather than "write something that
looks like a proof".)
